NBU Master is on RHEL 6.4 x86_64. Affected media servers are Win2k3.
# bperror -problems -hoursago 24 | grep 'tar file write error'
1412286931 1 4 16 fdswp00301-bur 796271 796169 0 FDSWV01423 bpbrm from client FDSWV01423: ERR - tar file write error (14)
1412286931 1 4 16 fdswp00305-bur 796254 796133 0 FDSWV01003 bpbrm from client FDSWV01003: ERR - tar file write error (14)
1412286931 1 4 16 fdswp00301-bur 796270 796157 0 FDSWV01458 bpbrm from client FDSWV01458: ERR - tar file write error (14)
1412286931 1 4 16 fdswp00305-bur 796257 796154 0 FDSWV01505 bpbrm from client FDSWV01505: ERR - tar file write error (14)
1412286934 1 4 16 fdswp00301-bur 796269 796155 0 FDSWV01517 bpbrm from client FDSWV01517: ERR - tar file write error (14)
1412287111 1 4 16 fdswp00302-bur 796255 796151 0 FDSWV01471 bpbrm from client FDSWV01471: ERR - tar file write error (14)
1412290751 1 4 16 fdswp00305-bur 796254 796133 0 FDSWV01003 bpbrm from client FDSWV01003: ERR - tar file write error (14)
1412290751 1 4 16 fdswp00303-bur 796281 796153 0 FDSWV01482 bpbrm from client FDSWV01482: ERR - tar file write error (14)
1412290751 1 4 16 fdswp00305-bur 796257 796154 0 FDSWV01505 bpbrm from client FDSWV01505: ERR - tar file write error (14)
1412290753 1 4 16 fdswp00301-bur 796269 796155 0 FDSWV01517 bpbrm from client FDSWV01517: ERR - tar file write error (14)
{etc}
40 failures in the last 24 hours.
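To see whether the 40 failures cluster on particular media servers or clients, the bperror lines can be tallied. This is only a minimal sketch: it assumes the whitespace-separated column layout shown in the sample above (field 5 is the media server, field 9 is the client), and the function name tally_tar_errors is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: count 'tar file write error' lines per media server
# and per client. Reads bperror output on stdin.
# Assumption: columns are laid out as in the sample above,
#   field 5 = media server, field 9 = client.
tally_tar_errors() {
    awk '/tar file write error/ {
             server[$5]++
             client[$9]++
         }
         END {
             for (s in server) printf "server %s %d\n", s, server[s]
             for (c in client) printf "client %s %d\n", c, client[c]
         }' | LC_ALL=C sort   # awk array order is unspecified, so sort the result
}
```

It would be used as: bperror -problems -hoursago 24 | tally_tar_errors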
I can view a job log with bpdbjobs. Picking one at random (so far, they all look very much alike):
# /usr/openv/netbackup/bin/admincmd/bpdbjobs -jobid 796416 -all_columns | ./reformat
10/02/2014 19:43:21 - Info nbjm(pid=18149) starting backup job (jobid=796416) for client FDSWV00373, policy WIN_VM_305, schedule WIN_VMW_MNTH
10/02/2014 19:43:21 - estimated 16226320 kbytes needed
10/02/2014 19:43:21 - Info nbjm(pid=18149) started backup (backupid=FDSWV00373_1412293401) job for client FDSWV00373, policy WIN_VM_305, schedule WIN_VMW_MNTH on storage unit 305-PureDisk
10/02/2014 19:43:23 - Info bpbrm(pid=6980) FDSWV00373 is the host to backup data from
10/02/2014 19:43:23 - Info bpbrm(pid=6980) reading file list for client
10/02/2014 19:43:23 - Info bpbrm(pid=6980) accelerator enabled
10/02/2014 19:43:24 - Info bpbrm(pid=6980) starting bpbkar32 on client
10/02/2014 19:43:22 - started process bpbrm (6980)
10/02/2014 19:43:24 - connecting
10/02/2014 19:43:24 - connected; connect time: 000:00:00
10/02/2014 19:43:24 - Info bpbkar32(pid=8364) Backup started
10/02/2014 19:43:24 - Info bpbkar32(pid=8364) accelerator enabled backup, archive bit processing:<disabled>
10/02/2014 19:43:25 - Info bptm(pid=8344) start
10/02/2014 19:43:25 - Info bptm(pid=8344) using 61440 data buffer size
10/02/2014 19:43:25 - Info bptm(pid=8344) setting receive network buffer to 246784 bytes
10/02/2014 19:43:25 - Info bptm(pid=8344) using 16 data buffers
10/02/2014 19:43:25 - Info bptm(pid=8344) start backup
10/02/2014 19:43:27 - begin writing
10/02/2014 19:44:05 - Info bpbkar32(pid=8364) INF - Transport Type = san
10/02/2014 19:45:10 - Critical bptm(pid=8344) include image failed
10/02/2014 19:45:10 - Critical bptm(pid=8344) image write failed: error 2060001: one or more invalid arguments
10/02/2014 19:46:23 - Error bptm(pid=8344) cannot write image to disk, Invalid argument
10/02/2014 19:46:23 - Info bptm(pid=8344) EXITING with status 84 <----------
10/02/2014 19:46:23 - Error bpbrm(pid=6980) from client FDSWV00373: ERR - tar file write error (14)
10/02/2014 19:46:25 - Info bpbkar32(pid=8364) accelerator sent 0 bytes out of 0 bytes to server, optimization 0.0%
10/02/2014 19:46:25 - Info bpbkar32(pid=8364) bpbkar waited 5 times for empty buffer, delayed 4394 times.
10/02/2014 19:46:26 - Info fdswp00305-bur(pid=8344) StorageServer=PureDisk:fdswp00305-bur; Report=PDDO Stats for (fdswp00305-bur): scanned: 134414 KB, CR sent: 10118 KB, CR sent over FC: 0 KB, dedup: 92.5%, cache disabled
10/02/2014 19:46:26 - Error bpbrm(pid=6980) could not send server status message
10/02/2014 19:46:26 - Critical bpbrm(pid=6980) unexpected termination of client FDSWV00373
10/02/2014 19:46:27 - Info bpbkar32(pid=0) done. status: 84: media write error
10/02/2014 19:46:27 - end writing; write time: 000:03:00
-
10/02/2014 20:01:27 - Info nbjm(pid=18149) starting backup job (jobid=796416) for client FDSWV00373, policy WIN_VM_305, schedule WIN_VMW_MNTH
10/02/2014 20:01:27 - Info nbjm(pid=18149) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=796416, request id:{6BAD9324-4A90-11E4-8CC2-51EAD643EE46})
10/02/2014 20:01:27 - requesting resource 305-PureDisk
10/02/2014 20:01:27 - requesting resource fdsup00001nbu.NBU_CLIENT.MAXJOBS.FDSWV00373
10/02/2014 20:01:27 - requesting resource fdsup00001nbu.NBU_POLICY.MAXJOBS.WIN_VM_305
10/02/2014 20:01:29 - Info nbrb(pid=18089) Limit has been reached for the logical resource fdsup00001nbu.NBU_POLICY.MAXJOBS.WIN_VM_305
10/02/2014 20:20:10 - Info bpbrm(pid=9092) FDSWV00373 is the host to backup data from
10/02/2014 20:20:10 - Info bpbrm(pid=9092) reading file list for client
10/02/2014 20:20:10 - Info bpbrm(pid=9092) accelerator enabled
10/02/2014 20:20:08 - granted resource fdsup00001nbu.NBU_CLIENT.MAXJOBS.FDSWV00373
10/02/2014 20:20:08 - granted resource fdsup00001nbu.NBU_POLICY.MAXJOBS.WIN_VM_305
10/02/2014 20:20:08 - granted resource MediaID=@aaaap;DiskVolume=PureDiskVolume;DiskPool=305-PureDisk;Path=PureDiskVolume;StorageServer=fdswp00305-bur;MediaServer=fdswp00305-bur
10/02/2014 20:20:08 - granted resource 305-PureDisk
10/02/2014 20:20:08 - estimated 16226320 kbytes needed
10/02/2014 20:20:08 - Info nbjm(pid=18149) started backup (backupid=FDSWV00373_1412295608) job for client FDSWV00373, policy WIN_VM_305, schedule WIN_VMW_MNTH on storage unit 305-PureDisk
10/02/2014 20:20:09 - started process bpbrm (9092)
10/02/2014 20:20:11 - Info bpbrm(pid=9092) starting bpbkar32 on client
10/02/2014 20:20:11 - Info bpbkar32(pid=8728) Backup started
10/02/2014 20:20:11 - Info bpbkar32(pid=8728) accelerator enabled backup, archive bit processing:<disabled>
10/02/2014 20:20:11 - Info bptm(pid=9152) start
10/02/2014 20:20:12 - Info bptm(pid=9152) using 61440 data buffer size
10/02/2014 20:20:12 - Info bptm(pid=9152) setting receive network buffer to 246784 bytes
10/02/2014 20:20:12 - Info bptm(pid=9152) using 16 data buffers
10/02/2014 20:20:11 - connecting
10/02/2014 20:20:11 - connected; connect time: 000:00:00
10/02/2014 20:20:12 - Info bptm(pid=9152) start backup
10/02/2014 20:20:18 - begin writing
10/02/2014 20:20:51 - Info bpbkar32(pid=8728) INF - Transport Type = san
10/02/2014 20:22:25 - Info bptm(pid=9152) waited for full buffer 2037 times, delayed 3985 times
10/02/2014 20:22:26 - Info bpbkar32(pid=8728) accelerator sent 1115131392 bytes out of 14165895168 bytes to server, optimization 92.1%
10/02/2014 20:22:26 - Info bpbkar32(pid=8728) bpbkar waited 548 times for empty buffer, delayed 1106 times.
10/02/2014 20:22:42 - Info bptm(pid=9152) EXITING with status 0 <----------
10/02/2014 20:22:42 - Info fdswp00305-bur(pid=9152) StorageServer=PureDisk:fdswp00305-bur; Report=PDDO Stats for (fdswp00305-bur): scanned: 13841369 KB, CR sent: 488111 KB, CR sent over FC: 0 KB, dedup: 96.5%, cache disabled
10/02/2014 20:22:43 - Info bpbrm(pid=9092) validating image for client FDSWV00373
10/02/2014 20:22:45 - Info bpbkar32(pid=8728) done. status: 0: the requested operation was successfully completed
10/02/2014 20:22:45 - end writing; write time: 000:02:27
The problem is, I don’t know what I am looking at. It reads as if each job ran, failed (status 84), ran again, and succeeded (status 0). Is that right? The first attempt looks like the second, except that the ‘granted resource’ messages are missing from it.
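One way to check that reading is to reduce the job log to one line per attempt. The sketch below assumes the log format shown above ("MM/DD/YYYY HH:MM:SS - message"), that every attempt opens with nbjm's "starting backup job" line, and that it closes with a "done. status: N" line. The function name summarize_attempts is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: summarize each attempt found in a job's detailed log.
# Reads the reformatted job log on stdin.
# Assumptions: an attempt begins at "starting backup job (jobid=" and its
# final status is reported on a "done. status: N: text" line.
summarize_attempts() {
    awk '/starting backup job \(jobid=/ {
             attempt++
             start[attempt] = $2               # time of day the attempt began
         }
         /done\. status: / {
             s = $0
             sub(/.*done\. status: /, "", s)   # keep e.g. "84: media write error"
             status[attempt] = s
         }
         END {
             for (i = 1; i <= attempt; i++)
                 printf "attempt %d started %s status %s\n", i, start[i], status[i]
         }'
}
```

Fed the log for job 796416 above, this reports two attempts: the first started at 19:43:21 and ended with status 84, the second started at 20:01:27 and ended with status 0.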
‘bplist’ shows a catalog entry for 10/02 at 20:20, which lines up with the second attempt:
# bplist -C FDSWV00373 -k WIN_VM_305 -b -l -unix_files "/C/"
d--------- user group 0 Oct 02 20:20 /C/
d--------- user group 0 Oct 01 21:57 /C/
d--------- user group 0 Sep 30 21:31 /C/
d--------- user group 0 Sep 29 21:52 /C/
{etc}
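The catalog entry can also be tied back to the job log through the backupid. The sketch below assumes that the numeric suffix of a backupid is the backup's start time in Unix epoch seconds, which is consistent with both backupids in the log above, and that GNU date is available (as it is on RHEL). The function name backupid_time is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the start time encoded in a backupid.
# Assumptions: the backupid looks like CLIENT_EPOCHSECONDS, and the
# local `date` is GNU date (supports -d @SECONDS).
backupid_time() {
    # Strip everything up to and including the last underscore.
    local epoch="${1##*_}"
    # Output uses the same MM/DD/YYYY HH:MM:SS layout as the job log,
    # in whatever time zone the TZ environment selects.
    date -d "@${epoch}" '+%m/%d/%Y %H:%M:%S'
}
```

For example, backupid_time FDSWV00373_1412295608 prints 10/03/2014 00:20:08 when run with TZ=UTC0. On a server four hours behind UTC that is 10/02/2014 20:20:08, the moment the second attempt was started in the job log.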