I've googled, read the release notes for 7.6.0.3, and read the forums. I could be wrong, but I haven't seen the exact problem we're having.
NBU Version: 7.6.0.2
Master & Media Servers: Linux RHEL
So a host will back up, and it will get a "network connection broken" message:
08/22/2014 22:40:39 - Info bpbrm (pid=15539) rchsp01 is the host to backup data from
08/22/2014 22:40:39 - Info bpbrm (pid=15539) reading file list for client
08/22/2014 22:40:39 - Info bpbrm (pid=15539) starting bpbkar on client
08/22/2014 22:40:39 - Info bpbkar (pid=15546) Backup started
08/22/2014 22:40:39 - Info bpbrm (pid=15539) bptm pid: 15547
08/22/2014 22:40:39 - Info bptm (pid=15547) start
08/22/2014 22:40:39 - Info bptm (pid=15547) using 524288 data buffer size
08/22/2014 22:40:39 - Info bptm (pid=15547) setting receive network buffer to 524288 bytes
08/22/2014 22:40:39 - Info bptm (pid=15547) using 64 data buffers
08/22/2014 23:51:44 - Info nbjm (pid=24060) starting backup job (jobid=851492) for client rchsp01, policy vm_prod, schedule daily_differential
08/22/2014 23:51:44 - Info nbjm (pid=24060) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=851492, request id:{2E12BE56-2A81-11E4-BC68-6423C55F448A})
08/22/2014 23:51:44 - requesting resource rtxdxip01-stg-itg
08/22/2014 23:51:44 - requesting resource kronos.NBU_CLIENT.MAXJOBS.rchsp01
08/22/2014 23:51:44 - requesting resource kronos.NBU_POLICY.MAXJOBS.vm_prod
08/22/2014 23:51:44 - awaiting resource rtxdxip01-stg-itg. Waiting for resources. Reason: Maximum I/O stream count has been reached for the disk volume., Media server: rchnbump2, Robot Type(Number): NONE(N/A), Media ID: N/A, Drive Name: N/A, Volume Pool: NetBackup, Storage Unit: rchnbump2-rtxdxip01, Drive Scan Host: N/A, Disk Pool: rtxdxip01-dp, Disk Volume: rtxdxip01-lsu
08/22/2014 23:52:26 - granted resource kronos.NBU_CLIENT.MAXJOBS.rchsp01
08/22/2014 23:52:26 - granted resource kronos.NBU_POLICY.MAXJOBS.vm_prod
08/22/2014 23:52:26 - granted resource MediaID=@aaaaj;DiskVolume=rtxdxip01-lsu;DiskPool=rtxdxip01-dp;Path=rtxdxip01-lsu;StorageServer=rtxdxip01-ss_167.254.103.89;MediaServer=rchnbump2
08/22/2014 23:52:26 - granted resource rchnbump2-rtxdxip01
08/22/2014 23:52:26 - estimated 2705020 kbytes needed
08/22/2014 23:52:26 - Info nbjm (pid=24060) started backup (backupid=rchsp01_1408769546) job for client rchsp01, policy vm_prod, schedule daily_differential on storage unit rchnbump2-rtxdxip01
08/22/2014 23:52:27 - Info bpbrm (pid=22162) rchsp01 is the host to backup data from
08/22/2014 23:52:27 - Info bpbrm (pid=22162) reading file list for client
08/22/2014 23:52:27 - Info bpbrm (pid=22162) starting bpbkar on client
08/22/2014 23:52:27 - Info bpbkar (pid=22180) Backup started
08/22/2014 23:52:27 - Info bpbrm (pid=22162) bptm pid: 22181
08/22/2014 23:52:27 - started process bpbrm (pid=22162)
08/22/2014 23:52:27 - connecting
08/22/2014 23:52:27 - connected; connect time: 0:00:00
08/22/2014 23:52:28 - Info bptm (pid=22181) start
08/22/2014 23:52:28 - Info bptm (pid=22181) using 524288 data buffer size
08/22/2014 23:52:28 - Info bptm (pid=22181) setting receive network buffer to 524288 bytes
08/22/2014 23:52:28 - Info bptm (pid=22181) using 64 data buffers
08/22/2014 23:52:28 - Info bptm (pid=22181) start backup
08/22/2014 23:52:32 - begin writing
08/22/2014 23:53:18 - Info bpbkar (pid=22180) INF - Transport Type = nbd
08/22/2014 23:57:21 - Info bpbkar (pid=22180) bpbkar waited 146 times for empty buffer, delayed 28237 times
08/22/2014 23:57:22 - Info bptm (pid=22181) waited for full buffer 583 times, delayed 7421 times
08/22/2014 23:57:40 - Info bptm (pid=22181) EXITING with status 0 <----------
08/22/2014 23:57:41 - Info bpbrm (pid=22162) validating image for client rchsp01
08/22/2014 23:57:41 - Info bpbkar (pid=22180) done. status: 0: the requested operation was successfully completed
08/22/2014 23:57:41 - end writing; write time: 0:05:09
the requested operation was successfully completed (0)
When I look in vSphere, I see the following messages:
The next backup that runs fails, and every backup after that fails as well:
08/23/2014 00:00:00 - Info nbjm (pid=24060) starting backup job (jobid=851678) for client rchsp01, policy vm_prod, schedule daily_differential
08/23/2014 00:00:00 - Info nbjm (pid=24060) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=851678, request id:{55F23DD8-2A82-11E4-800B-6FC1A04377BA})
08/23/2014 00:00:00 - requesting resource rtxdxip01-stg-itg
08/23/2014 00:00:00 - requesting resource kronos.NBU_CLIENT.MAXJOBS.rchsp01
08/23/2014 00:00:00 - requesting resource kronos.NBU_POLICY.MAXJOBS.vm_prod
08/23/2014 00:00:01 - Info nbrb (pid=23985) Limit has been reached for the logical resource kronos.NBU_POLICY.MAXJOBS.vm_prod
08/23/2014 00:01:30 - awaiting resource rtxdxip01-stg-itg. Waiting for resources. Reason: Maximum I/O stream count has been reached for the disk volume., Media server: rchnbump1, Robot Type(Number): NONE(N/A), Media ID: N/A, Drive Name: N/A, Volume Pool: NetBackup, Storage Unit: rchnbump1-rtxdxip01, Drive Scan Host: N/A, Disk Pool: rtxdxip01-dp, Disk Volume: rtxdxip01-lsu
08/23/2014 01:57:23 - Info nbrb (pid=23985) Limit has been reached for the logical resource kronos.NBU_POLICY.MAXJOBS.vm_prod
08/23/2014 02:00:33 - awaiting resource rtxdxip01-stg-itg. Waiting for resources. Reason: Maximum I/O stream count has been reached for the disk volume., Media server: rchnbump1, Robot Type(Number): NONE(N/A), Media ID: N/A, Drive Name: N/A, Volume Pool: NetBackup, Storage Unit: rchnbump1-rtxdxip01, Drive Scan Host: N/A, Disk Pool: rtxdxip01-dp, Disk Volume: rtxdxip01-lsu
08/23/2014 02:04:18 - Info nbrb (pid=23985) Limit has been reached for the logical resource kronos.NBU_POLICY.MAXJOBS.vm_prod
08/23/2014 02:05:32 - awaiting resource rtxdxip01-stg-itg. Waiting for resources. Reason: Maximum I/O stream count has been reached for the disk volume., Media server: rchnbump1, Robot Type(Number): NONE(N/A), Media ID: N/A, Drive Name: N/A, Volume Pool: NetBackup, Storage Unit: rchnbump1-rtxdxip01, Drive Scan Host: N/A, Disk Pool: rtxdxip01-dp, Disk Volume: rtxdxip01-lsu
08/23/2014 02:06:37 - Info nbrb (pid=23985) Limit has been reached for the logical resource kronos.NBU_POLICY.MAXJOBS.vm_prod
08/23/2014 02:31:10 - awaiting resource rtxdxip01-stg-itg. Waiting for resources. Reason: Maximum I/O stream count has been reached for the disk volume., Media server: rchnbump2, Robot Type(Number): NONE(N/A), Media ID: N/A, Drive Name: N/A, Volume Pool: NetBackup, Storage Unit: rchnbump2-rtxdxip01, Drive Scan Host: N/A, Disk Pool: rtxdxip01-dp, Disk Volume: rtxdxip01-lsu
08/23/2014 02:31:17 - Info nbrb (pid=23985) Limit has been reached for the logical resource kronos.NBU_POLICY.MAXJOBS.vm_prod
08/23/2014 02:32:37 - awaiting resource rtxdxip01-stg-itg. Waiting for resources. Reason: Maximum I/O stream count has been reached for the disk volume., Media server: rchnbump1, Robot Type(Number): NONE(N/A), Media ID: N/A, Drive Name: N/A, Volume Pool: NetBackup, Storage Unit: rchnbump1-rtxdxip01, Drive Scan Host: N/A, Disk Pool: rtxdxip01-dp, Disk Volume: rtxdxip01-lsu
08/23/2014 02:35:41 - Info nbrb (pid=23985) Limit has been reached for the logical resource kronos.NBU_POLICY.MAXJOBS.vm_prod
08/23/2014 02:37:41 - awaiting resource rtxdxip01-stg-itg. Waiting for resources. Reason: Maximum I/O stream count has been reached for the disk volume., Media server: rchnbump2, Robot Type(Number): NONE(N/A), Media ID: N/A, Drive Name: N/A, Volume Pool: NetBackup, Storage Unit: rchnbump2-rtxdxip01, Drive Scan Host: N/A, Disk Pool: rtxdxip01-dp, Disk Volume: rtxdxip01-lsu
08/23/2014 02:38:40 - Info nbrb (pid=23985) Limit has been reached for the logical resource kronos.NBU_POLICY.MAXJOBS.vm_prod
08/23/2014 02:44:52 - granted resource kronos.NBU_CLIENT.MAXJOBS.rchsp01
08/23/2014 02:44:52 - granted resource kronos.NBU_POLICY.MAXJOBS.vm_prod
08/23/2014 02:44:52 - granted resource MediaID=@aaaaj;DiskVolume=rtxdxip01-lsu;DiskPool=rtxdxip01-dp;Path=rtxdxip01-lsu;StorageServer=rtxdxip01-ss_167.254.103.89;MediaServer=rchnbump2
08/23/2014 02:44:52 - granted resource rchnbump2-rtxdxip01
08/23/2014 02:44:52 - estimated 2705020 kbytes needed
08/23/2014 02:44:52 - begin Parent Job
08/23/2014 02:44:52 - begin VMware: Start Notify Script
08/23/2014 02:44:52 - Info RUNCMD (pid=19512) started
08/23/2014 02:44:52 - Info RUNCMD (pid=19512) exiting with status: 0
Operation Status: 0
08/23/2014 02:44:52 - end VMware: Start Notify Script; elapsed time 0:00:00
08/23/2014 02:44:52 - begin VMware: Step By Condition
Operation Status: 0
08/23/2014 02:44:52 - end VMware: Step By Condition; elapsed time 0:00:00
08/23/2014 02:44:52 - begin VMware: Read File List
Operation Status: 0
08/23/2014 02:44:52 - end VMware: Read File List; elapsed time 0:00:00
08/23/2014 02:44:52 - begin VMware: Create Snapshot
08/23/2014 02:44:52 - started process bpbrm (pid=10687)
08/23/2014 02:44:53 - Info bpbrm (pid=10687) rchsp01 is the host to backup data from
08/23/2014 02:44:53 - Info bpbrm (pid=10687) reading file list for client
08/23/2014 02:44:53 - Info bpbrm (pid=10687) start bpfis on client
08/23/2014 02:44:53 - Info bpbrm (pid=10687) Starting create snapshot processing
08/23/2014 02:44:53 - Info bpfis (pid=10698) Backup started
08/23/2014 02:44:53 - snapshot backup of client rchsp01 using method VMware_v2
08/23/2014 02:44:57 - Info bpbrm (pid=10687) INF - vmwareLogger: WaitForTaskCompleteEx: Unable to access file <unspecified filename> since it is locked <232>
08/23/2014 02:44:57 - Info bpbrm (pid=10687) INF - vmwareLogger: WaitForTaskCompleteEx: SYM_VMC_ERROR: TASK_REACHED_ERROR_STATE
08/23/2014 02:44:57 - Info bpbrm (pid=10687) INF - vmwareLogger: ConsolidateVMDisks: SYM_VMC_ERROR: TASK_REACHED_ERROR_STATE
08/23/2014 02:44:57 - Info bpbrm (pid=10687) INF - vmwareLogger: ConsolidateVMDisksAPI: SYM_VMC_ERROR: TASK_REACHED_ERROR_STATE
08/23/2014 02:45:00 - Critical bpbrm (pid=10687) from client rchsp01: FTL - VMware_freeze: VIXAPI freeze (VMware snapshot) failed with 36: SYM_VMC_TASK_REACHED_ERROR_STATE
08/23/2014 02:45:00 - Critical bpbrm (pid=10687) from client rchsp01: FTL - VMware error received: Unable to access file <unspecified filename> since it is locked
08/23/2014 02:45:01 - Info bpbrm (pid=10687) INF - vmwareLogger: RegisterExtensionAPI: SYM_VMC_ERROR: SOAP_ERROR
08/23/2014 02:45:01 - Info bpbrm (pid=10687) INF - vmwareLogger: SOAP 1.1 fault: "":ServerFaultCode [no subcode]
08/23/2014 02:45:01 - Critical bpbrm (pid=10687) from client rchsp01: FTL - vfm_freeze: method: VMware_v2, type: FIM, function: VMware_v2_freeze
08/23/2014 02:45:01 - Critical bpbrm (pid=10687) from client rchsp01: FTL -
08/23/2014 02:45:01 - Critical bpbrm (pid=10687) from client rchsp01: FTL - vfm_freeze: method: VMware_v2, type: FIM, function: VMware_v2_freeze
08/23/2014 02:45:01 - Critical bpbrm (pid=10687) from client rchsp01: FTL -
08/23/2014 02:45:01 - Critical bpbrm (pid=10687) from client rchsp01: FTL - snapshot processing failed, status 156
08/23/2014 02:45:01 - Critical bpbrm (pid=10687) from client rchsp01: FTL - snapshot creation failed, status 156
08/23/2014 02:45:01 - Warning bpbrm (pid=10687) from client rchsp01: WRN - ALL_LOCAL_DRIVES is not frozen
08/23/2014 02:45:01 - Info bpfis (pid=10698) done. status: 156
08/23/2014 02:45:01 - end VMware: Create Snapshot; elapsed time 0:00:09
08/23/2014 02:45:01 - Info bpfis (pid=10698) done. status: 156: snapshot error encountered
08/23/2014 02:45:01 - end writing
Operation Status: 156
08/23/2014 02:45:01 - end Parent Job; elapsed time 0:00:09
08/23/2014 02:45:01 - begin VMware: Stop On Error
Operation Status: 0
08/23/2014 02:45:01 - end VMware: Stop On Error; elapsed time 0:00:00
08/23/2014 02:45:01 - begin VMware: Delete Snapshot
08/23/2014 02:45:01 - started process bpbrm (pid=10733)
08/23/2014 02:45:01 - Info bpbrm (pid=10733) Starting delete snapshot processing
08/23/2014 02:45:02 - Info bpfis (pid=10777) Backup started
08/23/2014 02:45:02 - Critical bpbrm (pid=10733) from client rchsp01: cannot open /usr/openv/netbackup/online_util/fi_cntl/bpfis.fim.rchsp01_1408779892.1.0
08/23/2014 02:45:02 - Info bpfis (pid=10777) done. status: 4207
08/23/2014 02:45:02 - end VMware: Delete Snapshot; elapsed time 0:00:01
08/23/2014 02:45:02 - Info bpfis (pid=10777) done. status: 4207: Could not fetch snapshot metadata or state files
08/23/2014 02:45:02 - end writing
Operation Status: 4207
Operation Status: 156
snapshot error encountered (156)
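In case it helps anyone sift through a job details dump like the one above, here is a minimal Python sketch that splits the text on its timestamps and keeps only the Critical entries. It is not a NetBackup tool. The function names `split_entries` and `critical_entries` are made up for illustration, and it assumes every entry starts with a `MM/DD/YYYY HH:MM:SS - ` prefix as in the pasted output (untimestamped text such as "Operation Status: 0" stays attached to the entry before it).

```python
import re

# Assumption: each job detail entry begins with "MM/DD/YYYY HH:MM:SS - ".
# The lookahead splits *before* each timestamp without consuming it
# (splitting on an empty match needs Python 3.7 or later).
ENTRY_START = re.compile(r"(?=\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2} - )")


def split_entries(dump: str) -> list[str]:
    """Return one string per timestamped entry in a job details dump."""
    text = " ".join(dump.split())  # collapse line breaks and repeated spaces
    return [part.strip() for part in ENTRY_START.split(text) if part.strip()]


def critical_entries(dump: str) -> list[str]:
    """Return only the entries logged at Critical severity."""
    return [entry for entry in split_entries(dump) if " - Critical " in entry]


if __name__ == "__main__":
    # Two entries copied from the failing job above, pasted as one line.
    sample = (
        "08/23/2014 02:45:01 - Critical bpbrm (pid=10687) from client rchsp01: "
        "FTL - snapshot creation failed, status 156 "
        "08/23/2014 02:45:01 - Warning bpbrm (pid=10687) from client rchsp01: "
        "WRN - ALL_LOCAL_DRIVES is not frozen"
    )
    for entry in critical_entries(sample):
        print(entry)
```

Splitting on the timestamp rather than on newlines means it works whether the dump is pasted one entry per line or as one long run of text.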
The Windows admin says that a snapshot ends up being created on each backup but cannot be deleted for this host, because of the "locked" file mentioned in the screenshot above.
Has anyone else seen this? I have opened a case as well but was just curious what others have seen...
This only started occurring a couple of weeks ago; before that it was fine. About 2-3 hosts fail per week, out of about 150, and they are never the same hosts...