Hello all,
For some time now we have been getting intermittent failures in our Exchange 2010 DAG backup.
The failures only occur in the full backup (on the weekend); the incremental backups during the week run without problems.
We see this error in the NetBackup Activity Monitor: socket write failed(24)
Job details:
13-Sep-14 9:13:33 PM - begin writing
13-Sep-14 11:29:18 PM - Critical bpbrm(pid=26883) from client dag.xx.xx: FTL - socket write failed
13-Sep-14 11:29:20 PM - Error bptm(pid=27781) media manager terminated by parent process
13-Sep-14 11:44:58 PM - Info bpbkar(pid=14560) done. status: 24: socket write failed
13-Sep-14 11:44:58 PM - end writing; write time: 2:31:25
socket write failed(24)
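As a quick sanity check on the timings above, here is a minimal Python sketch that works out how far into the write phase the failure occurred. The timestamps are copied from the job details; the format string and the variable names are my own assumptions, not anything NetBackup provides.

```python
from datetime import datetime

# Assumed layout of the Activity Monitor timestamps, e.g. "13-Sep-14 9:13:33 PM"
FMT = "%d-%b-%y %I:%M:%S %p"

def parse(ts: str) -> datetime:
    """Parse one job-detail timestamp (English month abbreviations assumed)."""
    return datetime.strptime(ts, FMT)

begin_writing = parse("13-Sep-14 9:13:33 PM")
socket_failure = parse("13-Sep-14 11:29:18 PM")
end_writing = parse("13-Sep-14 11:44:58 PM")

time_to_failure = socket_failure - begin_writing   # write phase before the FTL
total_write_time = end_writing - begin_writing     # should match "write time"
after_failure = end_writing - socket_failure       # time until bpbkar reports done

print("time to failure :", time_to_failure)
print("total write time:", total_write_time)
print("after failure   :", after_failure)
```

The total matches the "write time: 2:31:25" reported by the job. The failure itself comes 2:15:45 into the write, about 15 minutes before bpbkar reports done.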
Looking in the application log on the Exchange client, we see two errors at the time of the failure:
Application Log
13-Sep-14 11:29:18 PM
eventid 401
Instance 1: The physical consistency check has completed, but one or more errors were detected. The consistency check has terminated with error code of -106 (0xffffff96).
eventid 403
Instance 1: The physical consistency check successfully validated 4191658 out of 12526160 pages of database '\\?\GLOBALROOT\Device\HarddiskVolumeShadowCopy15\MB014\MB014.edb'. Because some database pages were either not validated or failed validation, the consistency check has been considered unsuccessful.
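To put the page counts from event 403 in perspective, the following Python sketch computes how much of the database had been validated when the check stopped. The page counts come from the event text; the 32 KB page size is an assumption (it is the usual page size for Exchange 2010 databases, but I have not verified it for this database).

```python
# Page counts taken from event 403
validated_pages = 4191658
total_pages = 12526160

# ASSUMPTION: 32 KB database pages (typical for Exchange 2010, not confirmed here)
PAGE_SIZE_BYTES = 32 * 1024
GIB = 1024 ** 3

fraction_validated = validated_pages / total_pages
validated_gib = validated_pages * PAGE_SIZE_BYTES / GIB
total_gib = total_pages * PAGE_SIZE_BYTES / GIB

print(f"validated: {fraction_validated:.1%} of pages")
print(f"approx. {validated_gib:.0f} GiB of {total_gib:.0f} GiB (if pages are 32 KB)")
```

So the consistency check got through roughly a third of MB014.edb before it terminated with -106.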
NetBackup Logging on the Client
In the bpbkar log on the Exchange client we see:
21:13:39.549 [8160.16316] <4> V_Snapshot::V_Snapshot_ExcludeRemoteFiles: INF - Excluding /\\?/Volume{4390bc2e-a934-11e2-8296-005056ac2864}/pagefile.sys
23:29:18.402 [14560.14712] <16> tar_tfi::processException:
An Exception of type [SocketWriteException] has occured at:
Module: @(#) $Source: src/ncf/tfi/lib/TransporterRemote.cpp,v $ $Revision: 1.55 $ , Function: TransporterRemote::write[2](), Line: 338
Module: @(#) $Source: src/ncf/tfi/lib/Packer.cpp,v $ $Revision: 1.91.94.2 $ , Function: Packer::getBuffer(), Line: 653
Module: tar_tfi::getBuffer, Function: D:\NB\NB_7.6.0.3\src\cl\clientpc\util\tar_tfi.cpp, Line: 311
Local Address: [::]:0
Remote Address: [::]:0
OS Error: 10060 (A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
)
Expected bytes: 524288
23:29:18.433 [14560.14712] <2> tar_base::V_vTarMsgW: FTL - socket write failed
23:29:18.433 [14560.14712] <4> tar_backup::backup_done_state: INF - number of file directives not found: 0
23:29:18.433 [14560.14712] <4> tar_backup::backup_done_state: INF - number of file directives found: 5
23:29:18.433 [14560.12468] <4> tar_base::keepaliveThread: INF - keepalive thread terminating (reason: WAIT_OBJECT_0)
23:29:18.448 [14560.14712] <4> tar_base::stopKeepaliveThread: INF - keepalive thread has exited. (reason: WAIT_OBJECT_0)
23:29:18.464 [14560.14712] <2> tar_base::V_vTarMsgW: INF - EXIT STATUS 24: socket write failed
23:29:18.464 [14560.14712] <4> tar_backup::backup_done_state: INF - Not waiting for server status
23:29:18.464 [14560.14712] <2> ov_log::V_GlobalLog: ERR - endChksgfilesCCheck:ErrTerm() failed with error code -106.
23:29:18.464 [14560.14712] <2> exchange_shadowcopy_access::V_CloseForRead(): ERR - consistency check failed for 'Microsoft Information Store:\MB014\'
23:29:18.464 [14560.14712] <2> tar_base::V_vTarMsgW: WRN - Exchange Validation for 'Microsoft Information Store:\MB014\' failed. Please refer to the backup and application event logs for more details.
23:29:18.464 [14560.14712] <2> ov_log::V_GlobalLog: ERR - endChksgfilesCCheck:ErrTerm() failed with error code -1029.
23:29:18.480 [14560.14712] <4> dos_backup::tfs_reset: INF - Snapshot deletion start
23:29:18.480 [14560.14712] <4> ov_log::OVLoop: Timestamp
23:29:18.480 [14560.14712] <4> OVStopCmd: INF - EXIT - status = 0
23:29:18.495 [14560.14712] <2> tar_base::V_Close: closing...
23:29:18.495 [14560.14712] <4> dos_backup::tfs_reset: INF - Snapshot deletion start
23:29:18.604 [14560.14712] <2> ov_log::V_GlobalLog: INF - BEDS_Term(): enter - InitFlags:0x00000001
23:31:18.803 [14560.14712] <4> OVShutdown: INF - Finished process
23:31:18.803 [14560.14712] <4> WinMain: INF - Exiting C:\Program Files\Veritas\NetBackup\bin\bpbkar32.exe
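Since the bpbkar log is long, a small filter helps to pull out only the problem lines. The Python sketch below is one way to do that. The line layout (time, [pid.tid], <severity>, message) is inferred from the excerpt above rather than taken from NetBackup documentation, and the function name, the severity threshold and the tag list are my own choices.

```python
import re

# Layout inferred from the excerpt: "HH:MM:SS.mmm [pid.tid] <sev> message"
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\[(?P<pid>\d+)\.(?P<tid>\d+)\] "
    r"<(?P<sev>\d+)> "
    r"(?P<msg>.*)$"
)
# Message tags seen in the excerpt that indicate a problem (hypothetical list)
TAGS = ("FTL -", "ERR -", "WRN -")

def problem_lines(lines):
    """Yield (time, severity, message) for lines that look like errors."""
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # continuation lines (stack details etc.) have no header
        if int(m["sev"]) >= 16 or any(tag in m["msg"] for tag in TAGS):
            yield m["time"], int(m["sev"]), m["msg"]

sample = [
    "23:29:18.402 [14560.14712] <16> tar_tfi::processException:",
    "23:29:18.433 [14560.14712] <2> tar_base::V_vTarMsgW: FTL - socket write failed",
    "23:29:18.433 [14560.14712] <4> tar_backup::backup_done_state: INF - number of file directives found: 5",
    "23:29:18.464 [14560.14712] <2> ov_log::V_GlobalLog: ERR - endChksgfilesCCheck:ErrTerm() failed with error code -106.",
    "23:29:18.464 [14560.14712] <2> ov_log::V_GlobalLog: ERR - endChksgfilesCCheck:ErrTerm() failed with error code -1029.",
]

hits = list(problem_lines(sample))
for time, sev, msg in hits:
    print(time, sev, msg)
```

To run it against a real log, pass an open file object instead of `sample`. Note that in the excerpt the ERR and FTL messages are logged at severity <2>, which is why the filter also looks at the message tags and not only at the severity number.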
Symantec Tech Note
We found:
http://www.symantec.com/business/support/index?page=content&id=TECH136986
and we set the shadow copy storage to "No limit". We did this only on the disks holding the database (both active and passive copies).
We did this three weeks ago.
The backups ran fine for two weeks and it looked as though this had solved the problem.
But no... this weekend the problem was back again.
Every time the failure occurred we reran the backup of the failed databases, and the rerun always completed successfully.
Our environment:
- Master: Windows 2008 R2, NetBackup version 7.6.0.3
- Media servers: Windows 2008 R2, NetBackup version 7.6.0.3
- The policy is set to back up the passive copy and, if that is not available, the active copy
- Snapshot method: VSS
- Exchange 2010 DAG, NetBackup version 7.6.0.3
- It is a 4-node DAG
- The database has an active and a passive copy
- We only back up the database
Any help on where we can look to resolve this problem would be much appreciated.
David