03-16-2006 02:32 AM
For the last week or so, I've been noticing that my first backup of the evening has been clocking failures of one or two clients, always within the first ten minutes of the job.
Here's a sample of the error message:
[Major] From: BSM@dppgpo01.anpost.net "GPO1_Incr1" Time: 14/03/2006 18:08:39
[61:3003] Lost connection to VBDA named "I:"
on host ifsnts03.anpost.net.
Ipc subsystem reports: "IPC Write Error
System error:  Connection reset by peer"
Pretty standard, really...
What I find strange is that the problem, although occurring at or about the same time, generally affects different clients.
I've checked with our network team and they're not aware of anything unusual in terms of traffic or overhead.
The job in question is an Incr1 Filesystem backup with 23 clients divided into 101 objects. The job is set for load-balancing and writes to a 4-drive Compaq ESL9190 over a 1GBit link to our alternate datacentre.
The problem first appeared on March 6th and has happened each evening since then. I reset the backup last night to run a half-hour earlier but the problem recurred at the same point in the job.
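For anyone wanting to check the timing pattern across several nights, here is a rough Python sketch that pulls the "Lost connection" entries out of saved session messages and reports how far into the job each one occurred. It assumes the messages look exactly like the sample above. The function names and the 18:00 job start time are made up for illustration and are not part of Data Protector.

```python
import re
from datetime import datetime

# Patterns are based only on the sample message in this post (an assumption).
HEADER = re.compile(r'^\[\w+\] From: \S+ "[^"]*" Time: (\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})')
LOST = re.compile(r'Lost connection to \w+ named "([^"]*)"')
HOST = re.compile(r'^on host (\S+?)\.?$')

SAMPLE = '''[Major] From: BSM@dppgpo01.anpost.net "GPO1_Incr1" Time: 14/03/2006 18:08:39
[61:3003] Lost connection to VBDA named "I:"
on host ifsnts03.anpost.net.
Ipc subsystem reports: "IPC Write Error
System error:  Connection reset by peer"
'''

def lost_connections(log_text):
    """Return a list of (timestamp, host, object) for each lost-connection message."""
    events = []
    when = None
    obj = None
    for raw in log_text.splitlines():
        line = raw.strip()
        m = HEADER.match(line)
        if m:
            # A new message starts: remember its timestamp.
            when = datetime.strptime(m.group(1), "%d/%m/%Y %H:%M:%S")
            obj = None
            continue
        m = LOST.search(line)
        if m:
            obj = m.group(1)
            continue
        m = HOST.match(line)
        if m and obj is not None:
            events.append((when, m.group(1), obj))
            obj = None
    return events

def minutes_into_job(events, job_start):
    """Return (host, object, minutes since job_start) for each event."""
    return [(host, obj, (when - job_start).total_seconds() / 60)
            for when, host, obj in events]

if __name__ == "__main__":
    # Hypothetical job start time, used only for illustration.
    start = datetime(2006, 3, 14, 18, 0, 0)
    for host, obj, mins in minutes_into_job(lost_connections(SAMPLE), start):
        print(f"{host} {obj} failed {mins:.1f} min into the job")
```

Running this over each night's messages would show whether the failures really cluster at the same offset into the job regardless of which client is hit.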
My Cell Manager is DP5.10 on W2K3, and it and all my clients are fully patched.
I've searched the forum for similar threads, but nothing seems to match my situation.
I'd be interested to hear whether anyone has dealt with a problem like this before or has any suggestions...
05-25-2006 02:58 AM
Sorry for not replying sooner...
Unfortunately, the keepalive didn't work. However, we've traced the problem more closely to the network side of things, and our network team is working to resolve it.
It appears that the amount of backup traffic going over the network is compounded by the data replication jobs run by some of our larger applications, saturating the link.
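As a back-of-the-envelope illustration of how that adds up, here is a small Python sketch. The throughput figures are invented for the example (I don't have measured numbers to hand); only the 1 Gbit link and the four drives come from the setup described earlier.

```python
def link_utilisation(link_mbit_per_s, flows_mbyte_per_s):
    """Return combined demand as a fraction of link capacity.

    flows_mbyte_per_s maps a flow name to its rate in megabytes per second;
    each rate is multiplied by 8 to convert to megabits per second.
    """
    demand_mbit = sum(flows_mbyte_per_s.values()) * 8
    return demand_mbit / link_mbit_per_s

if __name__ == "__main__":
    # Hypothetical rates: 4 drives at 20 MB/s each, plus 60 MB/s of replication.
    flows = {"backup": 4 * 20, "replication": 60}
    ratio = link_utilisation(1000, flows)
    print(f"Demand is {ratio:.0%} of link capacity")
```

With those assumed rates the combined demand exceeds what the link can carry, which is the kind of situation where connections start getting reset.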
Thanks for your help.
04-14-2010 07:29 AM
This is Mario Garcia; I'm a network administrator. I have a similar problem right now. I know this happened to you many years ago, but did you manage to fix it? I also think this kind of message points to something related to network traffic. I have a Cisco network and a Check Point firewall between the server and the Cell Manager. I would appreciate any comments.
04-14-2010 08:14 AM
As you say, it's been quite a while since I posted this, and in the meantime both my backup and network infrastructures have undergone extensive upgrade and replacement.
I'm now using DP6.11 on a clustered W2K3 SAN-connected environment, and although I very occasionally still see this error, it's infrequent and really isn't as much of an issue as it was when we first began using DP.
I'd suggest switching on 'Reconnect broken connections' within backup scripts to see if that helps, but it's not a setting I use.
Sorry I can't be of more assistance...