04-15-2014 12:12 PM
We have a backup running for 6 days now. The backup job is for a file server and the speed is less than 1MB/s. Ours is a new implementation/fresh install of DP 8.1 and this is the job that was switched over from Backup Exec. Please refer to the attached document picture.
Server1 is the backup server and cell manager. It has two network interfaces: production and backup VLAN.
Storage1 is the disk-based backup system (StoreEasy) running Windows 2012. It has two network interfaces: production and backup VLAN.
The current device configuration
- General:: device type : Backup to Disk, Interface Type : StoreOnce software dedup, Gateway
- Setting:: Media Type : StoreOnce software deduplication, Default Media Pool :
*ADVANCED :: Server-side deduplication is checked, CRC unchecked, Segment size is
- Policies :: Both gateway policies are checked (gateway may be used for restore, gateway may
2. Tape library device
The current backup configuration
To back up the file server, the filesystem spec was used. The file server runs Windows 2003. Here are the settings:
1. source : FileServer
2. Destination: FileServer_device
3. Options: No options for "Backup specification options", protection is 8 weeks, in Filesystem Options->Advanced->Other->Display statistical info is checked. Also, in WinFS Options -> Use shadow copy and "Backup share information for directories" is checked.
Storage1 has the StoreOnce component installed. To back up the data to Storage1 through the backup VLAN, a DNS record was created for that network interface and it was imported as a virtual host into the list of clients.
To copy this to a tape library, an automated object copy job was created. However, since the file server backup has been running for 6 days, we are not there yet.
Am I missing something??
04-15-2014 01:02 PM - edited 04-15-2014 01:38 PM
just some thoughts:
- Whenever I read something about <1MB/s throughput, the first question is: Can your source feed any faster? The archetypical office file server with lots of small files often can't, as a file system traversal for read deteriorates into pure random access patterns. So, did Backup Exec read faster? If so, how much faster?
- People often blame the LAN, I found that it is extremely unlikely to be the culprit (disk is way more likely, but for some reasons, nobody wants to believe how slow disk can become when not reading linearly). That being said, the LAN has to be tested. I prefer NetIO tests between the client and the server (media agent box) to tell me true expectable TCP throughput. With 1000Base interfaces, you should see something like 110MB/s in both directions. If you see significantly less than 80MB/s with all of the block sizes, fix the LAN first.
- With lots-of-small-files, an unexpected source of contention used to be the IDB. It was single threaded and when it maxed out the core it ran on, everything else came to a screeching halt. That, however, should be over with 8.x since the new IDB scales better and does things differently (file metadata in DCBFs mostly). Look for IDB disk contention anyway, as in practice, theory and practice aren't always the same.
- I would increase disk agent buffers (I always bump that to 32) and I think 50GB segment size might be a little large (even though it shouldn't hurt, it's mainly a question of fast positioning vs. read throughput on physical tape). I tend to use 8GB for that.
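To make the LAN check above concrete, here is a minimal sketch of what a NetIO-style test measures: raw TCP payload throughput with nothing else in the path. It is not NetIO and not part of DP. The function names, block size and transfer size are my own assumptions, and as written it runs sink and sender in one process over loopback. For a real test you would run the sink on one box and the sender on the other, then swap roles to check both directions.

```python
import socket
import threading
import time


def _sink(server_sock, result):
    """Accept one connection and count bytes until the sender closes."""
    conn, _ = server_sock.accept()
    total = 0
    with conn:
        while True:
            chunk = conn.recv(65536)
            if not chunk:
                break
            total += len(chunk)
    result["bytes"] = total


def measure_tcp_throughput(host="127.0.0.1",
                           total_bytes=8 * 1024 * 1024,
                           block_size=64 * 1024):
    """Send total_bytes to a local sink; return (bytes_received, MB/s)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, 0))          # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    result = {}
    receiver = threading.Thread(target=_sink, args=(server, result))
    receiver.start()

    payload = b"\0" * block_size
    sent = 0
    start = time.perf_counter()
    with socket.create_connection((host, port)) as client:
        while sent < total_bytes:
            n = min(block_size, total_bytes - sent)
            client.sendall(payload[:n])
            sent += n
    receiver.join()                 # wait until the sink has seen EOF
    elapsed = time.perf_counter() - start
    server.close()

    return result["bytes"], result["bytes"] / elapsed / (1024 * 1024)


if __name__ == "__main__":
    received, rate = measure_tcp_throughput()
    print(f"received {received} bytes at {rate:.1f} MB/s")
```

Loopback numbers say nothing about the wire, of course. The point is only the shape of the measurement: fixed block size, timed bulk transfer, one direction at a time.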
From what I know (even with 8.1 of which I know very little), your throughput is not typical and I would start to search for the reason at the source. You need a tool that can monitor storage duty cycle, in 2008R2 just open the task manager and from that call the resource monitor, have a look on the storage tab and check if a device operates beyond 80% saturation. In earlier Windows releases, best use perfmon which opens with a disk usage (Time%) graph automatically.
BTW, sometimes (but not very often) you will find no obvious source of a throughput issue, yet it remains - I've run into one such mystery myself. It's a backup session to tape, involving some TBs of VMDKs from a well performing (RAID 5) properly non-fragmented drive via GE. It typically runs at 90MB/s. There are, however, days when it chooses to run at another speed, and it will keep this slower speed for the entire session (which is several hours to more than a day). Speeds observed so far have been 60MB/s, 40MB/s and in some cases just 12MB/s. I repeat, it does not get any faster once it has started to run at 12MB/s, even though there is no CPU bound process, no disk saturation and no LAN throughput issue at all when this happens. I do have a theory that involves the RAM the media agent server is equipped with, which isn't entirely optimal (two 16GB modules attached to a 3-channel Xeon), where the layout in memory of the two processes that make up the BMA plays a role. But that's a theory, and like cold dark matter, it's not become less of a mystery by just theorizing. So far the only cure has been babysitting: If it runs too slow, abort the session and start it again. You will hit the proper throughput after some restarts, usually on the first or second try. Once achieved, it will stay fast the same way it always stays slow when it started thusly...
04-15-2014 03:26 PM
NetIO test between Server1 and Storage1 resulted in approx 110 MB/s for both Rx and Tx.
NetIO test between FileServer and Server1 had mixed results. FileServer Rx was approx 100 MB/s; however, its Tx averages less than 10 MB/s.
FileServer has large number of small files. Backup Exec took 40 hours to backup approx 2 TB of data to a tape library that was hooked to the fileserver locally. It was an old version of backup exec.
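For scale, the Backup Exec figure above can be turned into an average rate and compared with the roughly 1 MB/s seen now. This is a back-of-the-envelope sketch only: the helper names are made up, "approx 2 TB" is taken as exactly 2 TB, and binary units (1 TB = 1024 GB, 1 GB = 1024 MB) are an assumption.

```python
def throughput_mb_s(size_gb, hours):
    """Average rate in MB/s for size_gb gigabytes moved in the given hours."""
    return size_gb * 1024 / (hours * 3600)


def duration_days(size_gb, rate_mb_s):
    """Days needed to move size_gb gigabytes at a steady rate_mb_s."""
    return size_gb * 1024 / rate_mb_s / 86400


# Backup Exec baseline: about 2 TB in 40 hours (assumed exactly 2048 GB).
baseline = throughput_mb_s(2 * 1024, 40)
# The same 2 TB at the ~1 MB/s currently observed.
days_at_1mb = duration_days(2 * 1024, 1.0)

print(f"baseline: {baseline:.1f} MB/s")        # about 14.6 MB/s
print(f"at 1 MB/s: {days_at_1mb:.1f} days")    # about 24.3 days
```

So the old setup averaged somewhere around 14 to 15 MB/s, and at 1 MB/s a full pass over the same data would take more than three weeks.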
FileServer has 2 GB of RAM and it shows over 1GB as free. CPU average in the past two minutes is around 10%. Different volumes in the fileserver are iSCSI LUNs mounted from different SAN nodes.
Should I be changing my NIC card on the file server? Suggestions on how to proceed from here will be appreciated!
04-16-2014 01:35 AM
Before drawing conclusions, please try the NetIO test that showed the heavy asymmetry the other way around (regarding where you start the server and where the client). NetIO sometimes has issues with one of the directions. If the asymmetry is upheld, you have something to investigate further.
The 40h from BE can be considered a baseline, you will probably not get much faster than that. Given BE wrote locally, while DP is pumping the data out the NIC, it will probably interact with the iSCSI (unless that goes via a dedicated SAN NIC) and cause increased latencies, which will likely lengthen the backup a bit. It's less of a throughput issue (iSCSI bulk during backup is incoming, while DP DA to MA bulk stream is outgoing), so duplex should help with that, but it will push on the latencies to some extent.
BTW, if I interpret this correctly, the MA for the StoreOnce Software resides on the StoreEasy machine. Depending on where the gateway is located, you will also need to measure Fileserver --> Storage1, as this might go directly (I would prefer for the gateway to be on the SOS machine itself, else you would uselessly funnel the traffic through the backup server). If FileServer is a 64bit OS supported by SODA, you may also push a media agent there and try source side dedup. That will mostly help with subsequent backups only, though, and may be more sensitive to latencies which in turn may be pushed up by iSCSI.
That being said, the ultimate test tools in your situation are now media agents on FileServer, Server1 and Storage1, equipped with a Null device, and doing backups to the local Null device on FileServer (to establish a baseline on how fast the DA will read from the maze of little files, all alike), to the Null device on Server1 and to the Null device on Storage1. This rules out anything but source and LAN performance and is a good way to corner an issue hiding there, or somewhere else. I would go to measures of changing hardware only when I'm somewhat sure I've diagnosed the culprit correctly.
On how to prep a Null device on several platforms, just search this forum, ISTR that has been discussed here multiple times. And ignore the silly license warnings popping up because you supposedly now have more tape drives than you are allowed for. The license stuff doesn't really grok WORN (write once read never) devices ;)
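If you want a rough cross-check of the source read rate that is independent of DP, something like the following sketch does the same job in spirit as a backup to a local Null device: walk the tree, read every file, throw the data away, and time it. It is not a DP tool and does not use shadow copies. The function name and block size are assumptions, as is having a Python interpreter available on the box. Also note that a second run over the same tree will be flattered by the OS file cache.

```python
import os
import time


def read_tree(root, block_size=1024 * 1024):
    """Walk root, read every file and discard the data.

    Returns (files_read, bytes_read, seconds_elapsed).
    """
    files = 0
    total = 0
    start = time.perf_counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as fh:
                    while True:
                        chunk = fh.read(block_size)
                        if not chunk:
                            break
                        total += len(chunk)
                files += 1
            except OSError:
                # Locked or unreadable file: not counted as read, move on.
                continue
    return files, total, time.perf_counter() - start


if __name__ == "__main__":
    n, size, secs = read_tree(".")
    rate = size / secs / (1024 * 1024) if secs > 0 else 0.0
    print(f"{n} files, {size} bytes, {rate:.1f} MB/s")
```

If this crawls along at a few MB/s on the small-file volumes as well, the choke point is at the source, whatever backup product sits on top.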
04-16-2014 09:57 AM
The tests were performed by starting server in fileserver and client on Server1 first. Then, by starting client in fileserver and server in Server1. Both yielded the same results.
Yes, storeonce component is installed in Storage1. Else, it wouldn't let me create stores there. I wanted to get the first set of backup working hence used the Server1 gateway. I might be using the Storage1 gateway for other backups, provided this works first.
Will go ahead and try the null device option. HP support finally got in touch with me via email and have suggested that I create a standalone file device and a device file /dev/nul on windows system.
Really appreciate your time and effort. Thank you very much! I will update my further findings.
04-23-2014 10:31 AM
Backing up to a null device with Server1 as the gateway gave the same result: almost 1 MB/s. HP DP support so far has failed to live up to a standard.
I am thinking of activating another port in the fileserver and assigning it to a different vlan. And use that for the backup instead of the port we are using now. But I will see what the support has to say first (giving benefit of doubt) or if we can have someone else take a look at the case.
05-02-2014 12:09 PM
Backing up to a null device using the media agent on the cell manager gave the same result.
Backing up to a null device using a media agent installed on the fileserver improved the backup performance, but it wasn't impressive either. For instance, in the first case it took 20 hours to back up 90 GB and in the second case, it took over 4 hours with throughput 6.9MB/s. The old version of Backup Exec running on that machine did a better job.
So, I enabled two other NIC ports in the server (different network card) and assigned them to the backup VLAN (previously it was being backed up through the production network on a different VLAN). Teamed the cards and gave it a different IP. Created a new DNS record for that IP. And imported that DNS name as a Virtual host to the list of clients. Modified the backup spec to select the new imported client instead of the old one. Ran the backup job. The same thing @ 1.06 MB/s.
Apparently, the data is not being pulled out from the new interface. NetIO between the cell manager and fileserver with the new interface works fine. This POS software is not sucking the data from the NIC it is supposed to.
06-29-2014 05:37 AM
It's amazing how none of these issues ever seem to have a resolution. Is there any update here?
I think that's more of an effect of discussion on public forums by people who are too overworked to give feedback when they finally solve an issue, people who were just doing a test installation and moved on elsewhere, people who found their peace with the warts of a certain installation, etc. In cases I investigated myself directly, there is an explanation and often also a solution, most of the time. So far I ran into exactly one as-yet unexplained mystery case of a backup running slow sometimes, I detailed that before in this thread - because it's an exception. I'm convinced that even this is fundamentally an infrastructural problem exposed by the backup, only one I didn't nail yet.
Generally, backup applications, specifically complex, distributed enterprise backup systems, tend to expose all sorts of infrastructure problems that you didn't notice before. This leads to admins confronted with those issues blaming the backup system, when only some of that blame is actually deserved. Issues of LAN design and infrastructure (specifically DNS), attempts to solve them using multihoming to a backup LAN which are extremely hard to get right (specifically when a certain OS from Redmond is involved), an almost religious belief that certain issues cannot exist ("my disks were **bleep** expensive, so they must be fast, no way they are the choke point"), cargo cult ("it's SAN so it must be fast, the thing that is slow is LAN, everybody knows that") etc. pp. all intermingle and create a melange where it's easier to blame the messenger than to fix the underlying issues. The latter can also get impossibly complicated or expensive when infrastructures have grown warts and hairs organically over the years, further and further away from any sane DC and LAN design. IMO that explains 80% or more of the "this **bleep** software is slow" postulates you are confronted with in the wild, specifically in public discussion forums. Sadly it drowns out the really interesting 20% or less, where there actually might be an issue that has to be fixed by the vendor, not by the local IT staff.
That being said, I see one thing so far in this thread that DP might really have a card in: When local backup on a file server to a connected library using some other product is way faster than a VBDA profile of the same source volumes towards a null device, there may be something wrong. That can be demonstrated, and there should be a fix for it from the vendor (that's why we pay for the support). Separating out such cases from the background noise of infrastructural reasons for slow backups is the hard part.
Just some thoughts,