06-22-2014 07:31 PM
My client is upgrading from an ML350 G5 to an ML350e G8 server in a 15-client environment. Initial trials show the new server gives significantly slower data access and processing times than the existing one.
Even when working directly on the server (ML350e G8), application start times, screen paints and data population are significantly slower than on the old server (ML350 G5).
Can I define each scenario and ask if anyone can suggest where improvements are most likely to come from?
Old Server: ML350 G5. 2 x Xeon E5345 (QC 2.3GHz, 8M, 1333MHz). 4GB RAM. Smart Array E200i with 128MB cache and 6 HDDs: two 3-drive RAID 5 arrays of 300GB 15K SAS drives, giving 558GB each on the C: drive and the G: drive. Windows SBS 2003 R2 is on drive C:, and the Business Critical Application (BCA) is also on drive C:. The second RAID 5 array (558GB) is general file storage.
New Server: ML350e G8. 2 x QC Xeon E5-2403, 64GB RAM, B120i with zero cache memory. There are 2 x 2TB drives in RAID 1 for drive C: and 2 x 2TB drives in RAID 1 for drive G:.
The upgrade plan includes a separation of SBS from mission critical DB onto virtual servers and, hopefully, no loss of performance.
The ML350e G8 has Win2008R2 Ent installed on the 2TB C: drive. Another Win2008R2 Ent instance with the mission critical software is in a Hyper-V machine on D: (a partition on the H: drive). Server WINSBS011 is also in a Hyper-V machine, on the C: drive partition.
- Will the delays I am seeing be from the slower HDDs (7.2K SATA vs 15K SAS), from the RAID setup (RAID 1 vs RAID 5), or the slower CPU speed (1.8GHz vs 2.3GHz)?
- Would adding the battery-backed write cache help the B120i much?
- Would I get much improvement by adding a PCI RAID controller, drive cage and 10K/15K SAS drives in RAID 5?
- Would this SAS drive cage add much noise to the unit?
- What other options exist to improve data read times?
06-22-2014 10:47 PM
>Will the delays I am seeing be from slower HDD (7.2k sata vs 15k sas), from raid setup (Raid 1 vs raid5), or the slower CPU speed (1.8 vs 2.3)?
Of those changes, only the RAID level works in your favor: RAID 1 is typically faster than RAID 5.
And having fewer disks won't help either.
Having lots more memory does help, about the only thing in your favor.
06-23-2014 04:37 AM
I just read that the minimum spec for Exchange 2010 is 2GHz processors, and I have 1.8GHz. Despite plenty of RAM, could the system be thrashing? Does such a state still exist? I did notice core 1 was constantly at 100% and the page fault rate seemed really high.
06-23-2014 04:54 AM
The disk controllers are going to be a factor. The old box has a hardware-based Smart Array controller with cache memory; the new server has a software-based Smart Array with no cache. Also, SAS drives are faster than SATA.
Installing the B-series cache module, or installing a P-series controller, would help.
06-23-2014 05:27 AM
I ordered the cache module (and the battery, which I forgot on the first order) direct from HP. I had already seen data showing that the cache greatly improves disk transfer speeds.
Installed a few hours ago and no improvement.
My very nervous question is: do I need to increase read/write throughput with a P-series controller, or do I need two higher-performing CPUs?
The base OS (Windows 2008 R2 Ent with only Hyper-V running) was installed via Intelligent Provisioning.
Given the unresolved query of high page faults and a core at 100%, should I reinstall the base OS and re-attach the VHDs?
06-23-2014 07:47 AM
Your old server had CPUs with a higher clock speed, but it was an older-generation processor... I'd say your actual CPU power is probably not too much different from before.
The additional RAM should be a good thing, but only if your server was running into some memory limitations before. Exchange can use a lot of memory, but it depends on how many mailboxes and how long it's been running so it can start to cache things.
But the biggest change is probably the lowering of the performance on the disk side of things. RAID 1 is faster than RAID 5 if you're comparing the same controller and drives, but you went from 15K SAS drives to 7.2K SATA, and one fewer drive (from 3 in an array to 2). For high disk I/O, you want more spindles.
Also with Exchange, and again, this depends on how many users and all that, you want your databases and logs on different physical disks, just like a SQL server. The MDB files could live on a RAID 5 okay since they tend to be read more, and the EDB log files would prefer a system with faster writes, so RAID 1 or 1+0.
In other words, stuff as many drives into it as you can. For example, rather than 2 x 2TB drives in a RAID 1, if you really need that much space, you'd be better off stuffing it with 15 x 146GB, or 8 x 300GB, or 6 x 600GB in a RAID 5. Though those would be RAID 5, more spindles would mean its write performance is bound to be better than a single pair of disks in a RAID 1.
Or look at some RAID 50 options too, or if you had the budget, RAID 1+0 is going to seriously improve the write performance.
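To illustrate why spindle count matters for writes, here is a rough sketch; the per-drive IOPS figures and RAID write-penalty factors are generic rules of thumb, not measurements of these particular servers:

```python
# Rough random-write IOPS estimate using common rule-of-thumb numbers.
# Assumptions: ~75 IOPS per 7.2K SATA drive, ~175 per 15K SAS drive,
# RAID 1 write penalty = 2, RAID 5 write penalty = 4.

def write_iops(drives: int, per_drive_iops: int, write_penalty: int) -> float:
    """Very rough aggregate random-write IOPS an array can sustain."""
    return drives * per_drive_iops / write_penalty

# New server: 2 x 2TB 7.2K SATA in RAID 1
raid1_sata = write_iops(2, 75, 2)    # 75.0

# Old server: 3 x 300GB 15K SAS in RAID 5
raid5_old = write_iops(3, 175, 4)    # 131.25

# Example alternative: 6 x 600GB 15K SAS in RAID 5
raid5_sas = write_iops(6, 175, 4)    # 262.5

print(raid1_sata, raid5_old, raid5_sas)
```

On these rule-of-thumb numbers, even the old 3-drive RAID 5 out-writes the new RAID 1 pair, which lines up with the slowdown you're describing.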
Having some memory on the array controller itself is a good idea, but given the memory on the server itself I'd imagine that Windows disk caching is probably doing a somewhat decent job, although it won't do any write caching unless you set that on the disks.
Your best overall plan though is to fire up performance monitor and add some basic things to check. CPU, memory usage, etc. For the disks, I always found the disk queue lengths to be a good way to check how Exchange was doing. If it gets too high, your programs are waiting for disk I/O before it can do other things, and that's bad. It's been a while but I remember anything over an average queue length of 2 was a sign of trouble.
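If you export the counter samples to CSV, a few lines of script can average the queue length and flag trouble against that rule-of-thumb threshold of 2. The file name and column name below are hypothetical, and the sample values are made up:

```python
# Hypothetical post-processing of exported perfmon samples; the 'DiskQueue'
# column and file name are assumptions, not output of any specific tool.

QUEUE_WARN = 2.0  # rule of thumb: average queue length above ~2 suggests I/O wait

def check_samples(samples, threshold=QUEUE_WARN):
    """Return (average queue length, True if it exceeds the threshold)."""
    avg = sum(samples) / len(samples)
    return avg, avg > threshold

# Reading real samples might look like:
# import csv
# with open("perfmon_export.csv") as f:
#     samples = [float(row["DiskQueue"]) for row in csv.DictReader(f)]

avg, busy = check_samples([0.4, 1.2, 5.8, 3.3])  # made-up sample values
print(f"avg queue = {avg:.2f}, disk-bound: {busy}")
```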
06-23-2014 08:05 AM
I hate to say it, but you're probably right that the new server is underpowered in CPU and disk from the old one, with only the RAM update being a definite improvement. Your client needs to spend a bit more. The good news is, the ML350e Gen8 has a LOT of room for upgrades... they just happened to pick the most underpowered model to start with, but it's all better from there.
06-23-2014 08:17 AM
The primary workload is not Exchange 2010; it's just part of SBS2011.
Of the 15 clients, only 5 use Outlook and Exchange to any great degree. The internal SharePoint site is more important to the client. However, the mission critical application is on the other virtual HD. Both virtual machines run frustratingly slowly.
I wonder if I should rebuild the two-disk VHD setup on a different machine and see if I can replicate the delays; that might isolate hardware versus configuration versus software as the issue.
Does anyone have other suggestions?
I am very happy to get advice here.
06-23-2014 02:02 PM
So you have 2 virtual machines running on there. Are both of the VHD files on that same G: drive? And to confirm, "G:" is a RAID 1 with a pair of 2TB SATA drives?
I'm assuming you're using Hyper-V as the host. I guess it depends on what kind of workload is running on those virtual machines. How many CPUs and how much memory are assigned to each of the virtual systems? You could assign 8 CPUs to each machine... if they both (and the host OS) are all super busy, then yeah, they'll have to schedule their slices, but odds are they're not ALL going to be super busy at the same time too often. And if one virtual machine is more important than the other, just set the priorities accordingly.
You had mentioned you saw 1 core running at 100% use. Is that one of the 8 cores total then? Is whatever app that's using all that CPU doing something to set the affinity to a single core, or was it spreading it out, just using 12.5% of the total system CPU?
You can probably just look in task manager either on the host system or in the virtual machines to see what's using so much CPU and why it doesn't seem to be a multi-threaded app.
And again, just like with my answer above regarding Exchange, if you have a couple of virtual machines sharing a disk, if they're doing some I/O intensive things, you could be running into some contention. The disk queue length and read/write per second stats in performance monitor can help you get a better idea.
Getting to the right answer depends on being able to figure out where exactly the bottleneck is. My gut instinct tells me it's the disk subsystem, but when you say there's something using 100% of a CPU that means it's either not multi-threaded, or it's affinity was set to a single CPU.
If that is the case, and some single-core app is your main workload, then going to a lower clock speed could very well cause problems. I've seen my fair share of "dumb" non-threaded apps, some of them by HP themselves (I'm looking at you, VCRM). In those cases, you want the fastest speed possible for a single core, regardless of how many total cores, hyper-threading, etc. the system has.
For a well-behaved multi-threaded app, it's possible to go from a higher speed system to a slightly slower one but with more total cores and still see an overall improvement. You also benefit from the improvements from one generation of Xeon to the next, even at the same clock speeds.
You could check some of the benchmarks on CPU to see how the E5-2403 compares to the E5345. The good folks over at Passmark are a good resource:
E5-2403: 3489 (avg CPU mark)
E5345: 2973 (avg CPU mark)
So in theory, the E5-2403 is a bit faster even though it runs at a slower clock rate. I'm still thinking disk is your slow point.
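To put numbers on that, here is simple arithmetic on the Passmark figures quoted above and the clock speeds already mentioned in the thread (no new benchmark data):

```python
# Rough comparison from the quoted Passmark marks and clock speeds.
e5_2403_mark, e5_2403_ghz = 3489, 1.8
e5345_mark, e5345_ghz = 2973, 2.3

aggregate = e5_2403_mark / e5345_mark  # overall throughput ratio, ~1.17x
per_ghz = (e5_2403_mark / e5_2403_ghz) / (e5345_mark / e5345_ghz)  # work per clock

print(f"aggregate: {aggregate:.2f}x, per GHz: {per_ghz:.2f}x")
```

So the newer core gets roughly half again as much done per clock tick, which is why the 1.8GHz parts can still come out ahead overall, but not for a single-threaded app pinned to one core.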
06-23-2014 03:10 PM
SBS2011 is on drive C:. The second server, with the mission critical application, is on the separate RAID 1 drive.
Hyper V is the host and absolutely nothing else is loaded on the machine.
When looking at the core usage on the virtual machines, they seem normal. The core running at 100% is the first core on the physical machine. I just wonder if this could be related to the high page faults; could the machine be thrashing?
06-24-2014 09:47 AM
I don't remember if Windows still handles system interrupts with just the first core. If there's a lot of disk thrashing that could account for it. Check out "Process Explorer" from Sysinternals or just the Windows task manager and see if the CPU usage is from a particular program or from interrupt handling.
I'm thinking that in newer versions of Windows, it can handle interrupts on multiple CPUs, smartly, so I don't know if that would explain it, but at least using one of those tools you can see exactly what process is munching away on CPU cycles. If the interrupts are all being caused by the same subsystem, then I guess it would tend to keep the interrupts on the same core. Whenever I've seen high interrupt usage, it's caused by something doing disk thrashing, like the stupid Windows Media Library thing scanning all my videos and songs.
It does seem odd that it's only using 100% of one core. Unless it's "sticky" (pinned with affinity) to a particular core, even if it's not a multi-threaded app the OS would still spread the load over all the CPUs more or less equally.
For instance, VCRM (version control repository manager) is non-threading and when it's doing catalog validation it will use 100% of a single thread, but on a multi-core system the OS still spreads that out over all the cores. If you have 4 cores, the overall usage is 25% and the graphs for individual cores will show peaks and valleys as the OS slices up the load.