Monitor the job queue, ascertain system bottlenecks - best practice #1

Let's start the new year with a re-reading of the basics :).

 

How many times have you had a red-alert situation that indicated nothing more than high CPU usage? Don't you wish the alert were 'intelligent' enough to tell you that the problem is only transient and not really a bottleneck?

 

In the normal course of a server's work, spikes in usage levels are to be expected; in fact, they show good use of the money spent on the horsepower :) In this blog article I will discuss how to differentiate these 'spikes' from a real bottleneck situation.

 

bottleneck-img.png

 

Before I go further: what really is a bottleneck?

"bottleneck is a phenomenon where the performance or capacity of an entire system is limited by a single or limited number of components or resources."

Source: Wikipedia

 

OK, now that we have that out of the way, why is a bottleneck different from a high usage situation? A high resource usage situation indicates potential problems in the future. A bottleneck situation is a current problem, and there's no easy way to restore the system to equilibrium short of terminating a few 'offending' processes, which amounts to an unplanned outage.

 

For example, if you are just starting up MS Outlook on your laptop, there's a high CPU usage situation, but it will pass, as only the initial loading takes time. If, however, you are stuck with a Windows 'hang' situation, that could be due to a bottleneck, the limiting factor there being the number of CPU cores on your laptop.

 

cpu-b1.png

Why does the number of CPU cores matter? The more cores you have, the more processes can be executed concurrently on your server. So if you have only a few cores on your servers, you can run only proportionately few processes at once. If there are too many active processes, they are queued for processing in the job queue, also known as the CPU run-queue.

 

If you have a long run-queue exceeding the number of cores on the system, then potentially there's a problem.

If the run-queue is short, near zero, the problem is somewhat different: you spent thousands of dollars on a system and it is lying idle.
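
To make the two cases concrete, here is a minimal Python sketch of the comparison described above. The function name, the labels it returns and the `idle_threshold` value are my own assumptions for illustration; they do not come from any HP product.

```python
def classify_run_queue(run_queue_len: float, num_cores: int,
                       idle_threshold: float = 0.1) -> str:
    """Compare the run-queue length with the number of cores.

    idle_threshold is a hypothetical cut-off for "near zero".
    """
    if num_cores <= 0:
        raise ValueError("num_cores must be positive")
    if run_queue_len > num_cores:
        # More runnable processes than cores: some of them are waiting.
        return "possible bottleneck"
    if run_queue_len <= idle_threshold:
        # Almost nothing is waiting or running: capacity is going unused.
        return "underused"
    return "healthy"


print(classify_run_queue(12, 4))   # long queue on a 4-core system
print(classify_run_queue(0.0, 4))  # queue near zero
```

On a Unix-like host, `os.getloadavg()` and `os.cpu_count()` can serve as rough inputs, keeping in mind that the load average is only an approximation of the instantaneous run-queue.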

 

There is plenty of white paper material that will tell you this.

 

For those of you who use HP Operations Manager / Agents: you can use the GBL_RUN_QUEUE, GBL_CPU_TOTAL_UTIL and GBL_NUM_CPU metrics to monitor for and detect a CPU bottleneck situation. You can find an example implementation in the HP OM InfraSPI policies (the SI-CPUBottleneckDiagnosis policy).
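
As a rough illustration of how these three metrics could be combined, here is a Python sketch. It is not the logic of the SI-CPUBottleneckDiagnosis policy; the `CpuSample` type, the 90% threshold and the "three consecutive samples" rule are assumptions I picked to show how a sustained condition can be told apart from a transient spike.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class CpuSample:
    run_queue: float     # plays the role of GBL_RUN_QUEUE
    cpu_util_pct: float  # plays the role of GBL_CPU_TOTAL_UTIL
    num_cpu: int         # plays the role of GBL_NUM_CPU


def is_cpu_bottleneck(samples: Sequence[CpuSample],
                      util_threshold_pct: float = 90.0,
                      min_consecutive: int = 3) -> bool:
    """True if high utilization AND a run-queue longer than the core
    count persist for min_consecutive samples in a row.

    Both thresholds are hypothetical; tune them for your environment.
    """
    streak = 0
    for s in samples:
        if s.cpu_util_pct >= util_threshold_pct and s.run_queue > s.num_cpu:
            streak += 1
            if streak >= min_consecutive:
                return True
        else:
            streak = 0  # the condition cleared, so it was only a spike
    return False


spike = [CpuSample(1, 30, 4), CpuSample(9, 98, 4), CpuSample(1, 35, 4)]
sustained = [CpuSample(9, 97, 4)] * 3
print(is_cpu_bottleneck(spike), is_cpu_bottleneck(sustained))
```

Requiring the condition to hold across several samples is what keeps a one-off spike from raising a red alert.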

 

Can we do something similar with memory monitoring? Should we? The answer is 'yes' to both. However, we look at this a bit differently: alongside memory usage levels, we monitor pageouts to ascertain how persistent the resource constraint is. If memory usage levels are high and the pageout rate is also high, that is a good indication of a memory bottleneck (see the SI-MemoryBottleneckDiagnosis policy).
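
The same idea for memory can be sketched in a few lines of Python. Again, this is not the SI-MemoryBottleneckDiagnosis policy itself; the function name, the labels and both thresholds are assumptions for illustration only.

```python
def diagnose_memory(mem_util_pct: float, pageouts_per_sec: float,
                    util_threshold_pct: float = 90.0,
                    pageout_threshold: float = 10.0) -> str:
    """Separate 'high usage' from a likely memory bottleneck.

    Both thresholds are hypothetical placeholders.
    """
    high_usage = mem_util_pct >= util_threshold_pct
    heavy_paging = pageouts_per_sec >= pageout_threshold
    if high_usage and heavy_paging:
        # Memory is nearly full and pages are being pushed out to disk.
        return "bottleneck"
    if high_usage:
        # Full, but not yet paging: a warning about the future.
        return "high usage"
    return "ok"


print(diagnose_memory(95.0, 50.0))
print(diagnose_memory(95.0, 0.0))
```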

 

Let me know if you like these kinds of tips for monitoring system performance. While this might be common knowledge, I do get asked this question quite often by customers and by people new to system monitoring.

 

Feel free to reach out to me in the comments section below if you have experienced how a bottleneck is different from a high usage situation. I would also love to hear from you if you have any questions related to the topic.

 

Learn more

If you are interested in finding out how the HP System Management suite of products can help you monitor your systems infrastructure, visit the HP Operations Manager i software site or the HP System Management sites.

Comments
| ‎01-06-2014 10:19 PM

This really helps!  Looking for more like this :)

Valued Contributor | ‎01-07-2014 06:15 AM

Thank you, Ramkumar. Do you have any further tips for looking at bottlenecks on virtual machines?

Honored Contributor | ‎01-07-2014 06:31 AM

Hi Stefan, yes there are lots of cases where virtualization can throw a curve ball in performance monitoring. I will write a blog article on this soon - thanks for the idea.

Patrik Batsching | ‎01-14-2014 06:28 AM

Hi Ram,

 

thanks a lot for your refreshing blog on CPU/memory monitoring.

 

Are there any additional considerations to care about, if we talk about virtualized environments - meaning we talk about vCPUs and vMEM instead of physical?

Does HP provide corresponding monitoring policies / tools to cover that as well?

 

Best regards,

 

  Patrik

Honored Contributor | ‎01-14-2014 07:32 AM

Hi Patrik, appreciate your reading through this blog post. Thanks for bringing up the point about virtual machines.

 

I plan to cover some detail around monitoring / detecting bottlenecks in virtual systems, in a different article. However the important thing to keep in mind is that metrics like CPU-ready time and other virtualization-induced wait times need to be taken into consideration. We must not forget that there are many problems at the physical layer (the host running the virtual machines) as well as right-sizing of the VMs. For a start, the readers can have a look at related points from one of my earlier blog posts here.

TYOoi(anon) | ‎01-27-2014 08:35 PM

Hi Ram,

 

Good article!!!

 

Just wondering: what you pointed out is for OM Agents. How about SiteScope? Are there any metrics to look out for to detect CPU and memory bottlenecks if we only have SiteScope?

 

Thanks,

TY

Honored Contributor | ‎01-28-2014 02:20 AM

Hi, yes, it is possible to instrument this using something like a custom script monitor in SiteScope. The default CPU monitor looks only at CPU usage. For detailed monitoring like this, it is better to go with agent-based collection.

 

There's a trade-off that you would typically make between remote monitoring and agent-based monitoring in general. Remember that even if you did a remote check with SiteScope once every 10 minutes, you would get only the data collected at that instant, and so you might miss a spike that happened earlier within the 10-minute period.

 

While you can reduce the polling interval, that in turn increases the logins and logouts on the system, especially on UNIX/Linux. There's definitely a network latency tax as well.
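
To illustrate the sampling point, here is a small Python simulation. The data is made up for the example (one CPU reading per minute, with a three-minute spike), and `polled_max` is a hypothetical helper, not a SiteScope or agent function.

```python
def polled_max(series, interval):
    """Highest value seen when reading only every interval-th sample."""
    return max(series[::interval])


# 30 minutes of per-minute CPU readings: 20% baseline,
# with a spike to 100% during minutes 12 to 14.
cpu_per_minute = [20.0] * 30
for minute in (12, 13, 14):
    cpu_per_minute[minute] = 100.0

remote_view = polled_max(cpu_per_minute, 10)  # reads minutes 0, 10, 20
agent_view = polled_max(cpu_per_minute, 1)    # reads every minute

print(remote_view, agent_view)
```

The 10-minute poll reads minutes 0, 10 and 20, so it never sees the spike, while the per-minute collection does.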

Christoph Pfister(anon) | ‎02-05-2014 04:56 PM

Hi Ram, nice post!

Josephgreenfield(anon) | ‎02-18-2014 02:51 AM

Thanks for your guidelines. Your link is very helpful. queuing system 

| ‎04-01-2014 05:25 AM

Hi Ram Kumar,

 

This really makes things clear.

Thanks for your post.

I would like to see more like this in the forum.

 

Regards,

Binod

 

About the Author
Ramkumar Devanathan (twitter: @rdevanathan) works in the IOM-Customer Assist Team (CAT) providing technical assistance to HP Software pre-sa...


The opinions expressed above are the personal opinions of the authors, not of HP. By using this site, you accept the Terms of Use and Rules of Participation