Set threshold values correctly for good monitoring, because not every day is Sunday - best practice #2

In the world of IT operations monitoring, the balance between excessive alerting and insufficient monitoring is achieved only by setting the right thresholds for alert generation.

 

ITIL definition of a threshold: "The value of a metric that should cause an alert to be generated or management action to be taken. For example, 'Priority 1 incident not solved within four hours', 'More than five soft disk errors in an hour', or 'More than 10 failed changes in a month'."

 

When the right threshold settings are not in place, critical alerts are not sent when needed, or a flood of false positives is generated. In both situations, the line of business starts pointing fingers at the monitoring team and distrusts the central IT or Operations Bridge. In short, it is a worrisome situation for the IT organization, and the central IT team has to scramble to correct it.

 

First and foremost, we should not forget best practice #1 - don't blind yourself by monitoring only utilization rates; also watch for long queues.

 

Second, we must keep in mind that thresholds need to be set appropriately. In most cases thresholds need to differ between hosts (e.g. UNIX, Windows or OpenVMS), and between instances of disks and network cards attached to the same host. There is no 'one size fits all' approach here, not any more at least. This forces us to ask ourselves:

 

How do we arrive at the right threshold?

 

In some cases, where the workloads are not important - dev-test environments, for instance - you simply set the basic bar of 'not more than 70 percent utilization for a continuous period of ten minutes/two intervals'. There's no need to modify this further unless special tests, stress loads or new application workloads are being deployed.
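To make this concrete, here is a minimal sketch of such a static rule (plain Python, not tied to any particular monitoring product); the 70 percent bar and the two-interval persistence window are the only knobs:

```python
# Minimal sketch of a static threshold rule: alert only when utilization
# stays above 70% for two consecutive samples (illustrative values).
UTIL_THRESHOLD = 70.0      # percent
CONSECUTIVE_BREACHES = 2   # e.g. ~10 minutes at a 5-minute sample interval

def check_static_threshold(samples):
    """samples: list of utilization percentages, oldest first."""
    breaches = 0
    for value in samples:
        breaches = breaches + 1 if value > UTIL_THRESHOLD else 0
        if breaches >= CONSECUTIVE_BREACHES:
            return True   # sustained breach -> raise an alert
    return False

# A short spike does not alert; a sustained one does.
print(check_static_threshold([65, 90, 40, 55]))   # False
print(check_static_threshold([65, 75, 82, 55]))   # True
```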

 

For important systems, the threshold can be derived from past history. If there's a sufficient amount of historical data for the system, or for a similar 'candidate' system (another system running a similar workload), that data can be used to analyze trends and ascertain the best thresholds to set for the node. At a basic level, you arrive at just one set of thresholds from a baseline obtained from the trend of the performance data.
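As a rough sketch of how a single threshold can be derived from history - a simple percentile-plus-margin heuristic, not the product's actual algorithm:

```python
def threshold_from_history(samples, percentile=95, margin=5.0):
    """Derive a static threshold from historical utilization samples:
    the given percentile of past values plus a safety margin, capped at 100%."""
    ordered = sorted(samples)
    index = max(0, int(len(ordered) * percentile / 100) - 1)
    return min(100.0, ordered[index] + margin)

# Example: hourly CPU samples from the node (or a similar 'candidate' node).
past_samples = [42, 55, 61, 58, 47, 70, 66, 52, 49, 63, 59, 68]
print(threshold_from_history(past_samples))   # 73.0 with this sample data
```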

 

In reality, however, you need to set different thresholds for night-time and day-time workloads. And of course there's that third Saturday of every month when the backup jobs run.

 

For all of these cases, special thresholds must be applied at the given times. For example, you can set the thresholds really high at night - in effect, dumbing down or turning off monitoring past evening in 8-to-5 environments. During backup windows, the thresholds need to be set high enough to allow for disk usage increases without setting off alarm bells in the central IT team, because the increase is known to be transient: the backup agent fills a drive with backup files, and once they are backed up, the files are removed. However, a genuine 'disk full' situation must not go unnoticed, since the backup job can fail in such a case. Again, historical trend data is useful here to set the threshold.
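A sketch of the idea, with illustrative time windows and values - in practice this would be expressed through policy schedules or calendars in the monitoring tool rather than code:

```python
from datetime import datetime

def disk_threshold_for(now: datetime) -> float:
    """Pick a disk-usage threshold (%) based on the time window (illustrative values)."""
    # Third Saturday of the month: backup jobs legitimately fill the disk.
    if now.weekday() == 5 and 15 <= now.day <= 21:
        return 95.0
    # Night window in an 8-to-5 shop: relax monitoring.
    if now.hour >= 20 or now.hour < 6:
        return 90.0
    # Normal business hours.
    return 80.0

print(disk_threshold_for(datetime(2014, 3, 15, 23, 0)))  # third Saturday -> 95.0
print(disk_threshold_for(datetime(2014, 3, 12, 11, 0)))  # weekday, daytime -> 80.0
```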

 

[Figure: baseline-chart.png]

This is a baseline chart drawn using HP Performance Manager (I still use the old name, HP OVPM, out of familiarity).

The chart shows a metric's average value (CPU usage % in this case) at different times of day and on different days of the week. With this, one can effectively gauge a system's range of values and know when the 'busy' and 'lean' times are.

 

Adaptive Thresholds

 

Then there are machines with constantly high workloads, with trends ranging between 95 and 97 percent utilization. Obviously standard thresholds are not going to work here. For these systems, a baseline thresholding approach can be adopted. The good news is that there is no need to set thresholds manually. Let the monitoring use a learning approach (based on past history again) and set a baseline derived from that history for each hour of the day. For instance, as we all know, the Monday morning workload on a system (when folks are getting back to work after the weekend) is not going to be anything like the Saturday evening workload (when everybody is out partying!).

 

While calculating the baseline, we look at the system's past behavior within a given time window over the last few weeks or months, and allow for a small amount of deviation. If the utilization crosses the variance band, we send an alert reporting a potential abnormality on the system. I encourage you to read the attached document to understand this better.

(Agreed, this is the Swiss lock of monitoring, while the basic monitoring approach is akin to the simple door bolt).
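Roughly, the idea looks like this - a simplified sketch using a per-hour-of-week mean and standard deviation band; the baselining algorithm in the actual product is more sophisticated:

```python
import statistics
from collections import defaultdict

# history[(weekday, hour)] -> past utilization samples for that hour-of-week slot
history = defaultdict(list)

def record(weekday, hour, value):
    history[(weekday, hour)].append(value)

def is_abnormal(weekday, hour, value, band_sigmas=2.0):
    """Flag a value that falls outside mean +/- band_sigmas * stdev
    of past observations for the same hour-of-week slot."""
    past = history[(weekday, hour)]
    if len(past) < 5:                      # not enough data yet to judge
        return False
    mean = statistics.mean(past)
    dev = statistics.pstdev(past) or 1.0   # avoid a zero-width band
    return abs(value - mean) > band_sigmas * dev

# Monday 09:00 normally runs hot; 96% is normal, 60% may signal a stalled workload.
for v in (94, 96, 95, 97, 96, 95):
    record(0, 9, v)
print(is_abnormal(0, 9, 96))   # False - within the learned band
print(is_abnormal(0, 9, 60))   # True  - unusually low, worth a look
```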

 

NOTE: The adaptive baselining policies work best with high-utilization servers running standard workloads. Deploying these policies on 'test' systems with varied workloads or low utilization rates can cause the adaptive baseline detection algorithm to detect multiple anomalies and send out noisy alerts.

 

Parameterization

Far too often, the thresholds set by the central IT team (even with the best intent) do not sufficiently match the users' needs. Meeting those needs calls for a certain amount of customization, and that is best left in the hands of the app teams or line-of-business folks. This is where the concept of splitting the threshold away from central monitoring comes in: the thresholds can be modified outside of the monitoring software and the monitoring teams. This is a great feature that can be used in dev-test as well as production environments.

 

For disks this becomes especially important, considering one disk may host the database and see heavy IOPS and space consumption, while the root file system or /var is almost always constant in utilization. The threshold settings for the volume hosting the database should therefore be set by the database experts rather than the system admins, or the generic IT monitoring team.

 

A gigabit NIC (network card) cannot and must not have the same thresholds as a standard 10/100 NIC - again, you need custom settings per instance here.
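One simple way to picture this split (a sketch of the idea, not OMi-MA itself) is a small per-instance override file that the app or DB team maintains, with central defaults as the fallback; the instance names and values below are hypothetical:

```python
import json

# Hypothetical per-instance overrides maintained by the app/DB teams,
# kept outside the monitoring software; central defaults act as fallback.
DEFAULTS = {"disk_util_pct": 80.0, "nic_util_pct": 60.0}
OVERRIDES_JSON = """
{
  "/oradata":  {"disk_util_pct": 92.0},
  "/var":      {"disk_util_pct": 70.0},
  "eth0-1Gb":  {"nic_util_pct": 75.0}
}
"""

def threshold_for(instance, metric):
    overrides = json.loads(OVERRIDES_JSON)
    return overrides.get(instance, {}).get(metric, DEFAULTS[metric])

print(threshold_for("/oradata", "disk_util_pct"))   # 92.0 - set by the DBAs
print(threshold_for("/tmp", "disk_util_pct"))       # 80.0 - central default
```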

 

I would recommend going with parameterization, or using baseline thresholding, to monitor a performance counter and send alerts. The rule of thumb: if there's enough data to arrive at a baseline, and the system is so busy - almost always showing high utilization rates - that it is difficult to pick fixed threshold values, go for adaptive baseline thresholding.
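Stated compactly (the cut-off values here are judgment calls, not hard rules):

```python
def choose_strategy(weeks_of_history, typical_util_pct):
    """Rule-of-thumb selector: adaptive baselining needs enough history and
    suits systems that run hot; otherwise use static/parameterized thresholds."""
    if weeks_of_history >= 4 and typical_util_pct >= 90:
        return "adaptive baseline thresholding"
    return "static or parameterized thresholds"

print(choose_strategy(8, 96))   # adaptive baseline thresholding
print(choose_strategy(1, 45))   # static or parameterized thresholds
```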

 

 

HP OMi-Monitoring Automation

 

HP OMi's Monitoring Automation provides capabilities to define parameters in monitoring. Using the concepts of management templates and monitoring aspects, OMi-MA gives the monitoring designer ways to set cascading default values, not just for the thresholds used in monitoring but also beyond that, extending into user credentials and other application settings.

 

You can download HP Operations Manager i here.

 

 

 

[Figure: omi-ma-tuning.png]

 

Conclusion

Let me know your thoughts on threshold setting. Feel free to add your comments below.

 

Comments
Yogish | ‎03-08-2014 12:10 PM
Very informative article for IT folks managing the Systems and in charge of setting Thresholds. They will benefit a lot from the parametrization to handle specific instances.