Using Operations Analytics to solve IT problems

About a year ago, while on a customer visit, we were discussing IT issues and solutions and the following story came up.

The customer told us that one of his primary applications had begun running slowly a month earlier. They received complaints from users and also saw performance numbers drop in their monitoring tools. The slowdown was inconsistent: one minute response time was fine, and a minute later it took forever, and it continued like this. Response time kept fluctuating at night too, when there are far fewer users. The application owner knew there were no recent updates or patches for this application, and none of the incoming events seemed relevant.

After a couple of days (and many hours of investigation) they accidentally found that a server running in debug mode had caused all this trouble.

What would have happened if they had Operations Analytics? What could they have done to shorten the time to resolution? What could they have done to reduce the number of hours invested in solving the issue?  Well… a lot!

With HP Operations Analytics they could have viewed a dashboard for this problematic application (you can prepare one for each application up front, or ad hoc as you need it). Operations Analytics collects data all the time, so you can view it and use it whenever you need it.

So when the problem was reported, they could have simply opened the dashboard. There they would have easily seen the rises in response time, just as they saw them in their own monitoring tools. But Operations Analytics does not show only response time; it also shows availability changes, server metrics, event counts, log messages and more.

But that’s not all. By using the time slider they can easily focus on the time when response time started increasing:

 1 slider.png

 

 

The time slider affects all the dashboard panes. This allows the user to look for changes in other metrics and log messages that happened at the same time.
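The idea of one time window driving every pane can be sketched in a few lines of Python. This is only an illustration of the concept, not how Operations Analytics is implemented: the function names, the pane names and all the sample values below are made up for the example.

```python
from datetime import datetime, timedelta
from statistics import mean


def in_window(samples, start, end):
    """Keep the (timestamp, value) samples with start <= timestamp < end."""
    return [(ts, value) for ts, value in samples if start <= ts < end]


def window_means(panes, start, end):
    """Apply the same time window to every pane and average what is left."""
    result = {}
    for name, samples in panes.items():
        rows = in_window(samples, start, end)
        if rows:  # skip panes that have no data in the window
            result[name] = mean(value for _, value in rows)
    return result


# Hypothetical one-minute samples for three dashboard panes.
t0 = datetime(2014, 1, 1, 9, 0)
minutes = [t0 + timedelta(minutes=i) for i in range(10)]
panes = {
    "response_time_s": list(zip(minutes, [1, 1, 1, 1, 1, 5, 6, 5, 6, 5])),
    "disk_io_mb_s": list(zip(minutes, [10, 10, 10, 10, 10, 40, 42, 41, 40, 42])),
    "log_msgs_per_min": list(zip(minutes, [20, 20, 20, 20, 20, 200, 210, 190, 200, 200])),
}

# Move the "slider": compare the five minutes before and after 09:05.
split = t0 + timedelta(minutes=5)
before = window_means(panes, t0, split)
after = window_means(panes, split, t0 + timedelta(minutes=10))

for name in panes:
    print(f"{name}: {before[name]} -> {after[name]}")
```

Because every pane is cut by the same window, a metric that changes together with response time stands out as soon as you compare the two windows.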

In our case, they would have found that disk IO on one of the application servers and the log message rate both went up at about the same time. The playback feature can help pinpoint the exact time when the issue started:

2 play.png

 

They can then select the time window when the issue started:

 3 small slider.png

 

For the selected time window, they can now review the log messages that were written. If there are any issues, there is a good chance you can find one or more log messages that explain the root cause. The time-based correlation improves your chances of finding these relevant messages.

Looking at the log messages, it is immediately clear that there are more than a few messages containing the word “Debug” on the same server that shows high disk IO. Transactions use this server inconsistently, and therefore the performance problem was intermittent. The cause is now clear, and it took minutes instead of days to figure it out.
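The same time-based correlation can be illustrated outside the product with a short Python sketch: keep only the log messages written inside the selected window, then count the ones that mention "debug" per server. The record layout, host names and messages here are hypothetical and exist only for the example; real log data would need its own parsing.

```python
from collections import Counter
from datetime import datetime, timedelta


def keyword_counts_by_host(records, start, end, keyword="debug"):
    """Count messages per host that fall in [start, end) and mention keyword."""
    counts = Counter()
    for ts, host, message in records:
        if start <= ts < end and keyword in message.lower():
            counts[host] += 1
    return counts


# Hypothetical (timestamp, host, message) log records.
t0 = datetime(2014, 1, 1, 9, 5)
records = [
    (t0 + timedelta(seconds=5), "app01", "INFO request served"),
    (t0 + timedelta(seconds=10), "app02", "DEBUG entering handler"),
    (t0 + timedelta(seconds=12), "app02", "DEBUG query text dumped"),
    (t0 + timedelta(seconds=20), "app02", "DEBUG leaving handler"),
    (t0 + timedelta(seconds=30), "app03", "INFO request served"),
    (t0 + timedelta(minutes=30), "app01", "DEBUG outside the window"),
]

# The selected window: the five minutes after the issue started.
counts = keyword_counts_by_host(records, t0, t0 + timedelta(minutes=5))
suspect, hits = counts.most_common(1)[0]
print(f"{suspect} wrote {hits} debug messages in the window")
```

Restricting the count to the window matters: the debug line from app01 half an hour later is ignored, so only the server that was noisy when the slowdown began is flagged.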

 

4 page.png 

Metrics and Log messages for the application

 

5 page.png

Metrics and Log messages – Focus on the start time

 

6 page.png

Log messages (with DEBUG) – Focus on the start time

 

HP Operations Analytics shortens the time to resolve business issues with a single-pane-of-glass view. It presents application metrics, system metrics and log messages in one dashboard with a time-based focus, letting you drill down from a performance issue to the log messages that explain it.

To learn more about Operations Analytics visit us at www.hp.com/go/opsanalytics

 

 

Comments
AnilGarg | ‎01-27-2014 04:19 PM

This is a classic IT Operations issue and, as stated, Analytics plays an important role. What I would add is that if it had a business service impact model and change event feeds (when a system/process/application was changed to debug options), it would have reduced the time further, or even detected the problem when the first abnormality appeared after the change. In IT, most problems can be directly correlated to a change. Therefore, in any IT Operations Analytics, collecting change events and associating them with the issue is very important. There are BSM solutions which will do that.

ALazneeza | ‎05-20-2014 11:37 PM

In any ongoing operations, Analytics plays a vital role in mapping the collected data into useful information like this. BSM has demonstrated how to detect issues using performance metrics data; this is what should be sold internally as well for our partner data, which is in migration status right now. IT Operations Analytics plays a big role in keeping things in order when everything goes LIVE.

About the Author
Architect and User Experience expert with more than 10 years of experience in designing complex applications for all platforms. Currently in...
The opinions expressed above are the personal opinions of the authors, not of HP. By using this site, you accept the Terms of Use and Rules of Participation.