Big Data in Action: The Magic of Operations Analytics

BYOD, cloud, shadow IT and increasingly complex IT landscapes all mean that we simply can't put monitors on everything we are asked to manage.

 

So, we need to "record everything" that happens in our IT landscape, otherwise we get weird performance problems we can't diagnose, because by the time we get there, all the evidence has disappeared. But recording everything means we need to be able to store and analyze huge amounts of information, quickly. Enter big data. The solution to this problem is known as Operations Analytics.

 

Let's look at a story under two different sets of conditions. One without Operations Analytics. One with. Needless to say, the one with Operations Analytics turns out well, and everyone is happy.

 

Without Operations Analytics

As Kamla walks into the office on Monday morning, she can feel that trouble is brewing. She has barely put her stuff down at her desk when two of her team members walk over. “We’ve had a couple of spikes over the weekend.” The spikes have been the scourge of Kamla’s support team for the last two months -- temporary but severe dips in the performance of her company’s award-winning game, “Crabs.” But when her team looks into it, they can find nothing -- all the monitors show good performance. The first time a spike occurred, the business was sympathetic. Now, that sympathy is running thin.

 

Kamla and her team feel helpless. The game was a skunk-works project, developed by a team of “cool” contractors who went from concept to reality in an amazingly fast time. Some of the services used by Crabs are provided by public cloud, which is one of the reasons for the fast development time. Of course, Kamla can’t very well ask the public cloud provider to put her monitors on their servers.

 

The game runs on smartphones and tablets, and there is no way that Kamla can get monitors on the gamers’ devices either. So while her monitors show good performance during these spikes, Kamla knows that she isn’t seeing the whole story, and unless her guys are able to catch a spike as it occurs and dump as much information as they can, they are just going to have to get rid of the problem by a process of elimination.

 

One of Kamla’s team members described the situation well. He said that he feels like a detective who has been called to a murder scene a day after the event. The body has gone, all evidence has been washed away, and all witnesses have departed the scene. Sadly, the business has no appreciation for Kamla's predicament, and the CEO, who was never the most patient person, is becoming increasingly short with Kamla.

 

With Operations Analytics

As Kamla walks into the office on Monday morning, she can feel that trouble is brewing. She has barely put her stuff down at her desk when two members of her team walk over. “Over the weekend, we had a spike -- the performance of the Crabs game fell badly, but just for about eight minutes. We picked it up with our social media scanning -- we got a real slamming, especially on the ‘Hot Games’ website.”

 

OK - no time for a cup of tea; straight into the Ops bridge and onto the Operations Analytics console. Operations Analytics is a new technology. It stores all the logs, all the events and all the performance streams for the whole of the IT landscape in a big data database. It has to be big data because of the sheer volume of information and because Kamla is going to need to zip backward in time over millions of records a second.

 

They have the time of the spike at 14:45 yesterday, Sunday. So Kamla starts there; she asks the Operations Analytics system to literally turn back the clock, to show her the state of the game at the point of the spike. And then, she slowly goes back from the spike point, minute by minute until the point where the game’s performance is fine.
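The backward walk Kamla performs can be pictured with a small sketch. This is only an illustration, not how Operations Analytics is implemented: the per-minute response-time samples, the 200 ms threshold, the date and the function name `last_healthy_minute` are all assumptions invented for the example.

```python
from datetime import datetime, timedelta


def last_healthy_minute(samples, spike_time, threshold_ms):
    """Step back one minute at a time from spike_time until a sample is healthy.

    samples maps a minute (datetime) to a response time in milliseconds.
    Returns the most recent healthy minute at or before spike_time, or None.
    """
    if not samples:
        return None
    earliest = min(samples)
    t = spike_time
    while t >= earliest:
        latency = samples.get(t)
        if latency is not None and latency <= threshold_ms:
            return t
        t -= timedelta(minutes=1)
    return None


# Hypothetical data: the date is arbitrary, and the game is assumed to be
# slow (900 ms) from 14:41 to 14:48 and healthy (80 ms) otherwise.
DAY = datetime(2014, 1, 5)
samples = {
    DAY.replace(hour=14, minute=m): (900 if 41 <= m <= 48 else 80)
    for m in range(30, 51)
}
spike = DAY.replace(hour=14, minute=45)

healthy = last_healthy_minute(samples, spike, threshold_ms=200)
print(healthy.strftime("%H:%M"))  # prints 14:40
```

Once the last healthy minute is known, the logs and events recorded in the period leading up to it are the place to start looking for a cause.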

 

At this point, her team leans in close to the huge Operations Analytics screen. Operations Analytics is using the service model for the game to show all the logs, events and performance information. Carefully, Kamla and her team are taken back through time until … at 14:23 they notice a strange log entry --  “Crabs : all quiet- cut cloud resource by 6 servers.” Wow - that would do it - cutting the cloud resource allocated to the game by six servers under heavy load on Sunday afternoon would cause it to perform badly.

 

But why, then, does the game speed up again after the spike? When they run forward in time in Operations Analytics, they find a similar log message just after the spike clears: “Crabs : suffering - -increase cloud resource by 6 servers.”

 

So now they have the smoking gun. The Crabs program, for some reason known only to itself, is cutting the cloud resource incorrectly. Kamla takes the evidence to the apps team, who mutter something about the need to adjust their resource allocation algorithm slightly. A friend of Kamla’s in the testing team later heard a rumor that the resource allocation algorithm was created during an “all-nighter” when two programmers were testing the theory that, fuelled by Red Bull, they could program for 24 hours solid, except for comfort breaks.

 

Summary

In a world of cloud services (onto which we can’t put monitors), BYOD mobile devices (onto which we can’t put monitors) and applications which IT operations aren’t always fully informed about (onto which IT ops would put monitors if they knew about them), we need a new way of diagnosing problems. We need to record all the logs, the events and the performance information from our IT landscape. We can then use Operations Analytics to literally turn back the clock, using the service hierarchy to show us only the information pertinent to the application having problems.
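To make the idea of filtering by service hierarchy concrete, here is a minimal sketch. Everything in it is hypothetical: the service model, the component names, the record format and the date are invented for illustration, and only the 14:23 log message comes from the story above. It is not the product's data model or API.

```python
from datetime import datetime

# Hypothetical service model: each entry lists what a service depends on.
SERVICE_MODEL = {
    "crabs": ["crabs-app", "cloud-pool"],
    "crabs-app": ["game-db"],
    "payroll": ["payroll-db"],
}


def components_of(service, model):
    """Return the service plus everything beneath it in the hierarchy."""
    seen = set()
    stack = [service]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(model.get(node, []))
    return seen


def pertinent_records(records, service, model, start, end):
    """Keep only records from the service's components inside the time window."""
    wanted = components_of(service, model)
    return sorted(
        (r for r in records if r["source"] in wanted and start <= r["time"] <= end),
        key=lambda r: r["time"],
    )


DAY = datetime(2014, 1, 5)  # arbitrary date for the example
records = [
    {"time": DAY.replace(hour=14, minute=30), "source": "game-db",
     "message": "slow query"},  # invented record
    {"time": DAY.replace(hour=14, minute=25), "source": "payroll-db",
     "message": "backup complete"},  # invented record, unrelated service
    {"time": DAY.replace(hour=14, minute=23), "source": "crabs-app",
     "message": "Crabs : all quiet- cut cloud resource by 6 servers."},
    {"time": DAY.replace(hour=13, minute=0), "source": "crabs-app",
     "message": "heartbeat"},  # invented record, outside the window
]

window = pertinent_records(
    records, "crabs", SERVICE_MODEL,
    start=DAY.replace(hour=14, minute=0),
    end=DAY.replace(hour=14, minute=45),
)
for r in window:
    print(r["time"].strftime("%H:%M"), r["source"], r["message"])
```

The payroll record and the out-of-window heartbeat drop out, leaving only what matters to the game during the period under investigation.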

 

For more information on HP Operations Analytics, please go here. For more information on the use of big data in Operations Analytics in general, please go here.

 

Comments
Ian Bromehead(anon) | ‎03-04-2014 03:55 AM

Nice storyboard Mike

About the Author
Mike has been with HP for 30 years. Half of that time was in R&D, mainly as an architect. The other 15 years has been spent in product manag...


The opinions expressed above are the personal opinions of the authors, not of HP. By using this site, you accept the Terms of Use and Rules of Participation