Event Correlation: Rationalizing the event stream to see the wood amongst the trees

Following on from my blog post a few weeks back, I want to revisit a topic that I see quite a few questions about: event correlation technologies, and in particular Topology Based Event Correlation within the Operations Manager i product (OMi).

 

HP's approach to event monitoring is to distribute the processing and correlation activities throughout the network. The goal is to use low cost / low overhead event correlation in as many layers as possible, so that the event stream is refined before it is consolidated and arrives at the central event console.

Basically we try to reduce the amount of noise that is consolidated, and so reduce the resources needed to apply advanced correlation to LOTS of events.

So as the event pipeline is 'squeezed' using layers of correlation, we start applying more and more aggressive correlation (which requires more resources) to deliver the best guidance to the operations staff. Here is a simple example relating purely to event handling. I'm not talking about automation (automatic actions, runbooks etc.) here; that is another topic.

 

- Operations Manager agent technologies are the first stage – they try to minimize the number of alerts sent to the OM server using a variety of smart event processing and correlation technologies. This includes adaptive thresholds, time and count based correlation, de-duplication etc. The goal is to reduce noise “at source”, before it is sent.

 

Because the Operations Manager agents are distributed throughout the infrastructure, this is a highly scalable architecture. A little low overhead but smart correlation and noise reduction performed at each and every Operations Manager agent adds up to a very significant rationalization of the event stream that is forwarded to the central Operations Manager server. As you monitor more servers you add more Operations Manager agents, and so increase the ability to correlate the event stream in the depths of the IT infrastructure. And remember, Operations Manager agents can also be used as conduits for event streams from other technologies (including non-HP technologies), so the same smart event correlation can be applied to those event streams too.

 

- Next, the Operations Manager server further rationalizes the event stream coming from the agents. It applies additional correlation technologies (more duplicate analysis, good/bad correlation) and can also leverage Event Correlation Circuits to perform correlation across multiple sources. Once again the goal is to reduce noise and help operations staff to focus on what matters.

 

- The rationalized event stream (from one or more Operations Manager servers) can then be synchronized with OMi – and that is where the really smart Topology Based Event Correlation takes place.

 

- OMi can ALSO consume events from ‘other’ sources – including other HP Software (BPM, RUM, SiteScope, NNMi, Storage Essentials, SIM etc.) AND third party generated events. OMi has an open Interface Adapter that can be configured to consume events from pretty much any source and we include adapters for Microsoft SCOM and Nagios (more to follow soon). This means that the Topology Based Event Correlation (TBEC) discussion that follows can apply to ALL events generated by a monitoring solution comprising technologies from HP (server, network, application, storage etc.) AND third party monitoring solutions.

 

Because we cannot rely on third party providers to 'clean' their event stream, OMi can also apply some low cost correlation techniques (in addition to TBEC) to the events it receives, for example to eliminate duplicates.
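
To make the "low cost" layers a bit more concrete, here is a minimal sketch of two of the techniques mentioned above: duplicate suppression within a time window, and good/bad correlation where a "normal" event clears an earlier problem. This is only an illustration of the general ideas, not how the Operations Manager agents, the OM server or OMi implement them. The `Event` fields, the function names and the 60 second window are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # Hypothetical event shape, invented for this illustration.
    ts: float       # arrival time in seconds
    node: str       # where the event came from
    key: str        # what is being reported, e.g. "cpu_high"
    severity: str   # "critical", "warning" or "normal"


def suppress_duplicates(events, window=60.0):
    """Drop an event if the same (node, key, severity) was seen
    no more than `window` seconds earlier (sliding window)."""
    last_seen = {}
    kept = []
    for ev in sorted(events, key=lambda e: e.ts):
        ident = (ev.node, ev.key, ev.severity)
        prev = last_seen.get(ident)
        last_seen[ident] = ev.ts  # every occurrence restarts the window
        if prev is not None and ev.ts - prev <= window:
            continue
        kept.append(ev)
    return kept


def good_bad_correlate(events):
    """A 'normal' event clears the open problem for the same (node, key).
    Returns the problems that are still open, oldest first."""
    open_problems = {}
    for ev in sorted(events, key=lambda e: e.ts):
        ident = (ev.node, ev.key)
        if ev.severity == "normal":
            open_problems.pop(ident, None)
        else:
            open_problems[ident] = ev
    return sorted(open_problems.values(), key=lambda e: e.ts)


raw = [
    Event(0, "web01", "cpu_high", "warning"),
    Event(10, "web01", "cpu_high", "warning"),         # duplicate, suppressed
    Event(20, "db01", "disk_full", "critical"),
    Event(30, "web01", "cpu_high", "normal"),          # clears the warning
    Event(200, "db01", "disk_full", "critical"),       # outside the window, kept
]

deduped = suppress_duplicates(raw)
still_open = good_bad_correlate(deduped)
for ev in still_open:
    print(ev.node, ev.key, ev.severity)
```

Each stage is cheap and only looks at the events themselves, which is why this kind of filtering can run in many places before the more expensive correlation is applied centrally.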

 

OMi’s biggest differentiator is Topology Based Event Correlation (TBEC). I really cannot stress enough how unique a solution TBEC is.

 

Often when event management vendors (our competitors) talk about "topology based correlation" they are describing technologies based on a snapshot of the topology of the IT infrastructure, with a static set of correlation rules attached to the snapshot. As soon as the IT environment topology changes, the snapshot and correlation rules are out of date.

 

This means that the guidance that the correlation engine can provide to the operations staff - i.e. what events matter, what to prioritize - is no longer current. Best case, the operations staff end up chasing false alarms; worst case, they miss something that impacts a business service because the correlation rules did not understand the significance of the events.

 

For example, if you have a clustered environment or a dynamic virtual server farm, then as your cluster resources or virtual guests move around, the ability to understand the impact of (e.g.) a hardware event on that virtualized resource and the business service that it supports is compromised. You really don't know if the hardware event that you just got impacts virtual guest "A" or whether that guest has moved to another virtual server host.
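
The virtual guest "A" problem can be shown with a small sketch. It compares an impact lookup against a point-in-time snapshot of the topology with the same lookup against a model that is kept current. This is purely an illustration of the stale snapshot issue and says nothing about how TBEC or any competing product is built. The `Topology` class, the host and guest names and the service names are all hypothetical.

```python
import copy


class Topology:
    """Hypothetical model: which host each guest runs on,
    and which business service each guest supports."""

    def __init__(self):
        self.guest_host = {}
        self.guest_service = {}

    def place(self, guest, host, service=None):
        """Record (or update) where a guest runs; optionally set its service."""
        self.guest_host[guest] = host
        if service is not None:
            self.guest_service[guest] = service

    def impacted_services(self, host):
        """Business services supported by guests currently on `host`."""
        return sorted(
            self.guest_service[guest]
            for guest, h in self.guest_host.items()
            if h == host and guest in self.guest_service
        )


live = Topology()
live.place("guest-A", "host-1", "online-shop")
live.place("guest-B", "host-2", "payroll")

# Static correlation rules are written against this point in time.
snapshot = copy.deepcopy(live)

# Later, guest A migrates to another host. Only the live model knows.
live.place("guest-A", "host-2")

# A hardware event arrives for host-1.
stale_view = snapshot.impacted_services("host-1")    # false alarm for online-shop
current_view = live.impacted_services("host-1")      # nothing is impacted

# A hardware event arrives for host-2.
stale_view_2 = snapshot.impacted_services("host-2")  # misses online-shop
current_view_2 = live.impacted_services("host-2")

print(stale_view, current_view)
print(stale_view_2, current_view_2)
```

The snapshot produces both kinds of failure described above: a false alarm for the host the guest has left, and a missed business impact on the host it moved to.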

 

The other repercussion of 'static' models is that if you want them to TRY to continue to deliver value to the front line staff, then you have to invest a lot of expert time in maintaining the correlation rules. We know from talking with customers who are using some of our competitors' products that this approach does not scale. Usually customers just end up using the lowest common denominator correlation because that is all that they can trust unless they invest in massive amounts of maintenance.

 

When we talk about TBEC we are referring to a unique solution underpinned by a dynamic model of the IT environment that actively adapts the way event correlation happens as the IT infrastructure flexes - and without a huge administrative / maintenance burden.

 

TBEC can deliver very high quality guidance to IT staff even in dynamic environments.

 

So how does OMi TBEC work? I'll blog on that topic tomorrow, so tune in for more details.
