Consolidated event management challenges in large IT environments

One problem that affects many IT organizations as they try to keep a handle on monitoring their IT infrastructure is the sheer volume of events generated by the multitude of different monitoring tools.

 

Monitoring IT elements is pretty straightforward nowadays - operating systems, servers, applications, storage and network devices are all capable of generating 'events' or publishing utilization metrics, which can be used for threshold-based event generation.
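To make threshold-based event generation concrete, here is a minimal sketch in Python. The function name, the field names and the two-level warning/critical scheme are assumptions made for illustration; they are not taken from any particular monitoring product.

```python
from typing import Optional


def threshold_event(node: str, metric: str, value: float,
                    warning: float, critical: float) -> Optional[dict]:
    """Return an event dict if the metric crosses a threshold, else None.

    Hypothetical helper: assumes higher values are worse (e.g. % disk used).
    """
    if value >= critical:
        severity = "critical"
    elif value >= warning:
        severity = "warning"
    else:
        # Below both thresholds: nothing worth reporting.
        return None
    return {"node": node, "metric": metric, "value": value, "severity": severity}
```

A disk that is 85% full against thresholds of 80 and 95 would produce a 'warning' event here, while a reading of 72% would produce nothing at all.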

 

Often element-focused consoles are provided to capture and display the events, but running multiple element-focused consoles is generally recognized as inefficient because it does not provide a clear way to monitor the overall performance of the IT infrastructure from the perspective of the Business Services.

 

It is also less than optimal for IT staff, as personnel working on separate element-focused monitoring consoles can end up chasing the same incident. Typically an issue with a single IT element will cause multiple elements to generate events because of the tightly connected nature of modern IT infrastructure - so a disk space issue results in events from the disk monitoring tools, the database that cannot expand its transaction log and the application that was trying to commit a bunch of transactions to the database. The cause of the issue is the disk space shortage, so database or application administrators investigating the events in their standalone consoles are (initially) wasting their time. They will be unable to resolve the issue until the server administrator frees up some disk space.

 

So most organizations realize that consolidating all event monitoring into a single console is the rational approach - and that introduces a number of common challenges.

 

The first challenge is consolidating the events in a consistent manner. Often the central consolidated console can help by taking events which are in various formats and mapping them to a single uniform structure. This means that the events can be displayed in a coherent manner - so operations staff have a common understanding of what the event means, irrespective of its source. Products like HP Operations Manager and OMi are great at doing this kind of consolidation.
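As a rough illustration of what mapping events to a single uniform structure means, the Python sketch below normalizes two made-up input formats into one common record. The `Event` fields, the raw field names and the severity vocabularies are all hypothetical; this is not how HP Operations Manager or OMi represent events internally.

```python
from dataclasses import dataclass


# Hypothetical uniform event structure; field names are illustrative only.
@dataclass(frozen=True)
class Event:
    source: str
    node: str
    severity: str  # one of: "normal", "warning", "major", "critical"
    message: str


# Assumed severity vocabularies for two invented source formats.
SYSLOG_SEVERITY = {"info": "normal", "warn": "warning", "err": "major", "crit": "critical"}
SNMP_SEVERITY = {1: "normal", 2: "warning", 3: "major", 4: "critical"}


def from_syslog(raw: dict) -> Event:
    return Event("syslog", raw["host"], SYSLOG_SEVERITY[raw["level"]], raw["msg"])


def from_snmp(raw: dict) -> Event:
    return Event("snmp", raw["agent_addr"], SNMP_SEVERITY[raw["sev"]], raw["text"])


# One mapper per source format; adding a new source means adding one entry.
MAPPERS = {"syslog": from_syslog, "snmp": from_snmp}


def normalize(kind: str, raw: dict) -> Event:
    """Map a raw, source-specific event onto the uniform structure."""
    return MAPPERS[kind](raw)
```

Each additional source format needs its own mapper and its own upkeep, which is one way to see why the number of separate element managers drives the cost of the consolidation architecture.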

 

In general, using fewer 'separate' element managers to capture events is the best approach because it reduces the complexity of the overall event consolidation architecture - less configuration of event mappings is required, and less maintenance is incurred for the solution as it evolves and components are updated. If multiple element managers are to be used, it is always worth considering only technologies which have pre-built, vendor-provided integrations into the event consolidation platform. HP Software understands the value of this approach (lots of our customers tell us they want to do it) and that is why we are releasing a series of pre-built integration adapters for third-party elements (and discovery data) for OMi. Adapters for SCOM and Nagios are already available and we will release more in the near future.

 

Once all the events are consolidated, another challenge appears. The sheer volume of events can make it hard to see the ones that matter. There are a number of approaches that can be employed to make the consolidated event stream that is presented to operations staff manageable, and I'll discuss some of HP Software's "best practices" here and in a later blog post that I have planned.

 

The first thing to realize with events is that more is not better. Wherever possible, monitoring technologies need to filter and consolidate events as close to the source as possible. Sending thousands of 'informational' or duplicate events to the central event consolidation platform is pointless - it burns network bandwidth, adds no value for operations staff and consumes resources on the central console. Technologies such as Operations Manager agents have multiple facilities for reducing event noise - de-duplication, adaptive thresholds, time-based correlation and so on.
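The idea of filtering close to the source can be sketched as follows. This is a simplified Python illustration, not the Operations Manager agent logic: the class name, the 'informational' severity label and the fixed suppression window are assumptions chosen for the example.

```python
class SourceFilter:
    """Decide at the monitored element whether an event is worth forwarding."""

    def __init__(self, window_seconds: float = 300.0):
        self.window = window_seconds
        # (node, message) -> timestamp of the last event actually forwarded
        self._last_sent = {}

    def should_forward(self, node: str, severity: str, message: str, now: float) -> bool:
        # Purely informational events never leave the element.
        if severity == "informational":
            return False
        key = (node, message)
        last = self._last_sent.get(key)
        # Suppress repeats of the same event inside the time window.
        if last is not None and now - last < self.window:
            return False
        self._last_sent[key] = now
        return True
```

With a 60-second window, a 'disk full' event at time 0 is forwarded, a repeat at 30 seconds is dropped locally, and a repeat at 61 seconds is forwarded again, so the central console still learns that the condition persists without receiving every duplicate.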

 

In some cases it is not possible to have smart monitoring done out at the IT element - network monitoring is a good example. SNMP remains one of the primary ways to get event notifications from network devices, and those notifications are usually sent with little 'intelligence'. In these kinds of situations the event storm reduction measures need to be taken in the element manager prior to forwarding events to the central event console. NNMi is a good example of an element manager that can do exactly this kind of event reduction.

 

Inevitably there will be some cases where many events arrive at the central event console. Some event storms are just not predictable. In these instances the central console needs to be able to perform event de-duplication to prevent the event browser from being overwhelmed. Operations Manager and OMi both provide facilities to handle these kinds of event storms.
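One common way to picture de-duplication at the central console is to keep a single browser entry per distinct event and count the repeats. The Python sketch below assumes that two events are duplicates when node and message text match; real products use their own, configurable duplicate-detection rules, which are not described here.

```python
from dataclasses import dataclass


@dataclass
class BrowserEntry:
    node: str
    message: str
    severity: str
    count: int = 1  # how many times this event has been received


class EventBrowser:
    """Hypothetical central browser that folds duplicates into one entry."""

    def __init__(self):
        self.entries = {}  # (node, message) -> BrowserEntry

    def receive(self, node: str, message: str, severity: str) -> BrowserEntry:
        key = (node, message)
        entry = self.entries.get(key)
        if entry is None:
            # First occurrence: create a new line in the browser.
            entry = BrowserEntry(node, message, severity)
            self.entries[key] = entry
        else:
            # Duplicate: bump the counter and keep the most recent severity.
            entry.count += 1
            entry.severity = severity
        return entry
```

A storm of a thousand identical 'link flapping' events from one switch then occupies one line in the browser with a count of 1000, rather than a thousand lines.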

 

HP Software has advocated the concept of an "active state" event browser for a long time. The basic idea is simple: if your monitoring tools can detect that a condition that generated an 'error' event has been rectified, then they can send an 'all clear' event. So in the disk space example above, once more disk space is made available the disk monitoring tool sends an all clear event for the disk free space issue. When the all clear event arrives, it acknowledges (clears) the error event from the active state event browser. This reduces the number of events that operators see and prevents staff from investing time in investigating issues that no longer exist. In essence it allows the event browser that operators see to more accurately reflect the current state of the monitored infrastructure.

It also has another benefit. In products like Operations Manager and OMi, which have active state event browsers, the status of the business service model in the console is a reflection of the current active events in the event browser. By automatically clearing events that are no longer current, the IT and Business service status shown in the Business Service model more accurately reflects the actual state of the monitored environment.
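The active state behaviour and its effect on service status can be sketched together. The Python below is an illustration under stated assumptions: an 'all clear' is modelled as an event with severity 'normal' for the same node and condition, and service status is taken as the worst severity among active events. The actual status propagation rules in Operations Manager and OMi are not covered in this post and may well differ.

```python
SEVERITY_RANK = {"normal": 0, "warning": 1, "major": 2, "critical": 3}


class ActiveStateBrowser:
    """Hypothetical browser that holds only events that are still current."""

    def __init__(self):
        # (node, condition) -> severity of the currently active event
        self.active = {}

    def receive(self, node: str, condition: str, severity: str) -> None:
        key = (node, condition)
        if severity == "normal":
            # An 'all clear' acknowledges the matching active event, if any.
            self.active.pop(key, None)
        else:
            self.active[key] = severity

    def service_status(self, nodes: set) -> str:
        """Worst severity among active events on the nodes behind a service."""
        worst = "normal"
        for (node, _condition), severity in self.active.items():
            if node in nodes and SEVERITY_RANK[severity] > SEVERITY_RANK[worst]:
                worst = severity
        return worst
```

In the disk space example, a 'critical' disk event on the server makes the service critical; when the all clear for the disk arrives, that event leaves the browser and the service status falls back to whatever other events are still active.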

 

All of these techniques are intended to reduce and refine the consolidated event stream that is presented to operations staff for their consideration. When event refinement, de-duplication and correlation are performed in a staged manner - starting at the monitored elements and successively applying the techniques described - the overall solution is highly scalable, network loading is reduced and operations staff are able to monitor large-scale infrastructure using a consolidated event view.

 

In my next blog post I'll describe how advanced correlation techniques can be applied to the events that actually make it into the consolidated event stream at the central event console. I'll specifically describe the concepts underpinning OMi Topology Based Event Correlation and explain how this enables incremental gains in operations efficiency and improved levels of service delivery to the business.
