OMi TBEC: incremental operational efficiency with minimal administrative burden

As promised, I'll summarize how HP Operations Manager i Topology Based Event Correlation addresses the need for advanced and up-to-date event correlation in dynamic IT environments.

 

The goal of OMi TBEC is to take the event stream generated by a multitude of cross-domain event monitoring technologies and to rationalize the events by helping Operations staff understand which events are causes and which are symptoms. Incidentally, I'm not going to say that OMi TBEC determines the "root cause", as that conflicts with my ITIL background and the actual definition of "root cause"… OMi TBEC helps to identify causal events - a causal event is the one at the detected starting point of a failure, which likely resulted in a whole bunch of other symptom events being generated by monitoring tools or applications.

 

The other task that OMi needs to do is to help Operations staff prioritize which causal events require the most urgent attention - which causal events are having, or are most likely to have, an impact on business services. To accomplish this OMi needs to be able to associate events with a model of the IT infrastructure that also includes definitions of the Business Services. This allows events to be associated with the Configuration Items (CIs) that they impact - and hence allows the Business Services that depend on those CIs to be identified. It's actually a bit more sophisticated than just associating events with CIs - propagation and Business Service criticality definitions, event types, health indicators and KPIs can all be used to assign a more 'real world' priority to the way that impact is defined, but they are beyond the scope of this blog post.
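To make the basic idea concrete, here is a minimal Python sketch of walking from an impacted CI up to the Business Services that depend on it. It is purely an illustration and not OMi code: the CI names, the SUPPORTS mapping and the impacted_services function are all hypothetical, and it ignores the propagation and criticality definitions mentioned above.

```python
from collections import deque

# Hypothetical model: each CI maps to the CIs that directly depend on it.
SUPPORTS = {
    "disk01": ["ora01"],
    "ora01": ["order-app"],
    "order-app": ["Online Ordering"],
}

# Hypothetical set of CIs that represent Business Services.
BUSINESS_SERVICES = {"Online Ordering"}


def impacted_services(ci):
    """Return the Business Services reachable from a CI via dependencies."""
    seen = {ci}
    queue = deque([ci])
    while queue:
        node = queue.popleft()
        for dependent in SUPPORTS.get(node, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return sorted(seen & BUSINESS_SERVICES)


print(impacted_services("disk01"))  # ['Online Ordering']
```

An event attached to "disk01" would therefore be shown as impacting the "Online Ordering" service in this toy model.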

 

The key point is that there is a model of the IT infrastructure and how it relates to the Business Services underpinning OMi. This model is referred to as the Run-time Service Model (RTSM) and it is fundamental to the way the OMi TBEC does event correlation and also to how OMi helps Operations staff to visualize impact.

 

The RTSM is based on HP's UCMDB technology - but it is NOT a Configuration Management Database, at least not in the traditional sense. The RTSM is specifically tuned to support operations management decision making. It primarily includes only those CIs that are under management - so a lot of the other 'depth' that you would find in a CMDB (stuff that is needed for change planning etc.) is excluded as it adds no value to the immediate tasks of Operations Management. Of course the RTSM can be federated to HP's CMS (the "CMDB" variant of our UCMDB) to provide operations staff with access to the detail if desired - or indeed to third party CMDBs such as Atrium.

 

So the RTSM is optimized to support Operations Management tasks. It is designed to be updated quickly with new discovery information as the IT infrastructure changes and flexes so that the model of the IT infrastructure and the Business Services is as up to date as possible (hence Run-time Service Model). To support this goal the RTSM needs great discovery technologies that are 'dynamic' and this is provided by a large number of products within the HP Business Service Management portfolio. Simply put, any HP monitoring technology that integrates with BSM is doing 'some' discovery…if it is monitoring an IT asset then it is discovering it - and they all update the RTSM in a timely manner…Operations Manager Smart PlugIns, SiteScope, Business Process Monitoring/End User Monitoring probes, Transaction Vision, NNMi, DDM etc.

 

The technologies which monitor dynamic IT infrastructure (clusters, virtualization) have been tuned to provide very rapid updates to the RTSM to keep the model current. Technologies which trace transactions will provide 'real time' updates to business transaction routing. All of these technologies use a common schema and terms of reference so their individual zones of discovery are seamlessly connected and combined in the RTSM to provide a very dynamic and comprehensive model from Business Services and transactions all the way down to the IT components that support them.

 

With the latest OMi 9.1 software we also provide the ability to consume discovery data, in addition to events, from third party event providers (e.g. Microsoft SCOM and Nagios) via Integration Adapters.

 

OK so we have a great dynamic model in the RTSM, we also have all of these events coming in from a wide variety of sources - Operations Manager, SiteScope, NNMi, BPM/EUM, third party event providers.

 

OMi TBEC's job is to bring together the two sets of information we have about our IT environment - the dynamic model in the RTSM and the events - to help understand what matters.

 

Within the RTSM, each CI has a "CI Type", for example "Oracle Database" or "Disk Partition", and each discovered instance is assigned its relevant CI Type when it is added as a CI. TBEC correlation rules are defined at the CI Type level. So we may have a rule that says that where a CI instance of CI Type "Disk Partition" receives an event indicating that there is a shortage of disk space, this may cause events to be generated for a CI instance of CI Type "Oracle Database" if the Oracle Database is dependent upon the Disk Partition.
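As a rough illustration of what "defined at the CI Type level" means, here is a small Python sketch. It is not the OMi rule format: the CI and CorrelationRule classes, the instance names and the DEPENDS_ON set are hypothetical stand-ins for the RTSM. The point is that the rule names only CI Types, and whether it applies to two particular instances is decided by the topology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CI:
    name: str      # hypothetical instance name, e.g. "disk01"
    ci_type: str   # e.g. "Disk Partition"


@dataclass(frozen=True)
class CorrelationRule:
    cause_ci_type: str
    cause_event: str
    symptom_ci_type: str


# One rule, written once, in terms of CI Types only.
RULE = CorrelationRule("Disk Partition", "shortage of disk space",
                       "Oracle Database")


def rule_applies(rule, cause_ci, symptom_ci, depends_on):
    """True if both CI Types match and the symptom CI depends on the cause CI.

    depends_on is a set of (dependent_name, dependency_name) pairs.
    """
    return (
        cause_ci.ci_type == rule.cause_ci_type
        and symptom_ci.ci_type == rule.symptom_ci_type
        and (symptom_ci.name, cause_ci.name) in depends_on
    )


disk = CI("disk01", "Disk Partition")
db_a = CI("ora01", "Oracle Database")
db_b = CI("ora02", "Oracle Database")
DEPENDS_ON = {("ora01", "disk01")}  # only ora01 lives on disk01

print(rule_applies(RULE, disk, db_a, DEPENDS_ON))  # True
print(rule_applies(RULE, disk, db_b, DEPENDS_ON))  # False
```

The second database has the right CI Type but no dependency on that disk partition, so the rule does not apply to it.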

 

As events arrive in OMi, TBEC associates each event with its CI instance and uses the topology in the RTSM to determine if any other events that have arrived recently may be related to the event. So we receive a "disk space issue" event and an "Oracle transaction log extension failure" event within a short time of each other (the sequence does not matter). TBEC can determine that the CI instances that the two events are associated with are 'connected' in the RTSM and that there is a TBEC rule for those CI Types which indicates that the "Oracle transaction log extension failure" event is a symptom of the "disk space issue" event.
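The matching described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the actual TBEC engine: the Event class, the TOPOLOGY and RULES sets, the CI names and the five minute WINDOW_SECONDS value are all hypothetical. It shows the three checks (a rule exists, the CIs are connected, the events are close in time) and that arrival order does not matter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    title: str
    ci: str        # hypothetical CI instance name
    ci_type: str
    time: float    # arrival time in seconds


# Hypothetical topology: (dependent CI, CI it depends on).
TOPOLOGY = {("ora01", "disk01")}

# Hypothetical rule: (cause type, cause event, symptom type, symptom event).
RULES = {
    ("Disk Partition", "disk space issue",
     "Oracle Database", "Oracle transaction log extension failure"),
}

WINDOW_SECONDS = 300  # assumed correlation window, not an OMi default


def correlate(a, b):
    """Return (cause, symptom) if the two events are related, else None."""
    if abs(a.time - b.time) > WINDOW_SECONDS:
        return None
    # Try both orderings so that arrival sequence does not matter.
    for cause, symptom in ((a, b), (b, a)):
        key = (cause.ci_type, cause.title, symptom.ci_type, symptom.title)
        if key in RULES and (symptom.ci, cause.ci) in TOPOLOGY:
            return cause, symptom
    return None


disk_evt = Event("disk space issue", "disk01", "Disk Partition", 100.0)
ora_evt = Event("Oracle transaction log extension failure",
                "ora01", "Oracle Database", 40.0)

cause, symptom = correlate(ora_evt, disk_evt)
print(cause.title)  # disk space issue
```

Here the Oracle event arrived first, yet the disk space event is still identified as the cause.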

 

Next we get an "application transaction commit failed" event. In the RTSM the application CI instance is connected to the Oracle database CI Instance. TBEC uses "correlation rule chaining" to relate that new event to the "disk space issue" event as another symptom. So we have ONE causal event ("disk space issue") and two symptoms - "Oracle transaction log extension failure" and "application transaction commit failed". Operations staff can clearly visualize in the OMi user interface which event is the causal event - and of course they also get clear visibility of the impact / priority of the events because the RTSM also shows the impacted Business Service(s).
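Correlation rule chaining can be pictured as following "is a symptom of" links until you reach an event that has no known cause. The Python sketch below is a simplification of that idea and not how OMi implements it: the DIRECT_CAUSE mapping and the causal_event function are hypothetical, the topology checks are assumed to have already passed, and I assume a link is only followed when the cause event has actually been received.

```python
# Hypothetical pairwise rules: symptom event -> its direct cause event.
DIRECT_CAUSE = {
    "Oracle transaction log extension failure": "disk space issue",
    "application transaction commit failed":
        "Oracle transaction log extension failure",
}


def causal_event(title, active):
    """Follow the chain of rules from an event to its causal event.

    active is the set of event titles received recently.
    """
    current = title
    seen = {current}
    while DIRECT_CAUSE.get(current) in active:
        current = DIRECT_CAUSE[current]
        if current in seen:  # guard against a cyclic rule set
            break
        seen.add(current)
    return current


active = {
    "disk space issue",
    "Oracle transaction log extension failure",
    "application transaction commit failed",
}

for title in sorted(active):
    print(title, "<-", causal_event(title, active))
```

All three events resolve to "disk space issue", even though no single rule links the application event directly to the disk partition.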

 

The combination of correlation rules defined between CI Type pairs and correlation rule chaining (across the entire topology in the RTSM) is a simple concept but extremely powerful.

 

Now circling back to discovery. As the various discovery technologies update the RTSM based on infrastructure changes (flexing virtualization, new servers and databases coming online) the existing TBEC rules are assigned to CI pairs based on their CI Types. New Oracle database instances are connected to their disk partitions (and applications) and TBEC rules are assigned. The constantly updating model in the RTSM also drives a constantly adapting set of TBEC correlation rules - with no administrative effort.
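Here is a small Python sketch of why no administrative effort is needed. Again this is only an illustration with hypothetical names (RULES, rule_assignments, the CI names): because the rule set is expressed in CI Types, the list of CI pairs it covers can simply be recomputed whenever discovery changes the model.

```python
# Hypothetical rule set: (cause CI Type, symptom CI Type). Never edited below.
RULES = {("Disk Partition", "Oracle Database")}


def rule_assignments(ci_types, depends_on):
    """Return the (cause CI, symptom CI) pairs covered by the rules.

    ci_types maps CI name -> CI Type.
    depends_on is a set of (dependent CI, CI it depends on) pairs.
    """
    return {
        (dependency, dependent)
        for dependent, dependency in depends_on
        if (ci_types[dependency], ci_types[dependent]) in RULES
    }


ci_types = {"disk01": "Disk Partition", "ora01": "Oracle Database"}
depends_on = {("ora01", "disk01")}
before = rule_assignments(ci_types, depends_on)

# Discovery reports a new database on a new disk partition.
ci_types.update({"disk02": "Disk Partition", "ora02": "Oracle Database"})
depends_on.add(("ora02", "disk02"))
after = rule_assignments(ci_types, depends_on)

print(sorted(before))  # [('disk01', 'ora01')]
print(sorted(after))   # [('disk01', 'ora01'), ('disk02', 'ora02')]
```

The new pair is covered as soon as the model knows about it, and nobody had to touch RULES.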

 

To ensure our customers get a quick time to value from OMi and TBEC, HP Software delivers a significant number of TBEC correlation rules out of the box as part of OMi "Content Packs". Content Packs actually include a lot more intellectual property than just TBEC rules - they include all of the configuration aspects needed to tie events to CIs, Health Indicators etc. HP aligns Content Packs with Smart Plug-Ins and Integration Adapters so the list of Content Packs includes all of the usual suspects - Oracle, SQL Server, Active Directory, Exchange Server, server infrastructure etc.

 

Creating your own TBEC rules is also very simple - either use a graphical wizard to associate CI Types and define a rule or (even easier) select events in the OMi Event Browser and create a rule indicating which is the causal event (OMi takes care of validating that the events are associated with CIs that are related in the RTSM model). Once again the focus is on delivering high Operator efficiency with minimal administrative overheads.

 

In summary, the key things to remember about OMi TBEC are as follows.

  1. The correlation is dynamic. As the model of the IT infrastructure and Business Services in the Run-time Service Model is updated by our discovery technologies the correlation adapts. E.g. as Virtual Guests move around between physical servers, the correlation is updated. NO effort is required by administrators to update the correlation to match infrastructure changes. This is unique!
  2. We support a wide variety of discovery technologies feeding the RTSM (Smart PlugIns, SiteScope, BPM/EUM probes, NNMi, DDM and third party providers etc.) so the model is complete and is updated in a timely manner with little administrator effort.
  3. Because the correlation adapts, the guidance that is provided to Operations staff is as up to date as possible. They can trust it and use it to drive their activities so operations staff are more effective and efficient.
  4. Administration effort is minimal, providing a high value advanced correlation technology that does not incur an untenable burden from an ongoing maintenance perspective.

 

Well, I hope you managed to stay with me to the end. That turned out to be a longer blog post than I had anticipated. As always, if you have questions or comments then we would love to hear from you.
