Infrastructure Management Software Blog

OH, IL, WI, IN, MI Operations Center Technical Roadshow - April 20th to April 29th - Don't miss it!

Ever wish you could talk face-to-face with more technical people about Operations Center and Network Management Center products? Don’t really have the time or budget to travel very far to do so? Well, here is a great opportunity to meet and talk with technical experts on products like Operations Manager and NNMi – right in your backyard.


Vivit will be hosting a series of six one-day sessions, with a nice mix of presentations and Q&A around these products.  The sessions will be held in the following states on the following days:


- Columbus, Ohio – April 20, 2010
- Orrville, Ohio – April 21, 2010
- Dearborn, Michigan – April 22, 2010
- Wisconsin – April 27, 2010
- Chicago, Illinois – April 28, 2010
- Fishers, Indiana – April 29, 2010


If you have any further questions about this roadshow, feel free to contact me at asksonja@hp.com.


Do you want to get rid of your event consoles?

A guest post by Mike Shaw.
- Peter


Are you in the "event consoles will live forever" camp or the "we want to get rid of event consoles as soon as possible" camp?


I ask because the world seems to be divided into two camps.


With events from top-down performance monitoring (user experience or business transaction performance tracking), it's relatively easy. You get a KPI violation and you turn the event into a ticket because you know it's service-affecting – you are, after all, monitoring at the business service level. Once the incident has been created, however, you need to correlate against events to understand what is causing the business service to have problems (more on that from Michael Procopio in his posts on Problem Isolation).


But what about events from below? Let's imagine you have a SAN, which is used by an Active Directory server, which is used by MS Exchange. Let's imagine the SAN has a problem. This generates an event for the SAN. The ADS complains and throws an event. The Exchange server throws an event too. We can do a number of event-processing things automatically:


We can group related events and figure out which is the causal event. So, in our little example, we can infer that the SAN, ADS and Exchange events are an interrelated group and that the SAN event is the cause of all the trouble.
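The grouping logic can be sketched in a few lines. This is a purely hypothetical illustration (the component names and dependency map are invented; it is not the product's implementation): treat an event as a likely cause if none of the components it depends on, directly or indirectly, are also alerting.

```python
# Hypothetical sketch of topology-based event grouping: given a map of
# "component -> components it depends on", pick the event in a burst whose
# component has no alerting dependency - that one is the likely cause.

# Each component lists the components it depends on (invented example).
DEPENDS_ON = {
    "exchange": ["ads"],   # Exchange uses Active Directory
    "ads": ["san"],        # Active Directory uses the SAN
    "san": [],             # the SAN depends on nothing we monitor
}

def transitive_deps(component, topo):
    """All components that `component` depends on, directly or indirectly."""
    seen = set()
    stack = list(topo.get(component, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(topo.get(dep, []))
    return seen

def find_causal_events(events, topo):
    """Events whose component has no alerting dependency: likely causes."""
    alerting = {e["component"] for e in events}
    return [e for e in events
            if not (transitive_deps(e["component"], topo) & alerting)]

events = [
    {"component": "exchange", "text": "Mailbox store unresponsive"},
    {"component": "ads", "text": "Directory lookups timing out"},
    {"component": "san", "text": "LUN offline"},
]

causes = find_causal_events(events, DEPENDS_ON)
print([e["component"] for e in causes])  # -> ['san']
```

The Exchange and ADS events are excluded because something they (transitively) depend on is also alerting; only the SAN event survives as the causal candidate.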


We can work out the business service impact – we can look upwards in the service dependency map and figure out that the Exchange server is used by N users. We can understand the SLAs resting on the Exchange server and how close we are to jeopardy on those SLAs. We can even figure out if any business processes are using that Exchange server. In other words, we can automatically infer the business impact of our event stormlet.
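The upward walk can be sketched the same way (again with invented names and user counts, not actual product code): invert the dependency map and climb the "who uses this?" edges from the causal component until business services are reached.

```python
# Hypothetical sketch of upward impact analysis: invert the dependency map
# and walk "who uses this?" edges from the causal component to find every
# business service (and its assumed user count) potentially affected.

DEPENDS_ON = {
    "exchange": ["ads"],
    "ads": ["san"],
    "email-service": ["exchange"],  # business service at the top
}
USERS = {"email-service": 4000}     # assumed user counts per business service

def used_by(topo):
    """Invert the map: component -> components that depend on it."""
    rev = {}
    for comp, deps in topo.items():
        for d in deps:
            rev.setdefault(d, []).append(comp)
    return rev

def impacted_services(causal, topo, users):
    rev = used_by(topo)
    seen, stack, impact = set(), [causal], {}
    while stack:
        c = stack.pop()
        if c in seen:
            continue
        seen.add(c)
        if c in users:              # reached a business service
            impact[c] = users[c]
        stack.extend(rev.get(c, []))
    return impact

print(impacted_services("san", DEPENDS_ON, USERS))  # -> {'email-service': 4000}
```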


So, we have enough to raise a ticket, don't we? We can raise the ticket against the top-level affected business service(s), and we know that we are probably not raising tickets unnecessarily because we have grouped events and found the causal event. And because we know the causal event, we know where the ticket should be routed.


Or am I being unrealistic? Are we not there yet – do you not trust the system to automatically raise tickets when you have a bottom-up event storm?


I would love to hear your views. Right now, my highly unscientific research has found that Europeans (home of ITIL?) are more in favor of making the ticket the King/Queen and only using the event console to figure out what happened once the ticket has been raised.
One company wants to go even further. They want to get to the situation where they always automatically raise a ticket and then automatically route the ticket to the appropriate expert 2nd-level group. Now, they may get the 2nd-level allocation wrong sometimes, but if they do, they simply reroute to a 1st-level triage group who manually re-investigate where the ticket should go. This is "manual triage by exception", if you like.
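That "manual triage by exception" policy is easy to picture in code. The routing table here is purely illustrative, not anyone's real configuration: every causal event raises a ticket and is auto-routed to a 2nd-level group, and anything we cannot classify falls back to 1st-level triage.

```python
# Sketch of "manual triage by exception" (hypothetical routing table):
# auto-route causal events to an expert 2nd-level group; unknown component
# types fall back to a 1st-level triage queue for manual re-routing.

ROUTING = {            # assumed mapping of component type -> expert group
    "san": "storage-team",
    "router": "network-team",
    "exchange": "messaging-team",
}

def raise_ticket(event):
    # The exception path: components we cannot classify go to 1st-level triage.
    group = ROUTING.get(event["component"], "l1-triage")
    return {"summary": event["text"], "assigned_to": group}

print(raise_ticket({"component": "san", "text": "LUN offline"}))
# -> {'summary': 'LUN offline', 'assigned_to': 'storage-team'}
print(raise_ticket({"component": "tape-robot", "text": "arm jammed"}))
# -> {'summary': 'arm jammed', 'assigned_to': 'l1-triage'}
```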


For HP Business Service Management, Mike Shaw.

Event Correlation: OMi TBEC and Problem Isolation - What's the difference (part 3 of 3)

If you have not done so already, you may want to start with part 1 in this series.
http://www.communities.hp.com/online/blogs/managementsoftware/archive/2009/09/25/event-correlation-omi-tbec-and-problem-isolation-what-s-the-difference-part-1-of-3.aspx


Read part 2 in the series.
http://www.communities.hp.com/online/blogs/managementsoftware/archive/2009/09/25/event-correlation-omi-tbec-and-problem-isolation-what-s-the-difference-part-2-of-3.aspx



This is the final part in my three-post discussion of the event correlation technologies OMi Topology Based Event Correlation (TBEC) and Problem Isolation. I've been focusing on how TBEC is used and how it helps IT Operations Management staff be more effective and efficient.


In my last post I started to mention why End User Monitoring (EUM) technologies are important - because they are able to monitor business applications from an end user perspective. EUM technologies can detect issues which Infrastructure monitoring might miss.


 


In the example we worked through in the last post I mentioned how EUM can detect a response time issue and alert staff that they need to expedite the investigation of an ongoing incident. This is also where Problem Isolation helps. PI provides the most effective means to gather all of the information that we have regarding possible causes of the response time issue and analyze the most likely cause.


 


For example: our web-based ordering system has eight load-balanced web servers connected to the internet. These are where our customers connect. The web server farm communicates back to application, database and email servers on the intranet, and the overall system allows customers to search and browse available products, place an order, and receive email confirmation of order and shipping status.


 


The event monitoring system includes monitoring of all of the components. We also have EUM probes in place running test transactions and evaluating response time and availability. The systems are all busy but not overloaded - so we are not seeing any performance alerts from the event monitoring system.


 


A problem arises with two of our eight web servers, and they drop out of the load balanced farm. The operations bridge can see that the problem has happened as they receive events indicating the web server issues. TBEC shows that there are two separate issues, so this is not a cascading failure – and the operations staff can see that these web servers are part of the online ordering service.


 


However, they also know that the web servers are part of redundant infrastructure and there should be plenty of spare capacity in the six remaining load balanced web servers. As they have no other events relating to the online ordering service, they decide to leave the web server issues for a little while as they are busy dealing with some database problems for another business service.


 


The entire transaction load that would normally be spread across eight web servers is now focused on the remaining six. They were already busy but are now being pushed even harder – not enough to cause CPU utilization alerts, but enough to increase the time it takes them to process their component of the customer’s online ordering transactions. As a result, response time, as seen by customers, is terrible. The Operations Bridge is unaware, as they see no performance alerts from the event management system.


 


EUM is our backstop here; it will detect the response time issue and raise an alert. This alert – indicating that the response time for the online ordering application is unacceptable – is sent to the Operations Bridge.
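The probe idea itself is simple. A minimal sketch (the threshold and the test transaction are assumptions for illustration, not real EUM internals) is just timing a synthetic transaction against a response-time budget:

```python
# Sketch of a synthetic-transaction (EUM-style) probe: time a test
# transaction and raise an alert when response time crosses a threshold,
# even though no individual server is reporting a resource problem.
import time

THRESHOLD_SECONDS = 2.0  # assumed response-time budget for the transaction

def run_probe(transaction, threshold=THRESHOLD_SECONDS):
    start = time.monotonic()
    transaction()                       # e.g. place a test order end-to-end
    elapsed = time.monotonic() - start
    if elapsed > threshold:
        return {"alert": "response time unacceptable", "elapsed": elapsed}
    return None                         # within budget: no alert

# Simulated slow transaction for illustration:
alert = run_probe(lambda: time.sleep(0.01), threshold=0.005)
print(alert["alert"])  # -> response time unacceptable
```

Because the probe measures the whole transaction from the outside, it catches the degradation even when every individual server is below its own alert thresholds.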


 


The Operations Bridge team now know that they need to re-prioritize resources to investigate an ongoing business service impacting issue. And they need to do this as quickly as possible. They need to gather all available information about the affected business service and try to understand why response time has suddenly become unacceptable. This is where Problem Isolation helps.


 


PI works to correlate more than just events. It will pull together data from multiple sources - performance history (resource utilizations), events, even help-desk incidents that have been logged and work to determine the likely issue.
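One simple way to picture this kind of multi-source analysis (a toy sketch under stated assumptions, not how PI is actually implemented): give each candidate component one vote per data source that implicates it, then rank by votes.

```python
# Toy sketch of Problem-Isolation-style evidence gathering: score each
# candidate component by how many independent data sources (events,
# performance history, help-desk incidents) implicate it.
from collections import Counter

def rank_causes(events, perf_anomalies, incidents):
    score = Counter()
    for source in (events, perf_anomalies, incidents):
        for component in set(source):   # one vote per source per component
            score[component] += 1
    return score.most_common()

# Invented example data:
ranked = rank_causes(
    events=["web-07", "web-08"],          # components with open events
    perf_anomalies=["web-01", "web-07"],  # rising response-time history
    incidents=["web-07"],                 # user-reported slowness tickets
)
print(ranked[0])  # -> ('web-07', 3): implicated by all three sources
```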


 


So we've come full circle. I spent a lot of time talking about OMi and events and how an Operations Bridge is assisted by TBEC. But it's not the one and only tool that you need in your bag. Technologies like EUM and PI help catch and diagnose all of the stuff that just cannot be detected by 'simply' (I use that term loosely) monitoring infrastructure.


 


Once again if you want to understand PI better I encourage you to take a look at the posts by Michael Procopio over on the BAC blog.



For HP Operations Center, Jon Haworth.

Everything you wanted to know about OMi... (Q&A from Vivit technical webinar)

Thank you to everyone who attended the Vivit webinar. The recording is now available for viewing on Vivit’s web site. You can also download or view the presentation slides in PDF format. There were many questions from the audience. Jon Haworth and Dave Trout's answers appear below. I have grouped questions by topic.


Product Structure

Are these three different modules (topology, event and service views) to be purchased separately? Yes, three different modules. OMi Event Management Foundation is the base product and is a requirement before either of the other two products can be installed. OMi Health Perspective Views and OMi Topology Based Event Correlation are optional modules.
How is the licensing done? There are three separate OMi modules. OMi Event Management Foundation is the base product and is a requirement before either of the other two products can be installed. OMi Health Perspective Views and OMi Topology Based Event Correlation are optional modules. Each module is priced / licensed separately and the pricing model is 'flat' - you purchase the license(s) required and that is all (no CPU or tier or connection based pricing).
How does that scale to thousands of machines? Since we have just introduced OMi, we don't yet have a lot of "real" scalability data to report. However our internal testing so far indicates that OMi can handle the typical event rates handled by OMW/OMU in terms of forwarding events. Like OM today, the scalability of the total solution is not so much limited by how many thousands of machines are being managed but on the total event rate being handled.


Integration with Operations Manager, BAC, UCMDB

Is there any description of the interface between OM and OMi? There are two interfaces used: 1) message forwarding from OM to OMi, and 2) a Web Services interface for message changes and topology synchronization.
How is the integration with Operations Manager on Unix? As mentioned during the webinar, OMi requires either OMU or OMW as the event consolidation point for forwarding events into OMi. The event forwarding is configured in OM exactly the same way as if forwarding to another OM server. For message updates and topology synchronization, a Web Services interface is used.
Since it was mentioned it works with both OMU 9.0 and OMW 8.10, does it work with the mentioned SPIs on both platforms? Yes. We are updating the SPIs to be "OMi ready". What this really means is that we're adding a little extra information to the event messages (via Custom Message Attributes) to make it easier for OMi to associate a message with the correct CI in the UCMDB and to include specific indicators needed for the TBEC rules in OMi. For OMU 9 we will release some updated SPIs soon which include enhanced discovery – very similar levels of discovery to what OMW has. The discovery subsystem is an area that we enhanced in OMU 9, and we want to be able to use the SPI discovery data as the starting point for populating and maintaining CI and relationship information in the UCMDB – which is what helps to drive the logic in OMi.
How flexible is the integration with BAC products? Are these factory-built, and is the factory needed to modify them for target environment requirements? OMi and BAC use the same UCMDB instance, so they are tightly integrated out of the box. OMi is completely built on top of the BAC platform technology. It supports the same security mechanisms, the same HA configuration options, the same user/group definitions, etc. In short, OMi is just like any other BAC "application" that is leveraging the platform.
In the installation guide, it says that one of the requirements is to install the "BSM platform". What exactly do you understand on "BSM platform"? BSM platform means "BAC". OMi 8.10 requires BAC 8.02 as the BSM platform.
Can you run OMi without BSM? No, the BSM platform provides the user interface 'framework' and the runtime UCMDB. OMi plugs into the BSM foundation.
Which security model will take precedence - OMU responsibility matrix or the BAC security for views? OMi security is entirely based on the BAC platform features. Access to OMi views, admin UIs, etc. is all controlled through the standard BAC security features (users/groups, roles, permissions, etc.)
What is the price policy if you have / have not BAC already installed? Having BAC installed makes no difference to the price. OMi includes all components needed (runtime UCMDB etc.) in the license. Pricing is based on a 'flat' price for each of the three modules (see earlier question). You need to contact your local HP sales representative to obtain local pricing.
Does the CI tree view scale? The CI tree view is basically a UCMDB view/TQL under the covers. TQLs in UCMDB are tuned for very efficient retrieval of CI information.


Integration with Ticketing Systems (Service Manager, Service Center)

How does OMi interact with a ticketing system like Service Manager or Service Center? Will the CI's health be reflected based on ticket info? In this first release of OMi, there is no direct interaction with a ticketing system. The interaction is driven through the existing OM (OMW or OMU) to Service Manager / Service Center interface. Because OMi synchronizes message changes back to the OM server that it is connected to, trouble tickets can be triggered from that OM server.
How does this interface to Service Manager 7? The interface to SM 7 is driven through the existing OM (OMW or OMU) interface to Service Manager. Because OMi synchronizes message changes back to the OM server that it is connected to, trouble tickets can be triggered from that OM server.
The slides implied "assignment" which looked similar to NNMi. How do the new features of OMi integrate to Service Manager? The concept of assignment is 'internal' to OMi. In many organizations the tier 1 support personnel will deal with non Business Service impacting issues without raising a trouble ticket. NOTE: this is purely dependent on the individual process and organization structure that is selected, we know that a lot of companies work this way to minimize the number of TTs. Some organizations insist that every actionable 'incident' becomes a TT. Where an event is dealt with in OMi then assignment makes sense, where events are forwarded to SM7 or another TT system then assignment will likely take place in the Incident / Helpdesk system.
Will OMi integrate with ITSM (the change management app from Front Range)? Also, I'm assuming that we will need to purchase the CMDB for event correlation regardless – is that true? We cannot comment on the Front Range application. An integration may be possible, but it would be wise to verify with the vendor what external interfaces they provide for integrating event management systems with their product. No, you do not need to purchase UCMDB – we provide a 'free' runtime with OMi.


UCMDB, Discovery and Smart Plug-Ins (SPIs)

Is it necessary to have UCMDB to have OMi? OMi ships with a "BAC 8.02" media kit. This actually provides the BSM Platform – including UCMDB – and is licensed using your OMi license key. If you do not have an existing UCMDB, then this will provide a runtime UCMDB as part of the OMi product package. If you have an existing BAC 8.02 installed (which includes UCMDB), then you can utilize that for OMi.
Is discovery best done in OMi or uCMDB? All discovery data is maintained in the UCMDB. The 'base' discovery for OMi will be provided by the Smart PlugIns that have been deployed from the OMW or OMU instance that OMi is connected to. Additional discovery data can be added to the UCMDB - for example from NNMi or DDM - and OMi will make use of this discovery data if it exists.
If using DDM for discovery, DDM-Advanced is recommended since it can discover not only hosts but also applications and their relationships.
Can you please tell me if DDMi can be used as a feed? Yes. Servers discovered by DDMi are inserted into UCMDB. However be aware that DDMi does not discover applications and dependencies/relationships. DDM-Advanced is the recommended discovery approach if you plan to use OMi and leverage the TBEC rules in particular.
If UCMDB already has CIs populated by DDM, would new sources like NNMi and SPIs conflict with them? In other words, do we need a clean UCMDB? No. A clean UCMDB is not required. OMi is designed to work with CIs regardless of how they are discovered and inserted into the UCMDB. In general, reconciliation of CIs discovered from multiple sources is handled automatically.
Can you clarify what you mean by "we are including these SPIs"? Does this mean it's part of the shrink wrap deliverable with OMi?  What specifically will the virtualization SPI provide?  We were considering another product for that space, but want to hear more about those capabilities. We are not including SPIs with OMi. We are including pre-defined content (event type indicator mappings, health indicators, TBEC correlation rules) for the SPIs that we noted. If you have these SPIs deployed then the time to value when OMi is deployed will be very quick. HP released a SPI for Virtualized Infrastructure monitoring earlier this year. Initial focus is on VMware but we will be providing an update soon with more features. You can contact your HP Software Sales Representative to get more details of the specific functionality provided.
What is the virtualization SPI? Is it the nWorks SPI? No. HP released the Smart Plug-in for Virtualized Infrastructure early in 2009. This is an HP-developed and HP-marketed product.
nWorks is the "SPI" we were considering. This is a different SPI and is based on a different architecture (agentless polling). It has no OMi content at present, and it will be the responsibility of nWorks / Veeam to provide this.


KPIs (Key Performance Indicators)

What is a KPI? KPI means Key Performance Indicator.
Where do you define the KPIs? OMi provides four KPIs to the BAC platform: Operations Performance, Operations Availability, Unresolved Events, Unassigned Events. These are defined by OMi, not by users. What is configurable is which Health Indicators (HIs) are assigned to impact either the Operations Performance or Operations Availability KPI for specific CI types. This is done using the Indicator Manager in OMi.
If the difference is the KPIs, why is the data not collected from PM? Instead I see that the data is collected from OVPA and OV agents. OMi is focused around event processing. Events (alerts) are collected from OVPA and OV agents to enable operations staff to understand what needs to be fixed. PM (Performance Manager) is one tool that can be used to assist in the analysis and diagnosis of performance problems. PM is actually integrated into the OMi user interface.


Topology-Based Event Correlation (TBEC)

In the slide with "Carol" and "Bill", they applied their knowledge to (I guess) develop some rules? Is that work that still has to be done manually? What were they developing – KPIs? No, not KPIs. The example is there to show how TBEC rules are simple to create, but that the correlation engine chains them together to provide quite complex correlation logic which adapts to the topology that has been discovered. We (HP) are providing content (Event Type Indicators, Health Indicators, and TBEC rules as per "Carol and Bill") for a number of our existing Operations Manager Smart Plug-ins with OMi, and we will continue to add content moving forward. You would only need to undertake this process yourself where HP does not provide out-of-the-box content with OMi.
I have some questions regarding TBEC: is there any experience regarding its performance? How many events can be handled by the correlation engine per second? The engine is tuned for very high performance. It is basically the same engine that is used in NNMi for correlations.
With topology synchronization with NNMi, do you have to have OMi licenses for every node in NNMi as well? I.e., if you are using topology synchronization with NNMi, will it only show the nodes from NNMi that have OMi agents installed? No. All CIs in the UCMDB are visible to OMi. No additional license costs are required for NNMi nodes which are added to the UCMDB.
Which language is used for the correlation rules? And where are the rules defined ? (UCMDB?) TBEC is configured in the OMi Correlation Manager GUI, there is no programming language involved. The rules are based on topology (a View from the UCMDB) and on specific Health Indicators with specific HI values.
Does OMi support the execution of validation routines when closing an Alert/Event that also closes other related items? Not currently out of the box. There are several configurable settings which affect TBEC behavior (e.g. correlation time window, automatic extension of time windows, etc.), but currently this is not one of them. We are considering additional options for the future.


OMi Features

Scalability, High Availability Cluster Support?  Estimated max seats before going distributed? OMi supports the same cluster/HA features as supported by BAC. For example, you can have multiple gateway servers connected to a clustered Data Processing Server and a remote database server. In this case, OMi software is installed on each of these separate servers (gateways and DPS). In general, the "max seats before going distributed" (i.e. adding gateway servers) would be driven by the same considerations as documented for BAC itself. More information specific to OMi environments will be available over time as we have a chance to do further testing and characterization.
Does OMi have a reports generator showing things like daily TBEC, etc.? Not currently. However the BAC reports (e.g. KPIs over Time) can be used to look at how the OMi KPIs are changing over time on CIs.
Comment: We feel that most of these features being discussed in OMi should have been an upgrade to OMW. Too many modules to buy and try to integrate ourselves. For example, we wanted a better version of the OVOWeb to come as an upgrade in OMW 8.1. Too many products to buy just to manage our network. OMi is providing discrete and incremental value above and beyond what is provided in OMW or OMU. We are continuing to enhance both OMW and OMU (for example, the recent release of OMU 9.0), and customers who are happy with the capabilities of these platforms can continue to move forward and take advantage of the enhancements that we are providing. There is no requirement to move to OMi.
We feel we are being charged for features that were supposed to be in products that we already purchased. We are not happy about the tactic of releasing new products to fix features that were advertised in prior software. As a consultant, even I get lost in the vast amount of monitoring tools being sold by HP. OMi is providing discrete and incremental value above and beyond what is provided in OMW or OMU. This functionality was never offered as part of OMW or OMU – it is new and unique to OMi. The reality is that it would have been extremely difficult and time-consuming (slow to release) to provide the high-value capabilities of OMi within OMW or OMU. The strategy we have chosen is to base these new capabilities on a 'clean' build using contemporary technologies – but HP has specifically ensured that existing OM customers who wish to take advantage of these new capabilities can do so without having to disrupt their existing OM installation.
I had some issue when trying to setup and run the synchronization tool and event forwarding. Who can I contact? You should contact your normal HP support channel for assistance.


Other

Is there an estimated time line for detailed technical training on OMi? We have just run a series of virtual instructor led training sessions for our partners. HP Education Services will be releasing an OMi class in the near future.
Where can I get an evaluation version of OMi? You can request a DVD from the trial software web site. A download will be available at http://www.hp.com/go/omi soon.


 


For HP Operations Center, Peter Spielvogel.


Get the latest updates on our Twitter feed @HPITOps http://twitter.com/HPITOps


Join the HP OpenView & Operations Management group on LinkedIn.



 

Driving down OpEx with technology

It's been a good number of months since I traveled in Europe but I just spent a couple of weeks hopping around on business. There were some interesting changes to the travel experience which got me thinking about parallels in the IT Operations world.
 
First was the check-in experience with Air France / KLM. Basically self-service check-in at a kiosk, with a bag drop. "Nothing new there!" I hear you say. Well what was new was that it was not optional - at least not as far as I could see. There were no check-in agents at desks - only a few floor-walkers minding the kiosks and folks on the bag drop. Obviously this represents a significant cost saving, but that is not my point.


We've all seen this stuff deployed in airports for a while but airlines have been cautious about forcing customers to use it. That was not the case here. All of the regular economy class customers had to use the automated check-in. I guess the airline has overcome (or over-ridden) any fears about the effectiveness of the technology or the danger of impacting the service provided to the customers in favor of some tangible reductions in OpEx.
 
At the other end of the trip I also saw changes in hotel check-out. Now, the US has been pretty good at speeding check-out by providing express check-out services. You know the deal: your bill is posted under your door at some horribly early hour of the morning. Your credit card is charged the amount shown unless you decide to go and pursue the regular check-out service. Of course the hotel also benefits because they can deploy fewer staff to service the check-out transactions. Europe has not adopted this approach. I'm not sure why, but I guess it may be due to differences in legislation regarding charging someone's credit card without them being present.
 
So the challenge for European hotels is how to maintain quality of service (a rapid check-out) but reduce the OpEx of having lots of staff present during the check-out rush hour in the mornings. A Pullman Hotel in Paris had applied technology to solve this problem. As I headed into the lobby to check out on Friday morning I was ushered towards a bank of (you guessed it) kiosks. The experience was a carbon copy of the airport check-in - just different cards being provided by me and documents being printed by the kiosk. After I had been 'processed' I sat in the lobby waiting for some colleagues to join me. There were three check-out staff that I could see - one on a regular desk duty, two floor-walking the kiosks. The Pullman is a big hotel.
 
Then I started thinking about how the airlines and the hotel were displaying behaviors which are very similar to what we are seeing in the IT Operations space. The drive is to reduce OpEx whilst maintaining service levels. The approach adopted is to use technology to automate activities which have required manual interactions.
The technology is being put into service in spite of any misgivings over its ability to be 100% effective. The companies are willing to take some calculated risks in order to get a demonstrable reduction in OpEx.
 
I see the same behaviors in IT Operations. Forward looking companies are applying technology to automate wherever possible - automate event correlation, automate analysis and problem isolation, automate fixes, automate provisioning. The technology to do a lot of this has been around for years, but previous objections to its deployment - fears over the certainty that the technology will be 100% effective - are being pushed aside as the sights are firmly set on reducing OpEx.
 
For Operations Center, Jon Haworth

Automated Infrastructure Discovery - Extreme Makeover

Good Discovery Can Uncover Hidden Secrets
Infrastructure discovery has something of a bad reputation in some quarters. We've done some recent surveys of companies utilizing a variety of vendors’ IT operations products. What's interesting is that, in our survey results, automated infrastructure discovery fared pretty badly in terms of the support that it received within organizations - and also in terms of the success that they believed they had achieved.
 
There are a number of reasons underlying these survey results. Technology issues and organizational challenges were highlighted in our survey. But I believe that one of the main issues discovery has is that people have lost sight of its basic value and the benefits it can bring. Organizations see wide-reaching discovery initiatives as complex to implement and maintain – and they do not see compelling short-term benefits.
 
I got to thinking about discovery and the path that it has taken over the last 15 or 20 years. I remember the excitement when HP released its first cut of Network Node Manager. It included discovery that showed people things about their networks that they just did not know. There were always surprises when we took NNM into new sites to demonstrate it. Apart from showing folks what was actually connected to the network, NNM also showed how the network was structured, the topology.
 
Visualization --> Association --> Correlation
And once people can see and visualize those two sets of information they start to make associations about how events detected in the network relate to each other - they use the discovery information to optimize their ability to operate the network infrastructure.
 
So the next logical evolution for tools like NNM was to start building some of the analysis into the software as 'correlation'. For example, the ability to determine that the 51 'node down' events you just received are actually just one 'router down' event and 50 symptoms generated by the nodes that are 'behind' the router in the network topology. Network operators could ignore the 'noise' and focus on the events that were likely causes of outages. Pretty simple stuff (in principle) but very effective at optimizing operational activities.
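To make the 'router down' example concrete, here is a minimal sketch of that kind of topology-based suppression. The topology model and event format are invented for illustration, not NNM's actual data model: each node simply records the router that sits 'in front' of it.

```python
# Toy topology: each node maps to the router it sits behind.
# All names here are illustrative.
UPSTREAM_ROUTER = {f"node-{i}": "router-1" for i in range(50)}

def correlate(events):
    """Split 'down' events into likely causes and suppressed symptoms."""
    down = {e["source"] for e in events}
    causes, symptoms = [], []
    for e in events:
        upstream = UPSTREAM_ROUTER.get(e["source"])
        if upstream and upstream in down:
            symptoms.append(e)   # unreachable only because its router is down
        else:
            causes.append(e)     # no failed upstream device: likely cause
    return causes, symptoms

# 51 'down' events arrive: 1 router plus the 50 nodes behind it.
events = [{"source": "router-1", "type": "down"}] + \
         [{"source": f"node-{i}", "type": "down"} for i in range(50)]
causes, symptoms = correlate(events)
# causes holds just the router event; the 50 node events are symptoms
```

The operator's console would then show one actionable event instead of fifty-one.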
 
Scroll forward 15 years. Discovery technologies now extend across most aspects of infrastructure and the use cases are much more varied. Certainly inventory maintenance is a key motivator for many organizations - both software and hardware discovery play important roles in supporting asset tracking and license compliance activities. Not hugely exciting for most Operational Management teams.
 
Moving Towards Service Impact Analysis
Service impact analysis is a more significant capability for Operations Management teams and is a goal that many organizations are chasing. Use discovery to find all my infrastructure components - network devices, servers, application and database instances - and tie them together so I can see how my Business Services are using the infrastructure. Then, when I detect an event on a network device or database I can understand which Business Services might be impacted and I can prioritize my operational resources and activities. Some organizations are doing this quite successfully and getting significant benefits in streamlining their operational management activities and aligning them with the priorities of the business.
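The paragraph above describes a lookup that discovery makes possible: component-to-service relationships, queried at event time. A hypothetical sketch, with an invented service map (the real relationships would come from a discovery-populated CMDB):

```python
# Hypothetical component-to-service map, as discovery might populate it.
SERVICE_MAP = {
    "db-orders-01": ["Online Ordering", "Reporting"],
    "web-front-02": ["Online Ordering"],
    "mail-srv-01":  ["Email"],
}

def impacted_services(event_source):
    """Return the Business Services potentially impacted by a component event."""
    return SERVICE_MAP.get(event_source, [])

# An event on the orders database threatens two Business Services,
# so it gets prioritized ahead of the mail server event.
print(impacted_services("db-orders-01"))  # ['Online Ordering', 'Reporting']
print(impacted_services("mail-srv-01"))   # ['Email']
```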
 
But there is one benefit of discovery which seems to have been left by the side of the road. The network discovery example I started with provides a good reference. Once you know what is 'out there' and how it is connected together then you can use that topology information to understand how failures in one part of the infrastructure can cause 'ghost events' - symptom events - to be generated by infrastructure components which rely in some way on the errant component. When you get 5 events from a variety of components - storage, database, email server, network devices - then if you know how those components are 'connected' you can relate the events together and determine which are symptoms and which is the likely cause.
 
Optimizing the Operations Bridge
Now, to be fair, many organizations understand that this is important in optimizing their operational management activities. In our survey, we found that many companies deploy skilled people with extensive knowledge of the infrastructure into the first level operations bridge to help make sense of the event stream - try to work out which events to work on and which are dead ends. But it's expensive to do this - and not entirely effective. Operations still end up wasting effort by chasing symptoms before they deal with the actual cause event. Inevitably this increases mean time to repair, increases operational costs and degrades the quality of service delivered to the business.
 
So where is the automation? We added correlation to network monitoring solutions years ago to help do exactly this stuff, so why not do 'infrastructure-wide' correlation?
 
Well, it's a more complex problem to solve of course. And there is also the problem that many (most?) organizations just do not have comprehensive discovery across all of their infrastructure. Or if they do have good coverage it's from a variety of tools so it's not in one place where all of the inter-component relationships can be analyzed.
 
Topology Based Event Correlation - Automate Human Judgment
This is exactly the problem which we've been solving with our Topology Based Event Correlation (TBEC) technology. Back to basics - although the developers would not thank me for saying that, as it's a complex technology. Take events from a variety of sources, do some clever stuff to map them to the components in the discovery database (discovered using a number of discrete tools) and then use the relationships between those components to automatically do what human operators are trying to do manually - indicate the cause event.
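The core of that idea - walk the dependency relationships between discovered components and flag events whose dependencies are also failing - can be sketched briefly. The data model and names below are invented for illustration; HP's actual TBEC implementation is considerably more sophisticated.

```python
# Invented 'depends-on' relationships between discovered components.
DEPENDS_ON = {
    "email-server":  ["db-instance", "storage-array"],
    "db-instance":   ["storage-array"],
    "storage-array": [],
}

def indicate_cause(events):
    """Label each event 'cause' or 'symptom' using the dependency topology."""
    failed = {e["component"] for e in events}
    verdict = {}
    for e in events:
        deps = DEPENDS_ON.get(e["component"], [])
        # If something this component depends on is also failing,
        # this event is most likely a symptom of that failure.
        verdict[e["component"]] = (
            "symptom" if any(d in failed for d in deps) else "cause"
        )
    return verdict

events = [{"component": c} for c in
          ("email-server", "db-instance", "storage-array")]
print(indicate_cause(events))
# storage-array is the only 'cause'; the other two are symptoms
```

The operator sees one cause event to act on, rather than three to triage manually.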
 
Doing this stuff automatically for network events made sense 15 years ago; doing it across the complexity of an entire infrastructure makes even more sense today. It eliminates false starts and wasted effort.
 
This is a 'quick win' for Operational Management teams. Improved efficiency, reduced operational costs, free up senior staff to work on other activities… better value delivered to the business (and of course huge pay raises for the Operations Manager).
 
So what do you need to enable TBEC to help streamline your operations? Well, you need events from infrastructure monitoring tools - and most organizations have more than enough of those. But you also need infrastructure discovery information - the more the better.
 
Maybe infrastructure discovery needs a makeover.

 

For HP Operations Center, Jon Haworth


 
