Infrastructure Management Software Blog

OH, IL, WI, IN, MI Operations Center Technical Roadshow - April 20th to April 29th - Don't miss it!

Ever wish you could talk face-to-face with more technical people about Operations Center and Network Management Center products? Don’t really have the time or budget to travel very far to do so?  Well, here is a great opportunity to meet and talk with technical experts on products like Operations Manager and NNMi – right in your backyard.


Vivit will be hosting a series of six (6) one-day sessions, each with a nice mix of presentations and Q&A sessions around these products.  The sessions will be held in the following states on the following days:


- (Columbus) Ohio – April 20, 2010


- (Orrville) Ohio – April 21, 2010


- (Dearborn) Michigan – April 22, 2010


- Wisconsin – April 27, 2010


- (Chicago) Illinois – April 28, 2010


- (Fishers) Indiana – April 29, 2010


Feel free to contact me if you have any further questions about this roadshow at asksonja@hp.com.


Event Correlation: OMi TBEC and Problem Isolation - What's the difference (part 3 of 3)

If you have not done so already, you may want to start with part 1 in this series.
http://www.communities.hp.com/online/blogs/managementsoftware/archive/2009/09/25/event-correlation-omi-tbec-and-problem-isolation-what-s-the-difference-part-1-of-3.aspx


Read part 2 in the series.
http://www.communities.hp.com/online/blogs/managementsoftware/archive/2009/09/25/event-correlation-omi-tbec-and-problem-isolation-what-s-the-difference-part-2-of-3.aspx



This is the final part in my three-post discussion of the event correlation technologies within OMi Topology Based Event Correlation (TBEC) and Problem Isolation. I've been focusing on talking about how TBEC is used and how it helps IT Operations Management staff be more effective and efficient.


In my last post I started to mention why End User Monitoring (EUM) technologies are important - because they are able to monitor business applications from an end user perspective. EUM technologies can detect issues which Infrastructure monitoring might miss.


 


In the example we worked through in the last post I mentioned how EUM can detect a response time issue and alert staff that they need to expedite the investigation of an ongoing incident. This is also where Problem Isolation helps. PI provides the most effective means to gather all of the information that we have regarding possible causes of the response time issue and analyze the most likely cause.


 


For example: Our web based ordering system has eight load balanced web servers connected to the internet. These are where our customers connect. The web server farm communicates back to application, database and email servers on the intranet, and the overall system allows customers to search and browse available products, place an order and receive email updates on order confirmation and shipping status.


 


The event monitoring system includes monitoring of all of the components. We also have EUM probes in place running test transactions and evaluating response time and availability. The systems are all busy but not overloaded - so we are not seeing any performance alerts from the event monitoring system.


 


A problem arises with two of our eight web servers, and they drop out of the load balanced farm. The operations bridge can see that the problem has happened as they receive events indicating the web server issues. TBEC shows that there are two separate issues, so this is not a cascading failure – and the operations staff can see that these web servers are part of the online ordering service.


 


However, they also know that the web servers are part of redundant infrastructure and there should be plenty of spare capacity in the six remaining load balanced web servers. As they have no other events relating to the online ordering service, they decide to leave the web server issues for a little while as they are busy dealing with some database problems for another business service.


 


The entire transaction load that would normally be spread across eight web servers is now focused on the remaining six. They were already busy but now are being pushed even harder, not enough to cause CPU utilization alerts but enough to increase the time that it takes them to process their component of the customer’s online ordering transactions. As a result, response time, as seen by customers, is terrible. The Operations Bridge are unaware as they see no performance alerts from the event management system.
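The arithmetic behind this is worth spelling out. Here is a minimal sketch, assuming the load is spread evenly across the farm. The 60% starting utilization and the 85% alert threshold are hypothetical figures chosen for illustration; neither comes from the scenario above.

```python
def per_server_utilization(total_servers: int, healthy_servers: int,
                           baseline_utilization: float) -> float:
    """Utilization of each healthy server after the same total load is
    redistributed evenly across fewer servers."""
    if healthy_servers <= 0 or healthy_servers > total_servers:
        raise ValueError("healthy_servers must be between 1 and total_servers")
    total_load = total_servers * baseline_utilization
    return total_load / healthy_servers


# Hypothetical figures for illustration only.
BASELINE = 0.60          # each of the 8 servers starts at 60% CPU
ALERT_THRESHOLD = 0.85   # a CPU alert fires above 85%

after_failure = per_server_utilization(8, 6, BASELINE)
increase = after_failure / BASELINE - 1

print(f"Per-server utilization after losing 2 of 8: {after_failure:.0%}")
print(f"Relative increase in load per server: {increase:.0%}")
print(f"CPU alert raised: {after_failure > ALERT_THRESHOLD}")
```

Whatever the starting point, losing two of eight servers puts a third more work (8/6) on each survivor. With the assumed numbers that takes each server from 60% to 80%: noticeably slower for customers, yet still under the alert threshold.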


 


EUM is our backstop here; it will detect the response time issue and raise an alert. This alert – indicating that the response time for the online ordering application is unacceptable – is sent to the Operations Bridge.


 


The Operations Bridge team now know that they need to re-prioritize resources to investigate an ongoing business service impacting issue. And they need to do this as quickly as possible. They need to gather all available information about the affected business service and try to understand why response time has suddenly become unacceptable. This is where Problem Isolation helps.


 


PI works to correlate more than just events. It will pull together data from multiple sources - performance history (resource utilizations), events, even help-desk incidents that have been logged and work to determine the likely issue.


 


So we've come full circle. I spent a lot of time talking about OMi, and events and how an Operations Bridge is assisted by TBEC. But it's not the one and only tool that you need in your bag. Technologies like EUM and PI help catch and diagnose all of the stuff that just cannot be detected by 'simply' (I use that term lightly) monitoring infrastructure.


 


Once again if you want to understand PI better I encourage you to take a look at the posts by Michael Procopio over on the BAC blog.



For HP Operations Center, Jon Haworth.

Event Correlation: OMi TBEC and Problem Isolation - What's the difference (part 2 of 3)

If you have not done so already, you may want to start with part 1 in this series.
http://www.communities.hp.com/online/blogs/managementsoftware/archive/2009/09/25/event-correlation-omi-tbec-and-problem-isolation-what-s-the-difference-part-1-of-3.aspx


This is part 2 of 3 of my discussion of the event correlation technologies within OMi Topology Based Event Correlation (TBEC) and Problem Isolation. I'm going to focus on talking about how TBEC is used and how it helps IT Operations Management staff be more effective and efficient. My colleague Michael Procopio has discussed PI in more detail over in the BAC blog here: PI and OMi TBEC blog post 


If you think about an Operations Bridge (or "NOC"… but I've blogged my opinion of that term previously) then fundamentally its purpose is very simple.


 


The Ops Bridge is tasked with monitoring the IT Infrastructure (network, servers, applications, storage etc.) for events and resource exceptions which indicate a potential or actual threat to the delivery of the business services which rely on the IT infrastructure. The goal is to fix issues as quickly as possible in order to reduce the occurrence or duration of business service issues.


 


Event detection is an ongoing process, and the Ops Bridge will monitor events during all production periods, often 24x7 using shift based teams.


 


Event monitoring is an inexact discipline. In many cases a single incident in the infrastructure will result in numerous events – only one of which actually relates to the cause of the incident. The other events are just symptoms.


 


The challenge for the Ops Bridge staff is to determine which events they need to investigate and to avoid chasing the symptom events. The operations team must prioritize their activities so that they invest their finite resources in dealing with causal events based on their potential business impact. They need to avoid wasting time in duplication of effort (chasing symptoms) or, even worse, chasing symptoms down in a serial fashion before they finally investigate the actual causal event, as this prolongs the downtime of business services.


 


TBEC helps the Operations Bridge in addressing these challenges. TBEC works 24x7, examining the event stream, relating it to the monitored infrastructure and the automatically discovered dependencies between the monitored components. TBEC works to provide a clear indication that specific events are related to each other (related to a single incident) and to identify which event is the causal event and which are symptoms.


 


Consider a disk free space issue on a SAN which is hosting an Oracle database. With comprehensive event monitoring in place, this will result in three events:



  • a disk space resource utilization alert

  • quickly followed by an Oracle database application error

  • and a further event which indicates that a WebSphere server which uses the Oracle database is unhappy


 


Separately, all three events seem ‘important’ – so considerable time could be wasted in duplicate effort as the Ops Bridge tries to investigate all three events. Even worse, with limited resources, it is quite possible that the Operations staff will chase the events ‘top down’ (serially) – look at WebSphere first, then Oracle, and finally the SAN – this extends the time to rectification and increases the duration (or potential) of a business outage.


 


TBEC will clearly show that the event indicating the disk space issue on the SAN is the causal event – and the other two events are symptoms.
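The idea can be sketched in a few lines. This illustrates topology-based correlation in general, not OMi's actual rule engine. The dependency map, the CI names and the `find_causal_events` function are all hypothetical.

```python
from __future__ import annotations

# Hypothetical dependency map: each configuration item (CI) lists the CIs
# it depends on. In a real deployment this would come from discovery.
DEPENDS_ON = {
    "websphere_server": ["oracle_db"],
    "oracle_db": ["san_volume"],
    "san_volume": [],
}


def upstream(ci: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every CI that `ci` depends on, directly or indirectly."""
    seen: set[str] = set()
    stack = list(depends_on.get(ci, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(depends_on.get(node, []))
    return seen


def find_causal_events(event_cis: list[str],
                       depends_on: dict[str, list[str]]) -> dict[str, list[str]]:
    """Split CIs with open events into causes and symptoms.

    An event is treated as a symptom if something its CI depends on also
    has an open event; otherwise it is treated as a cause.
    """
    affected = set(event_cis)
    result: dict[str, list[str]] = {"cause": [], "symptom": []}
    for ci in event_cis:
        kind = "symptom" if upstream(ci, depends_on) & affected else "cause"
        result[kind].append(ci)
    return result


events = ["san_volume", "oracle_db", "websphere_server"]
print(find_causal_events(events, DEPENDS_ON))
# {'cause': ['san_volume'], 'symptom': ['oracle_db', 'websphere_server']}
```

The sketch relies only on the dependency map, so two events on CIs that share no failing dependency would both be reported as causes rather than being forced into a single incident.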


 


In a perfect world the Ops Bridge can monitor everything, detect every possible event or compromised resource that might impact a business service and fix everything before a business service impact occurs.


 


The introduction of increasingly redundant and flexible infrastructure helps with this – redundant networks, clustered servers, RAID disk arrays, load balanced web servers etc. But, it also can add complications which I’ll illustrate later.


 


One of the challenges of event monitoring is that it simply can NOT detect everything that can impact business service delivery. For example, think about a complex business transaction, which traverses many components in the IT infrastructure. Monitoring of each of the components involved may indicate that they are heavily utilized – but not loaded to the point where an alert is generated.


 


However, the composite effect on the end to end response time of the business transaction may be such that response time is simply unacceptable. For a web based ordering system where customers connect to a company’s infrastructure and place orders for products this can mean the difference between getting orders or the customer heading over to a competitor’s web site.
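A small sketch shows how this can happen. All latencies, thresholds and the end-to-end target below are hypothetical, and the transaction is assumed to visit each component once, in series.

```python
# Hypothetical per-component figures in seconds: (observed latency,
# threshold at which infrastructure monitoring would raise an alert).
COMPONENTS = {
    "load_balancer": (0.4, 0.5),
    "web_server":    (1.8, 2.0),
    "app_server":    (1.7, 2.0),
    "database":      (1.6, 2.0),
}
END_TO_END_TARGET = 4.0  # hypothetical acceptable response time for the user


def component_alerts(components):
    """Names of components whose latency exceeds their own threshold."""
    return [name for name, (latency, limit) in components.items()
            if latency > limit]


def end_to_end_latency(components):
    """Total latency for a transaction that visits each component once, in series."""
    return sum(latency for latency, _ in components.values())


total = end_to_end_latency(COMPONENTS)
print("Component alerts:", component_alerts(COMPONENTS))
print(f"End-to-end: {total:.1f}s (target {END_TO_END_TARGET:.1f}s)")
print("End user impacted:", total > END_TO_END_TARGET)
```

With these assumed numbers no single component crosses its own threshold, so the event console stays quiet, while the transaction as a whole takes 5.5 seconds against a 4 second target.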


 


This is why End User Monitoring technologies are important. I'll talk about EUM in the next, and final, edition of this blog serial.




Read part 3 in the series.
http://www.communities.hp.com/online/blogs/managementsoftware/archive/2009/09/25/event-correlation-omi-tbec-and-problem-isolation-what-s-the-difference-part-3-of-3.aspx



For HP Operations Center,  Jon Haworth.

Event Correlation: OMi TBEC and Problem Isolation - What's the difference (part 1 of 3)

 I often get asked questions about the differences between two of the products in our Business Service Management Portfolio; BAC Problem Isolation and OMi Topology Based Event Correlation. Folks seem to get a little confused by some of the high level messaging around these products and gain the impression that the two products "do the same thing".


I guess that, as part of HP's Marketing organization, I have to take some of the blame for this so I'm going to blog my conscience clear (or try to).


To aid brevity I'll use the acronyms PI for Problem Isolation and TBEC to refer to OMi Topology Based Event Correlation.


On the face of it, there are distinct similarities between what PI and TBEC do.



  • Both products try to help operational support personnel to understand the likely CAUSE of an infrastructure or application incident.

  • Both products use correlation technologies (often referred to as event correlation) to achieve their primary goal.



I'll try to summarize the differences in a few sentences.



  • TBEC correlates events (based on discovered topology and dependencies) continuously to indicate the cause event in a group of related events. TBEC is "bottom up" correlation that works even when there is NO business impact - it is driven by IT infrastructure issues.

  • PI correlates data from multiple sources to determine the cause (or causal configuration item) where a business service impacting incident has occurred. PI performs correlation "on demand" and based on a much broader set of data than TBEC. PI might be considered "top down" correlation because it starts from the perspective of a business service impacting issue.



In reality, the differences between the products are best explained by looking at how they are used and I'll use my next couple of blog posts to do exactly that for TBEC. If you want the detail on PI then visit the "PI and OMi in the BAC blog" post from my colleague, Michael Procopio.


 Read part 2 in the series.
http://www.communities.hp.com/online/blogs/managementsoftware/archive/2009/09/25/event-correlation-omi-tbec-and-problem-isolation-what-s-the-difference-part-2-of-3.aspx


Read part 3 in the series.
http://www.communities.hp.com/online/blogs/managementsoftware/archive/2009/09/25/event-correlation-omi-tbec-and-problem-isolation-what-s-the-difference-part-3-of-3.aspx


For HP Operations Center, Jon Haworth.

Q&A from EMA webinar on incident management and OMi

Thank you to everyone who attended the EMA webinar on “What is New in the Not-so-New Area of Event Management: Five Tips to Reduce Incident Resolution Costs” (view the archived webinar by clicking on the link).


We had many great questions at the end, some of which we did not have time to answer. Here is a complete list of all the questions that were asked, along with the answers. If you have additional questions, please post them in the comment field on the blog.


 


What effect will cloud computing have on the management strategies you discussed?


In many respects, Cloud computing – if it’s to be successful as a responsible answer to optimizing infrastructure for business applications – will accelerate the need for consolidated event management and its associated technologies.  Cloud computing places many new complexities, and a stress on real-time awareness, in front of IT managers, including how to manage performance, change, and costs effectively across virtualized environments and potentially across a mix of external service providers wedded together in a dynamic ecosystem.  These requirements will force service providers to become more transparent in support of SLAs, performance management, infrastructure discovery, CMDB Systems and CMS involvements, and shared cost analysis, along with compliance, security and risk management issues.  In other words, Cloud computing cannot succeed except as a niche opportunity without embracing the best practices and process-centric programs within IT to optimize its own internal effectiveness.


As you all know, security event management is a domain in its own right, and there is as much interest in cross-domain integration of security processes & tools as in other areas, if not more so in some cases. How can unified event management help security and IT ops team achieve their common goals?


Security event integration with an overall consolidated event management system is one of the more challenging and also more valuable areas of consideration.  This is partly because rather than being a “component-defined” part of the infrastructure or SW environment, security is pervasively associated with all domains and all disciplines.  It is something like the “phantom” in event management: a more logical than tangible entity.  As such, defining policies for integration and reconciliation is more complex and overall less evolved.  Of course security has its own well-established history in event management, in particular with SIEM, but once again this evolved as a way of consolidating security-related event issues, rather than being a more holistic approach to integrating security events with performance and change related events.  And so to a large degree this challenge still remains unanswered by the industry as a whole.


Is OMi a replacement for OM?


No. OMi is a separate product that adds on to Operations Manager. OMi introduces advanced functionality such as system health indicators and topology-based event correlation using Operations Manager as the event consolidation platform. We designed the products in this way to allow our customers to gain significant new capabilities without disrupting their current Operations Manager deployment. There is no rip and replace, just adding a new component on top of the existing monitoring solution.


OMi looks a lot like BAC, are they tightly coupled?  Do I need both?


So are BAC and OMi the same product now?


Great observation. OMi is built on the BAC foundation so they do share a common look and feel. OMi performs advanced event management. BAC handles application management, transaction monitoring, and problem isolation. You can mix and match components from the two product sets to meet the needs of your organization and you only need to purchase the components that fit your needs. So, OMi and BAC are separate products, just tightly integrated.



Sounds great, but what is the cost?  Is there some way to justify the big cash outlay for IT organizations in SMBs?


The return on investment should be apparent. As we covered in the presentation, if you assume the cost of manually handling an event is $75 and OMi will eliminate processing of around 10% of events (conservative estimate), just determine how many events your Operations Bridge team handles per day/week/month/year and do the math.
And, of course, that ignores the benefits associated with a more rapid fix-time for incidents which will enhance business service availability.
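To make "do the math" concrete, here is a minimal sketch using the two figures quoted above ($75 per manually handled event, 10% of events eliminated). The daily event volumes are hypothetical placeholders, so substitute your own Operations Bridge numbers.

```python
COST_PER_EVENT_USD = 75.0   # figure quoted in the presentation
REDUCTION_RATE = 0.10       # conservative estimate quoted above


def annual_savings(events_per_day: float, days_per_year: int = 365) -> float:
    """Estimated yearly saving from events that no longer need manual handling."""
    eliminated_per_year = events_per_day * days_per_year * REDUCTION_RATE
    return eliminated_per_year * COST_PER_EVENT_USD


# Hypothetical volumes for illustration only.
for events_per_day in (50, 200, 1000):
    print(f"{events_per_day:>5} events/day -> "
          f"${annual_savings(events_per_day):,.0f} per year")
```

At 200 events a day, for example, roughly 7,300 events a year would no longer be handled manually, which at $75 each comes to about $547,500. As noted above, this counts handling cost only and leaves out any benefit from faster fixes.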


For pricing on OMi, please contact your local HP sales representative.


Can OMi run on the same server as Operations Manager?


No. You need to run the two products on different servers. OMi will run on its own Windows based platform and will be connected bi-directionally to a nominated OM server.


Do I need OMi to use the runbook automation capabilities of Operations Orchestration?


No. Operations Orchestration can use the events from Operations Manager as the trigger to launch flows. You do not need OMi too. Like OMi, OO leverages the power of OM and its agents. I strongly recommend you contact your HP sales rep to schedule a demo of Operations Manager and Operations Orchestration working together.


If everyone uses the same console, how will domain experts perform advanced troubleshooting?


The OMi console is designed for Operations Bridge personnel to view events, identify the causal event, and resolve the incident. Likely users will be Tier 1 operators and subject matter experts (SME) starting to troubleshoot problems and determine what to fix. The SMEs will then use their specialized tools to investigate the problems in more detail within their domain. For example, someone on the server team might see that a server is down and then use HP SIM (System Insight Manager) to identify that a fan has stopped working.
OMi includes the concept of “user roles” so that specific users can be provided with access to the events, infrastructure views and tools that are appropriate for their role. Domain experts could have user roles defined which include direct access to tools utilized for advanced troubleshooting.


Is there any special configuration I need to run OMi?


You need Operations Manager to consolidate events before feeding them to OMi. You can feed events from other tools (such as SiteScope for agentless monitoring) into Operations Manager to get better visibility of your enterprise by expanding the number of managed nodes. Operations Manager can also consolidate events from other domain managers such as Microsoft SCOM or IBM Tivoli.
You do need a recent version of Operations Manager – either OMW 8.10 with some specific patches or OMU 9.0. Existing Smart Plug-Ins will work with OMi but we’ve also been making some enhancements to provide tighter integration and to enable the Smart Plug-Ins for OMU to populate the topology maps automatically. So in general you need a recent OM version and later SPI versions are ‘better’.
Other than that, there is no special configuration.


Does OMi require ECS (event correlation services) to be built out?


No. As a general rule it’s a good idea to ‘refine’ the event stream that is processed by the OM server and passed to OMi. There is absolutely no point in passing lots of noise to OMi – stuff that we know is noise – so we would recommend making good use of all of the traditional event consolidation and filtering technologies in OM: time and count based correlation on agents, de-duplication etc.
ECS – Event Correlation Services – can also be used to further refine the event stream as it arrives at an OMU server but it is not a requirement for OMi.
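As a rough picture of what this kind of refinement does, the sketch below collapses repeated events into a single entry with a count. It is a generic illustration of time-window de-duplication, not OM policy syntax. The `Event` fields and the 300 second window are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Event:
    timestamp: float   # seconds since some reference point
    node: str
    message: str


def deduplicate(events, window_seconds=300.0):
    """Collapse repeats of the same (node, message) seen within the window.

    Returns a list of (first_event, count) pairs. The window is measured
    from the first occurrence; a repeat after it expires starts a new entry.
    """
    open_entries = {}   # (node, message) -> index into `result`
    result = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        key = (ev.node, ev.message)
        idx = open_entries.get(key)
        if idx is not None and ev.timestamp - result[idx][0].timestamp <= window_seconds:
            first, count = result[idx]
            result[idx] = (first, count + 1)
        else:
            open_entries[key] = len(result)
            result.append((ev, 1))
    return result


raw = [
    Event(0, "db01", "tablespace full"),
    Event(30, "db01", "tablespace full"),
    Event(60, "db01", "tablespace full"),
    Event(90, "web03", "http 503"),
    Event(900, "db01", "tablespace full"),   # outside the window: a new entry
]
for first, count in deduplicate(raw):
    print(f"{first.node}: {first.message} (x{count})")
```

Five raw events become three entries here, so less noise is passed on to the correlation stage.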


Are there any issues or challenges in using OMi in an environment with duplicate IP addresses, for a company such as an MSP (managed service provider)?


OMi should work in duplicate IP address environments providing that appropriate DNS resolution and IP routing OR HTTP PROXY CHAINING is in place to enable outbound connections from the existing OM server to the managed nodes (agents) to work correctly. The support for dup-IP is something we included in the HTTP communications protocol which can be used with OM agents after version 8.x of the OM servers. There are a number of different ways that the network 'resolution' can be set up - including http proxies and NAT - and we cannot commit to testing every possible configuration. However, with an appropriate configuration OMi will work in these environments. In general, if you have a dup-IP environment working with your existing OM server then OMi should also work.


Does OMi take into consideration HA (high availability) configurations such that it can identify business degradation as opposed to an outage?


Yes. This is one advantage of having health calculation and event correlation which is dynamically driven by the discovery of the infrastructure. Consider a cluster running some Microsoft Exchange Resource Groups, or a number of VMware hosts with some virtual machines which participate in delivering a business service. In either case, if we have a hardware issue then we may move the ‘application’ (resource group or VM) to another host. This may happen automatically.
The Operations Manager Smart Plug-In (SPI) which is monitoring these resources – so the Exchange SPI (which is cluster aware) or the Virtualization Infrastructure SPI – will detect the movement of resources typically within 1 to 2 minutes. The SPI will update the discovery information in OM and this will be synchronized into OMi a short time later. OMi’s perspective of the topology of the infrastructure will change and the health and event correlation rules will adapt.
OMi will now ‘understand’ that the hardware events which arrived from the cluster or VM host do not impact the business service which is supported by the specific Exchange Resource Group or virtual machine.
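The effect of a topology update on impact calculation can be sketched as follows. This is a simplified illustration under assumed names (`hosted_on`, `service_members`, `impacted_services`), not a description of how OMi stores or evaluates its topology.

```python
def impacted_services(failed_host, hosted_on, service_members):
    """Services with at least one member currently running on failed_host."""
    return sorted(
        service for service, members in service_members.items()
        if any(hosted_on.get(member) == failed_host for member in members)
    )


# Hypothetical topology: one Exchange resource group supporting an email service.
hosted_on = {"exchange_rg1": "node_a"}
service_members = {"email": ["exchange_rg1"]}

# A hardware event arrives from node_a while the resource group runs there.
print(impacted_services("node_a", hosted_on, service_members))   # ['email']

# Failover: discovery reports that the resource group now runs on node_b.
hosted_on["exchange_rg1"] = "node_b"

# The same hardware event on node_a no longer maps to the email service.
print(impacted_services("node_a", hosted_on, service_members))   # []
```

The hardware event itself is unchanged. Only the discovered relationship between the resource group and its host differs, and that alone changes the business impact assessment.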


 


For HP Operations Center, Peter Spielvogel.


Get the latest updates on our Twitter feed @HPITOps http://twitter.com/HPITOps


Join the HP OpenView & Operations Management group on LinkedIn.

The opinions expressed above are the personal opinions of the authors, not of HP. By using this site, you accept the Terms of Use and Rules of Participation.