Infrastructure Management Software Blog

Radio interview on IT Operations Management book (recording)

On Wednesday, August 3, 2011, my co-authors Jon Haworth and Sonja Hickey joined me on The Bill Marlow show, a radio show focused on business innovations. Bill interviewed us for about half an hour, asking questions about the IT Operations Management book and how we came up with some of the insights in the book.


You can listen to a recording of the episode here - and perhaps gain some additional color beyond what’s in the book.

 

For HP Operations Center, Peter Spielvogel.

 

IT Operations Management book promotion radio interview and free ebook offer

Thank you to everyone who has purchased our IT Operations Management book (www.itopsbook.com). I will be appearing on The Bill Marlow Show (http://www.thebillmarlowshow.com) along with my co-authors Jon Haworth and Sonja Hickey on Wednesday, August 3, at 10 AM Eastern time.

 

As a special promotion to readers of this blog, you can download a FREE copy of the ebook version at www.hpitopsbook.com.

New Operations Manager demo

One consistent piece of feedback from almost every presentation I make on Operations Center, its individual products such as Operations Manager or SiteScope, or Business Service Management overall is that people want to see a demo. This applies to customers, prospects, partners and new hires. So, my colleague Jon Haworth and I have put together an Operations Manager demo.

There is no such thing as agentless monitoring

I often hear people trying to make sense of "agent or agentless" monitoring.

 

The reality is that (for most purposes) there is no such thing as agentless monitoring.

Video that explains the Run-time Service Model in 3 minutes

Watch this 3-minute video to see how a Run-time Service Model can take your IT monitoring to the next level, guide you as conditions change, and help you avoid unwanted impact to your business services.

How to frustrate end users

I'm currently on the receiving end of some pretty dismal service quality from my broadband (ADSL) provider. I work from home much of the time, so when my broadband is unavailable I really feel unproductive and disconnected.

The issues have been ongoing for a while and relate to some problems with the equipment in the local exchange.

I'm not going to delve into great detail or "name and shame", but I do want to step back and treat my unhappy experience as a learning opportunity. There are a couple of lessons in my interactions with my ISP that offer good guidance on what to avoid if you don't want to frustrate your end users as you strive to deliver IT services to your business and your customers.

The first thing to avoid is a lack of transparency across the teams involved in service delivery. When I call the technical help line to report a red light on my ADSL router, I expect someone to KNOW that there is a problem - and to be able to explain what is being done.

Often, this has not been the case.

In many cases the help desk doesn't know that there is an issue. This frustrates me (I'm being used as the monitoring device), but it also wastes a lot of time, as they tend to follow a script of standard tests before they discover that there is a known problem at the broadband exchange.

The bottom line is that if your infrastructure monitoring can detect a service-impacting issue, that information should be shared with all the folks who can make use of it. Use it to update your help desk / service management systems so that the people who are front and center talking to customers appear informed and can reassure the customer that the IT organization has its act together.

I'm not suggesting that every infrastructure event needs to be visible to the help desk - but if an event has a high probability of affecting service delivery, then it has value. And whatever is shared has GOT to be accurate. The information also needs to be kept current with status and estimated fix times, so these can be relayed to customers - providing reassurance that the technicians are dealing with the issue and setting expectations for when service will resume.

We provide interfaces for this in our own monitoring solutions, such as Operations Manager and BAC, into HP Service Manager (and some third-party help desk packages), because we believe it's a vital part of how an incident management process connects to service management activities. This is essential capability for what we describe as a Closed Loop Incident Process (CLIP).
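To picture what that sharing looks like, here is a minimal sketch in Python of the mapping step, assuming a generic help-desk REST API. The event fields, severities and record layout are hypothetical placeholders - this is not the actual Operations Manager / Service Manager interface, just the shape of the idea.

    # Minimal sketch: decide which monitoring events the help desk should see,
    # and map them to an incident record. All field names are hypothetical.
    import json

    SERVICE_IMPACTING = ("critical", "major")

    def to_incident(event):
        """Map a service-impacting monitoring event to a help-desk incident."""
        if event["severity"] not in SERVICE_IMPACTING or not event.get("affects_service"):
            return None  # not worth the help desk's attention
        return {
            "summary": event["message"],
            "ci": event["ci_name"],                        # affected configuration item
            "status": event["status"],                     # must be kept accurate and current
            "estimated_fix": event.get("eta", "unknown"),  # lets agents set expectations
        }

    incident = to_incident({
        "severity": "critical",
        "affects_service": True,
        "message": "Broadband exchange line card failure",
        "ci_name": "exchange-01",
        "status": "engineer dispatched",
        "eta": "17:00",
    })
    # A real integration would POST this record to the service desk's incident
    # API; here we just print what would be created.
    print(json.dumps(incident, indent=2))

The filter is the point: only events with a high probability of service impact reach the help desk, and the status and fix-time fields are what let the front-line staff sound informed.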

The second 'thing' that drove me nuts happened mid-afternoon the day before yesterday. I called up because the broadband was down again, and the response from the help desk was that the exchange was "having an upgrade".

Now, two possibilities spring to mind here.

The first (driven by disbelief) is that this is incorrect information - either the status in the fault record is wrong (see my earlier comments about sharing accurate information) or the help desk person is trying to fob me off. That's another big no-no if you want happy customers.

The second possibility (also driven by disbelief) is that someone needs to take an ITIL class and learn the basics of configuration and change management as they relate to service delivery and service management. Taking a broadband exchange offline for four hours on a weekday afternoon to perform an upgrade appears, on the face of it, to be a little ill-considered.

If you want to understand more about how HP can help with change and configuration management, take a look at our Service Manager product. To plan change effectively you need good, up-to-date configuration information about the CIs (Configuration Items), how they relate to each other, and how they support IT services. That's not something you can maintain manually - at least not cost-effectively - so some automated discovery is essential, and we can help there with our discovery (DDM) and UCMDB technologies.

Economy Down, Virtualization Ratios Up

I've been talking to a lot of our customers recently about their HP Software installations and the way they are evolving their IT infrastructures. Inevitably, a common thread with most of them has been their adoption of virtualization.

What has surprised me a little is what seems to be a significant increase in the number of virtual machines (VMs) that folks are aspiring to host on a single physical server.

We've done some analysis in this area in the past - and discussed it with folks like VMware. Roughly speaking, 12-18 months ago we were seeing 10:1 (VMs per server) where folks had actually started virtualizing. We expected to see 20:1 this year.

My discussions seem to suggest that 30:1 is a more common goal, with some of our customers targeting 40:1 or even 60:1 (and in one case 100:1!).

There is another 'trend' that my mini-survey seems to reveal. In general, the ratios for Intel/AMD-based VMs (such as Microsoft Windows on VMware) tend to be higher than for RISC-based virtualization. Maybe this reflects the fact that RISC-based servers have tended to host heavy-duty commercial applications, so each VM needs more resources.

So why is this accelerating so rapidly? My personal view is that the economic downturn has encouraged IT executives to be less risk averse when presented with an opportunity, like virtualization, which could help drive down costs.

Without the intense pressure to reduce costs, virtualization may have looked like a bit of a risk, so folks sat on the fence and waited for others to jump. But when the cost-cutting pressure climbed during the downturn, folks were willing to make the leap and embark on virtualization - the perceived risks were outweighed by the potential upsides.

And once folks embarked down the virtualization path, they realized it was going to be OK. I see this driving two behaviours. The first is the increased ratios already discussed; the second is that I've heard a number of customers say "we're virtualizing everything - all new servers will be VMs unless there is a compelling business case for a physical server".

"In for a penny, in for a pound", as we say in the UK.

So what are your virtualization ratio goals?

The best way to manage VMware environments

Virtualization management seems to be the hottest topic for discussion among customers, partners, and my product marketing and product management peers. I was recently involved in a conversation with some sales people about why HP’s approach is unique.


The reference platform for virtualization management of VMware environments is vCenter (formerly Virtual Center). But many customers do not want their virtualization experts to spend their (very expensive) time managing first-level events. So, they look to a centralized management console such as Operations Manager to handle events from both the virtual and physical IT infrastructure. This is the value behind a consolidated event and performance management approach.


So, the challenge is how to get information about the virtual infrastructure into the central event console. The old way, which we used to do, was to install our agents on the VMware hypervisor. We worked closely with VMware to ensure that it worked and was supportable, but customers got nervous because the general advice is "do not install anything in the hypervisor". Obviously, if the hypervisor becomes unstable then all of the virtual machines suffer.


The new approach, recommended by VMware, is to use the vMA or “vSphere Management Assistant”. The vMA is a pre-built Linux virtual machine. It is built and owned by VMware and can be downloaded free of charge from their web site. You run the vMA just like any other virtual machine.





The vMA includes all of the VMware-approved and supported interfaces and APIs that enable monitoring of the VMware environment. It provides access to information such as the current configuration of the VMs, fault information, and very accurate performance data. This is the new way that VMware wants other management systems to get their information from the hypervisor. The vMA provides more granular, real-time information than you can get from vCenter. HP was the first vendor to release a management product using the vMA.


We install our agent and Virtualization SPI onto the vMA and make use of these interfaces. One vMA can provide access to monitoring information from multiple VMware server hosts. The “resolution” of the data that we get with the Virtualization SPI vs. vCenter is really just a reflection of what customers told us they wanted. I'm sure that VMware could provide much of the same detail - certainly for a VMware server - but they did not. Customers told us they wanted more - "Don't just give us what Virtual Center provides, go deeper".
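For a flavor of what monitoring code running on a management VM can see, here is a rough sketch using pyVmomi, the open-source Python bindings for the vSphere API. It is illustrative only - not the Virtualization SPI's actual implementation - and the endpoint name and credentials are placeholders.

    # Illustrative only: poll basic host statistics through the vSphere API,
    # the way a collector hosted on a management VM might. Not the actual SPI.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    ctx = ssl._create_unverified_context()  # lab shortcut; use real certificates in production
    si = SmartConnect(host="vcenter.example.com", user="monitor",
                      pwd="secret", sslContext=ctx)
    try:
        content = si.RetrieveContent()
        # One connection can enumerate every host visible at the endpoint,
        # mirroring how a single vMA can monitor multiple VMware hosts.
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.HostSystem], True)
        for host in view.view:
            stats = host.summary.quickStats
            print(f"{host.name}: CPU {stats.overallCpuUsage} MHz, "
                  f"memory {stats.overallMemoryUsage} MB")
        view.Destroy()
    finally:
        Disconnect(si)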


The advantage of using the HP Virtualization SPI and Operations Manager is that you can see very granular fault and performance data for both physical and virtual infrastructure in your Operations Manager console. This means your tier 1 operators can manage events and handle basic triage and remediation functions. This keeps your virtualization experts focused on more strategic tasks, until they need to manage an escalation.


This is just one example of HP’s close ties with VMware. We also have an integration between HP Insight Control and vCenter that allows customers to manage both physical and virtual infrastructure through the VMware vCenter console. We announced this capability at VMworld in September. It is aimed at server administrators who want a single expert tool for troubleshooting complex problems that could span the hardware and hypervisor.


For HP Operations Center, Jon Haworth and Peter Spielvogel.


Get the latest updates on our Twitter feed @HPITOps http://twitter.com/HPITOps


Join the HP OpenView & Operations Management group on LinkedIn.

BlackBerry Management Webinar (Empowering a Mobile Workforce)

We have written about the challenges of managing BlackBerry Enterprise Server environments, mostly because of all the disparate elements that must work together properly to ensure that people receive their email. For a user to successfully send or receive a message, the following applications must interact smoothly: BlackBerry Enterprise Server, Microsoft Exchange, Microsoft Active Directory, and Microsoft SQL Server.


You can imagine the challenge of troubleshooting performance problems if you do not have a single console from which to manage faults and performance data. When the CEO is calling about his or her email, I would certainly want an easy way to determine when the service will be back online.


If you manage BlackBerry Enterprise Servers, you will certainly want to attend this webinar.


One lucky attendee will win a BlackBerry (actual device will depend on your coverage area)!



Title: Empowering a Mobile Workforce: A Holistic Approach to Managing your BlackBerry Ecosystem
Date: Wednesday, November 4, 2009
Time: 11 am Pacific / 2 pm Eastern / Check Your Time Zone
Additional Speakers:
- Pierluigi Buonicore, Product Manager, Research in Motion (RIM)
- Jonathan Evans, Product Marketing, Research in Motion (RIM)
- Jon Haworth, Product Marketing Manager, HP


Behind every executive’s BlackBerry is a complex IT infrastructure for delivering messages and mission-critical mobile applications. So, when performance or availability issues occur, the Operations Team has no time to waste in identifying the cause of the problem and fixing it. Using a single event monitoring console improves visibility across the BlackBerry ecosystem and streamlines communications among your subject matter expert teams - enabling faster problem resolution and less downtime.


In this one-hour webinar, EMA VP Dennis Drogseth, Research in Motion (RIM) Product Manager Pierluigi Buonicore, Jonathan Evans from RIM Product Marketing, and HP Product Marketing Manager Jon Haworth will explore solutions that enable a holistic support environment for managing BlackBerry resources across the enterprise. Topics of discussion will include how to:



  • Identify the most common challenges to enterprise BlackBerry management

  • Utilize best practices, such as ITIL, to increase the supportability of mobile devices

  • Enable prompt problem identification and improved time to resolution

  • Evaluate new solutions that will enable a common interface for managing enterprise BlackBerry support services


This is a can’t-miss event for any organization supporting BlackBerry devices or looking to empower their mobile workforce.
Register now (and have a chance to win a BlackBerry).


For HP Operations Center, Peter Spielvogel.


Get the latest updates on our Twitter feed @HPITOps http://twitter.com/HPITOps


Join the HP OpenView & Operations Management group on LinkedIn.

Operations Manager Basics (product overview videos)

I spent the past two days in a planning meeting with my product marketing peers from different product groups, including infrastructure monitoring, application monitoring, network monitoring, CMDB, service management, and IT financial management. We reviewed all our respective product plans and our go-to-market strategies (you will need to watch during the year to learn what we decided). While everyone had some idea about what high-level problems each product line solves, some people were not familiar with specific Operations Manager functionality, especially the current version's capabilities.


They asked for the fastest and easiest way to come up to speed. After some thought, I pointed them to two videos - one for Operations Manager (focused on consolidated event and performance management) and another for Operations Manager i (focused on advanced event reduction using topology-based event correlation). I have posted the links below.


HP Operations Manager
Peter Spielvogel and Jon Haworth discuss how Operations Manager allows customers to monitor heterogeneous IT environments, reduce management costs, and speed time to problem resolution.

(While the demo is on Operations Manager on Windows (OMW), the functionality is virtually the same for Operations Manager on Linux (OML) and Operations Manager on Unix (OMU).)



HP Operations Manager i
Jon Haworth and Dan Haller talk about increasing IT event processing efficiency with OMi.



If you have additional questions, please let me know.


For HP Operations Center, Peter Spielvogel.


Get the latest updates on our Twitter feed @HPITOps http://twitter.com/HPITOps


Join the HP OpenView & Operations Management group on LinkedIn.


 

Managing Virtualization and BlackBerry Ecosystems (Q&A from Vivit technical webinar)

Thank you to the 90 people who attended the Vivit webinar, "What's New In Operations Management: Virtualization & BlackBerry Smart Plug-Ins Demo". Jon Haworth and Dan Haller presented on how to get more from your existing Operations Management environment by adding the new Smart Plug-ins for Virtualization and BlackBerry.



If you missed the event, you can view a replay of the webinar on the Vivit web site.



Here are the questions that people asked during the event, along with the answers.


Managing Virtualization - System Support



Q: We have OVPM for UNIX. Does the VI SPI still only work with OVPM for Windows? Currently we have a workaround from HP Support to run with OVPM for UNIX. Will newer versions of the VI SPI work with OVPM on UNIX?

A: VI SPI 1.5 is supported with OMU 9.0 - so OMU on Solaris and HP-UX, and OM on Linux. We now include Performance Manager (PM) with new installations of OMU 9.



Q: Can you run the virtualization SPI if you only have the operations agent and not the performance agent?

A: No, you need both agents. Stay tuned for some updates regarding our agents. This issue will become moot on November 1.



Q: Does the VM SPI require both an OV agent and an OVPA agent on all the virtual machines and hosts?

A: No. You can choose to monitor the VMs with OM agents if you wish. If no OM agent is installed on a VM then a Target Connector license is required.



Q: Does the agent install on an ESX3i version 4 USB drive?

A: With VI SPI 1.5 the agents are no longer installed on the ESX / ESXi hypervisor - they are installed on the vMA (a virtual management appliance VM supplied by VMware). If your ESX3i version 4 host has a vMA installed or is monitored by a remote vMA, then the VI SPI (on the vMA) will be able to monitor it.



Q: Do you have any plans for supporting Solaris and HP virtualized environments?

A: We are investigating other virtualization technologies. We named this product the Smart Plug-In for Virtualized Infrastructure specifically to indicate that it is not constrained to monitoring just (for example) VMware. The first version of the VI SPI supported only VMware ESX. Version 1.5 added Hyper-V and ESXi. While we cannot comment on specific product plans, you can follow the trend.



Q: Is a Linux VM guest required? We only have Windows guests.

A: The Linux vMA (Virtual Management Appliance) is provided by VMware - you just download it from their web site. It's pre-built, so you do not have to 'know' anything to utilize it. VMware has information on the vMA on their web site: http://www.vmware.com/support/developer/vima/



Managing Virtualization - Licensing



Q: When is version 1.5 of the VI SPI going to be released?

A: It achieved Manufacturing Release status the week of September 28th and is just ready for shipping now.



Q: How much is the System Infrastructure SPI?

A: It is included with the OM agent.



Q: Is there a limitation on the number of virtual machines that can be monitored?

A: We are suggesting that a single VI SPI is limited to monitoring 200 instances. Each host, guest (VM), resource group and cluster counts as one instance. We have tested considerably more than this but are using 200 as our recommendation.



Q: Do you need a SPI license per ESX host or per vMA?

A: Per VM host (ESX/ESXi or Hyper-V).



Q: Is the VI SPI different from the Nworks VMware SPI, and can that be used as well?

A: Yes, it is a completely different SPI from the Nworks / Veeam SPI - and has a fundamentally different architecture (high-resolution agent-based monitoring as opposed to agentless monitoring). I guess you could use both products, but I'm not sure exactly what you would gain.



Managing BlackBerry Ecosystem



Q: How does licensing of the BES SPI work?

A: You purchase one BES SPI license for each BES server that you wish to monitor (physical or virtual). It is a flat price structure (no tiers, etc.). Obviously you also need an OM agent for each BES server in order to be able to deploy the BES SPI.



Q: Is the BES SPI available for OM 8.1, or is it only available for OMi?

A: The BES SPI includes the OMi Content Pack. So you could purchase the BES SPI for your OMW 8.10 system and make full use of its "SPI" functionality or, if you have OMi connected to OMW, you can also make use of the OMi features such as Health Perspective Views and Topology Based Event Correlation.



Q: Is there any reference data that shows how BES performance improves when using the SPI?

A: Not right now, although we would expect some of the existing application and service availability / performance improvements for OM to be valid.



Q: Can the BlackBerry SPI account for devices that are turned off or out of range when monitoring the number of messages waiting to be sent?

A: The total queue size of unsent messages and calendar updates that are enqueued for all handhelds is monitored. A report (by device) of pending messages is not available at this time.



Q: Does the HP Smart Plug-in for BES support a Lotus Notes environment?

A: Not right now. This capability is under investigation.



Other Operations Management Questions



Q: How do I get more information about the integrated OVO and OVP agent... what is this new agent called?

A: At present this is just a licensing change. We will provide more information on November 1. We will be executing on some technology updates in the future, so keep listening for news.



Q: How often is data collected from devices, and how?

A: As with all OM message threshold policies, the schedule is easily configurable. Out of the box, some data is collected every two minutes, some every five minutes, and some less often.



For HP Operations Center, Jon Haworth.


Get the latest updates on our Twitter feed @HPITOps http://twitter.com/HPITOps



Join the HP OpenView & Operations Management group on LinkedIn.

Event Correlation: OMi TBEC and Problem Isolation - What's the difference (part 3 of 3)

If you have not done so already, you may want to start with part 1 in this series.
http://www.communities.hp.com/online/blogs/managementsoftware/archive/2009/09/25/event-correlation-omi-tbec-and-problem-isolation-what-s-the-difference-part-1-of-3.aspx


Read part 2 in the series.
http://www.communities.hp.com/online/blogs/managementsoftware/archive/2009/09/25/event-correlation-omi-tbec-and-problem-isolation-what-s-the-difference-part-2-of-3.aspx



This is the final part of my three-post discussion of the event correlation technologies OMi Topology Based Event Correlation (TBEC) and Problem Isolation. I've been focusing on how TBEC is used and how it helps IT Operations Management staff be more effective and efficient.


In my last post I started to explain why End User Monitoring (EUM) technologies are important - because they are able to monitor business applications from an end-user perspective. EUM technologies can detect issues which infrastructure monitoring might miss.


 


In the example we worked through in the last post, I mentioned how EUM can detect a response time issue and alert staff that they need to expedite the investigation of an ongoing incident. This is also where Problem Isolation helps. PI provides the most effective means to gather all of the information that we have regarding possible causes of the response time issue and to analyze the most likely cause.


 


For example: our web-based ordering system has eight load-balanced web servers connected to the internet. These are where our customers connect. The web server farm communicates back to application, database and email servers on the intranet, and the overall system allows customers to search and browse available products, place an order, and receive email confirmations of order and shipping status.


 


The event monitoring system includes monitoring of all of the components. We also have EUM probes in place running test transactions and evaluating response time and availability. The systems are all busy but not overloaded - so we are not seeing any performance alerts from the event monitoring system.


 


A problem arises with two of our eight web servers, and they drop out of the load balanced farm. The operations bridge can see that the problem has happened as they receive events indicating the web server issues. TBEC shows that there are two separate issues, so this is not a cascading failure – and the operations staff can see that these web servers are part of the online ordering service.


 


However, they also know that the web servers are part of redundant infrastructure and there should be plenty of spare capacity in the six remaining load balanced web servers. As they have no other events relating to the online ordering service, they decide to leave the web server issues for a little while as they are busy dealing with some database problems for another business service.


 


The entire transaction load that would normally be spread across eight web servers is now focused on the remaining six. They were already busy but are now being pushed even harder - not enough to cause CPU utilization alerts, but enough to increase the time it takes them to process their component of the customer’s online ordering transactions. As a result, response time, as seen by customers, is terrible. The Operations Bridge is unaware, as it sees no performance alerts from the event management system.


 


EUM is our backstop here; it will detect the response time issue and raise an alert. This alert – indicating that the response time for the online ordering application is unacceptable – is sent to the Operations Bridge.
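As a sketch of what such a backstop probe does, here is a minimal synthetic transaction monitor in Python. The ordering URL and the four-second threshold are illustrative assumptions, not anything from a real EUM product.

    # Minimal synthetic end-user probe: run a test transaction, time it,
    # and raise an alert on slowness or failure. URL and threshold are made up.
    import time
    import urllib.request

    ORDERING_URL = "https://orders.example.com/search?q=widget"  # hypothetical test transaction
    THRESHOLD_SECONDS = 4.0  # illustrative acceptable response time

    def alert(message):
        # A real probe would raise this as an event to the Operations Bridge console.
        print("EUM ALERT:", message)

    def probe():
        start = time.monotonic()
        try:
            urllib.request.urlopen(ORDERING_URL, timeout=30)
        except OSError as exc:
            alert(f"Test transaction failed: {exc}")  # availability, not just slowness
            return
        elapsed = time.monotonic() - start
        if elapsed > THRESHOLD_SECONDS:
            alert(f"Response time {elapsed:.1f}s exceeds {THRESHOLD_SECONDS}s target")

    probe()  # a scheduler would run this every few minutes, around the clock

Note that the probe alerts even when every individual server looks healthy - which is exactly the gap in the web farm scenario above.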


 


The Operations Bridge team now know that they need to re-prioritize resources to investigate an ongoing business service impacting issue. And they need to do this as quickly as possible. They need to gather all available information about the affected business service and try to understand why response time has suddenly become unacceptable. This is where Problem Isolation helps.


 


PI works to correlate more than just events. It will pull together data from multiple sources - performance history (resource utilizations), events, even help-desk incidents that have been logged - and work to determine the likely cause.
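To illustrate the "pull together data from multiple sources" idea (and only that - this is not the actual Problem Isolation algorithm), here is a toy Python sketch that scores the CIs behind an affected service using weighted evidence from events, performance history and help-desk incidents. All names and weights are invented.

    # Toy sketch of multi-source cause analysis. CIs, data and weights are invented.
    from collections import Counter

    open_events = ["web-03", "web-07"]                 # CIs with open events
    high_utilization = ["web-01", "web-03", "web-07"]  # CIs with unusual utilization history
    helpdesk_incidents = ["web-03"]                    # CIs named in recent help-desk tickets

    scores = Counter()
    for cis, weight in ((open_events, 3), (high_utilization, 1), (helpdesk_incidents, 2)):
        for ci in cis:
            scores[ci] += weight  # more independent evidence -> more likely cause

    likely_cause, score = scores.most_common(1)[0]
    print(f"Most likely cause: {likely_cause} (evidence score {score})")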


 


So we've come full circle. I spent a lot of time talking about OMi and events and how an Operations Bridge is assisted by TBEC. But it's not the one and only tool that you need in your bag. Technologies like EUM and PI help catch and diagnose all of the stuff that just cannot be detected by 'simply' (I use that term loosely) monitoring infrastructure.


 


Once again if you want to understand PI better I encourage you to take a look at the posts by Michael Procopio over on the BAC blog.



For HP Operations Center, Jon Haworth.

Event Correlation: OMi TBEC and Problem Isolation - What's the difference (part 2 of 3)

If you have not done so already, you may want to start with part 1 in this series.
http://www.communities.hp.com/online/blogs/managementsoftware/archive/2009/09/25/event-correlation-omi-tbec-and-problem-isolation-what-s-the-difference-part-1-of-3.aspx


This is part 2 of 3 in my discussion of the event correlation technologies OMi Topology Based Event Correlation (TBEC) and Problem Isolation (PI). I'm going to focus on how TBEC is used and how it helps IT Operations Management staff be more effective and efficient. My colleague Michael Procopio has discussed PI in more detail over in the BAC blog here: PI and OMi TBEC blog post


If you think about an Operations Bridge (or "NOC"… but I've blogged my opinion of that term previously) then fundamentally its purpose is very simple.


 


The Ops Bridge is tasked with monitoring the IT Infrastructure (network, servers, applications, storage etc.) for events and resource exceptions which indicate a potential or actual threat to the delivery of the business services which rely on the IT infrastructure. The goal is to fix issues as quickly as possible in order to reduce the occurrence or duration of business service issues.


 


Event detection is an ongoing process, and the Ops Bridge will monitor events during all production periods - often 24x7, using shift-based teams.


 


Event monitoring is an inexact discipline. In many cases a single incident in the infrastructure will result in numerous events - only one of which actually relates to the cause of the incident; the other events are just symptoms.


 


The challenge for the Ops Bridge staff is to determine which events they need to investigate, and to avoid chasing the symptom events. The operations team must prioritize their activities so that they invest their finite resources in dealing with causal events based on their potential business impact. They must avoid wasting time in duplication of effort (chasing symptoms) or, even worse, chasing symptoms down in serial fashion before they finally investigate the actual causal event, as this extends the potential downtime of business services.


 


TBEC helps the Operations Bridge in addressing these challenges. TBEC works 24x7, examining the event stream, relating it to the monitored infrastructure and the automatically discovered dependencies between the monitored components. TBEC works to provide a clear indication that specific events are related to each other (related to a single incident) and to identify which event is the causal event and which are symptoms.


 


Consider a disk free space issue on a SAN which is hosting an Oracle database. With comprehensive event monitoring in place, this will result in three events:



  • a disk space resource utilization alert

  • an Oracle database application error, following quickly

  • a further event indicating that a WebSphere server which uses the Oracle database is unhappy


 


Separately, all three events seem ‘important’ - so considerable time could be wasted in duplicate effort as the Ops Bridge tries to investigate all three events. Even worse, with limited resources, it is quite possible that the operations staff will chase the events ‘top down’ (serially) - look at WebSphere first, then Oracle, and finally the SAN - which extends the time to rectification and increases the duration (or potential) of a business outage.


 


TBEC will clearly show that the event indicating the disk space issue on the SAN is the causal event – and the other two events are symptoms.
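A toy sketch may help show why topology makes this determination possible. The dependency map mirrors the example above, but the CI names and the simple "underpins everything else" rule are mine for illustration, not TBEC's actual logic.

    # Toy topology-based correlation: within one correlation window, the event
    # whose CI underpins the CIs of all the other events is flagged as causal.
    DEPENDS_ON = {                     # discovered topology: component -> what it runs on
        "websphere-1": ["oracle-1"],
        "oracle-1": ["san-lun-7"],
        "san-lun-7": [],
    }

    def underpins(candidate, other):
        """True if `other` (transitively) depends on `candidate`."""
        stack = list(DEPENDS_ON.get(other, []))
        while stack:
            dep = stack.pop()
            if dep == candidate:
                return True
            stack.extend(DEPENDS_ON.get(dep, []))
        return False

    events = ["websphere-1", "oracle-1", "san-lun-7"]  # events in one correlation window
    for ci in events:
        verdict = "CAUSAL" if all(ci == o or underpins(ci, o) for o in events) else "symptom"
        print(f"{ci}: {verdict}")

Running this prints the SAN event as causal and the Oracle and WebSphere events as symptoms - the same conclusion described above.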


 


In a perfect world the Ops Bridge can monitor everything, detect every possible event or compromised resource that might impact a business service and fix everything before a business service impact occurs.


 


The introduction of increasingly redundant and flexible infrastructure helps with this - redundant networks, clustered servers, RAID disk arrays, load-balanced web servers, etc. But it can also add complications, which I’ll illustrate later.


 


One of the challenges of event monitoring is that it simply can NOT detect everything that can impact business service delivery. For example, think about a complex business transaction, which traverses many components in the IT infrastructure. Monitoring of each of the components involved may indicate that they are heavily utilized – but not loaded to the point where an alert is generated.


 


However, the composite effect on the end-to-end response time of the business transaction may be such that response time is simply unacceptable. For a web-based ordering system where customers connect to a company’s infrastructure and place orders for products, this can mean the difference between getting orders or the customer heading over to a competitor’s web site.


 


This is why End User Monitoring technologies are important. I'll talk about EUM in the next, and final, edition of this blog serial.




Read part 3 in the series.
http://www.communities.hp.com/online/blogs/managementsoftware/archive/2009/09/25/event-correlation-omi-tbec-and-problem-isolation-what-s-the-difference-part-3-of-3.aspx



For HP Operations Center,  Jon Haworth.

Event Correlation: OMi TBEC and Problem Isolation - What's the difference (part 1 of 3)

I often get asked questions about the differences between two of the products in our Business Service Management portfolio: BAC Problem Isolation and OMi Topology Based Event Correlation. Folks seem to get a little confused by some of the high-level messaging around these products and gain the impression that the two products "do the same thing".


I guess that, as part of HP's Marketing organization, I have to take some of the blame for this, so I'm going to blog my conscience clear (or try to).


To aid brevity I'll use the acronyms PI for Problem Isolation and TBEC to refer to OMi Topology Based Event Correlation.


On the face of it, there are distinct similarities between what PI and TBEC do.



  • Both products try to help operational support personnel to understand the likely CAUSE of an infrastructure or application incident.

  • Both products use correlation technologies (often referred to as event correlation) to achieve their primary goal.



I'll try to summarize the differences in a few sentences.



  • TBEC correlates events (based on discovered topology and dependencies) continuously, to indicate the causal event in a group of related events. TBEC is "bottom-up" correlation that works even when there is NO business impact - it is driven by IT infrastructure issues.

  • PI correlates data from multiple sources to determine the cause (or causal configuration item) where a business-service-impacting incident has occurred. PI performs correlation "on demand", based on a much broader set of data than TBEC. PI might be considered "top-down" correlation because it starts from the perspective of a business-service-impacting issue.



In reality, the differences between the products are best explained by looking at how they are used, and I'll use my next couple of blog posts to do exactly that for TBEC. If you want the detail on PI, then visit this PI and OMi in the BAC blog post from my colleague, Michael Procopio.


 Read part 2 in the series.
http://www.communities.hp.com/online/blogs/managementsoftware/archive/2009/09/25/event-correlation-omi-tbec-and-problem-isolation-what-s-the-difference-part-2-of-3.aspx


Read part 3 in the series.
http://www.communities.hp.com/online/blogs/managementsoftware/archive/2009/09/25/event-correlation-omi-tbec-and-problem-isolation-what-s-the-difference-part-3-of-3.aspx


For HP Operations Center, Jon Haworth.

Everything you wanted to know about OMi... (Q&A from Vivit technical webinar)

Thank you to everyone who attended the Vivit webinar. The recording is now available for viewing on Vivit’s web site. You can also download or view the presentation slides in PDF format. There were many questions from the audience; Jon Haworth's and Dave Trout's answers appear below. I have grouped the questions by topic.


Product Structure

Q: Are these 3 different modules to be purchased separately (topology, event and service views)?

A: Yes, three different modules. OMi Event Management Foundation is the base product and is a requirement before either of the other two products can be installed. OMi Health Perspective Views and OMi Topology Based Event Correlation are optional modules.

Q: How is the licensing done?

A: There are three separate OMi modules. OMi Event Management Foundation is the base product and is a requirement before either of the other two products can be installed. OMi Health Perspective Views and OMi Topology Based Event Correlation are optional modules. Each module is priced / licensed separately and the pricing model is 'flat' - you purchase the license(s) required and that is all (no CPU or tier or connection based pricing).

Q: How does that scale to thousands of machines?

A: Since we have just introduced OMi, we don't yet have a lot of "real" scalability data to report. However, our internal testing so far indicates that OMi can handle the typical event rates handled by OMW/OMU in terms of forwarding events. Like OM today, the scalability of the total solution is not so much limited by how many thousands of machines are being managed but by the total event rate being handled.


Integration with Operations Manager, BAC, UCMDB

Q: Is there any description of the interface between OM and OMi?

A: There are two interfaces used: 1) message forwarding from OM to OMi, and 2) a Web Services interface for message changes and topology synchronization.

Q: How does the integration work with Operations Manager on Unix?

A: As mentioned during the webinar, OMi requires either OMU or OMW as the event consolidation point for forwarding events into OMi. The event forwarding is configured in OM exactly the same way as if forwarding to another OM server. For message updates and topology synchronization, a Web Services interface is used.

Q: Since it was mentioned that it works with both OMU 9.0 and OMW 8.10, does it work with the mentioned SPIs on both platforms?

A: Yes. We are updating the SPIs to be "OMi ready". What this really means is that we're adding a little extra information to the event messages (via Custom Message Attributes) to make it 'easier' for OMi to associate a message with the correct CI in the UCMDB, and to include specific indicators needed for the TBEC rules in OMi. For OMU 9 we will release some updated SPIs soon which include enhanced discovery - very similar levels of discovery to what OMW has. The discovery subsystem is an area that we enhanced in OMU 9, and we want to be able to use the SPI discovery data as the 'starting point' for populating and maintaining CI and relationship information in the UCMDB - which is what helps to drive the logic in OMi.

Q: How flexible is the integration with the BAC products? Is it factory-built, and does it need factory modification for target environment requirements?

A: OMi and BAC use the same UCMDB instance, so they are tightly integrated 'out of the box'. OMi is completely built on top of the BAC platform technology. It supports the same security mechanisms, the same HA configuration options, the same user/group definitions, etc. In short, OMi is just like any other BAC "application" that is leveraging the platform.

Q: In the installation guide, it says that one of the requirements is to install the "BSM platform". What exactly is meant by "BSM platform"?

A: BSM platform means "BAC". OMi 8.10 requires BAC 8.02 as the BSM platform.

Q: Can you run OMi without BSM?

A: No, the BSM platform provides the user interface 'framework' and the runtime UCMDB. OMi plugs into the BSM foundation.

Q: Which security model will take precedence - the OMU responsibility matrix or the BAC security for views?

A: OMi security is entirely based on the BAC platform features. Access to OMi views, admin UIs, etc. is all controlled through the standard BAC security features (users/groups, roles, permissions, etc.).

Q: What is the pricing policy if you do / do not have BAC already installed?

A: Having BAC installed makes no difference to the price. OMi includes all components needed (runtime UCMDB etc.) in the license. Pricing is based on a 'flat' price for each of the three modules (see the earlier question). You need to contact your local HP sales representative to obtain local pricing.

Q: How does the CI tree view scale?

A: The CI Tree view is basically a UCMDB View/TQL under the covers. TQLs in UCMDB are tuned for VERY efficient retrieval of CI information.


Integration with Ticketing Systems (Service Manager, Service Center)

Q: How does OMi interact with a ticketing system like Service Manager or Service Center? Will the CI's health be reflected based on ticket info?

A: In this first release of OMi, there is no direct interaction with a ticketing system. The interaction is driven through the existing OM (OMW or OMU) to Service Manager / Service Center interface. Because OMi synchronizes message changes back to the OM server that it is connected to, trouble tickets can be triggered from that OM server.

Q: How does this interface to Service Manager 7?

A: The interface to SM 7 is driven through the existing OM (OMW or OMU) interface to Service Manager. Because OMi synchronizes message changes back to the OM server that it is connected to, trouble tickets can be triggered from that OM server.

Q: The slides implied "assignment", which looked similar to NNMi. How do the new features of OMi integrate with Service Manager?

A: The concept of assignment is 'internal' to OMi. In many organizations the tier 1 support personnel will deal with non-business-service-impacting issues without raising a trouble ticket. NOTE: this is purely dependent on the individual process and organization structure that is selected; we know that a lot of companies work this way to minimize the number of TTs. Some organizations insist that every actionable 'incident' becomes a TT. Where an event is dealt with in OMi, assignment makes sense; where events are forwarded to SM7 or another TT system, assignment will likely take place in the Incident / Helpdesk system.

Q: Will OMi integrate with ITSM (the change management app from Front Range)? Also, I'm assuming that we will need to purchase CMDB for event correlation regardless - is that true?

A: We cannot comment on the Front Range application. It is likely that an integration may be possible, but it would be wise to verify with the vendor what external interfaces they provide for integrating event management systems with their product. No, you do not need to purchase UCMDB - we provide a 'free' runtime with OMi.


UCMDB, Discovery and Smart Plug-Ins (SPIs)

Q: Is it necessary to have UCMDB to have OMi?

A: OMi ships with a "BAC 8.02" media kit. This actually provides the BSM platform - including UCMDB - and is licensed using your OMi license key. If you do not have an existing UCMDB then this will provide a runtime UCMDB as part of the OMi product package. If you have an existing BAC 8.02 installed (which includes UCMDB) then you can utilize that for OMi.

Q: Is discovery best done in OMi or UCMDB?

A: All discovery data is maintained in the UCMDB. The 'base' discovery for OMi will be provided by the Smart Plug-Ins that have been deployed from the OMW or OMU instance that OMi is connected to. Additional discovery data can be added to the UCMDB - for example from NNMi or DDM - and OMi will make use of this discovery data if it exists. If using DDM for discovery, DDM-Advanced is recommended, since it can discover not only hosts but also applications and their relationships.

Q: Can you please tell me if DDMi can be used as a feed?

A: Yes. Servers discovered by DDMi are inserted into UCMDB. However, be aware that DDMi does not discover applications and dependencies/relationships. DDM-Advanced is the recommended discovery approach if you plan to use OMi and leverage the TBEC rules in particular.

Q: If UCMDB already has CIs populated by DDM, would new sources like NNMi or SPIs conflict with them? In other words, do we need a clean UCMDB?

A: No. A clean UCMDB is not required. OMi is designed to work with CIs regardless of how they are discovered and inserted into the UCMDB. In general, reconciliation of CIs discovered from multiple sources is handled automatically.

Q: Can you clarify what you mean by "we are including these SPIs"? Does this mean it's part of the shrink-wrap deliverable with OMi? What specifically will the virtualization SPI provide? We were considering another product for that space, but want to hear more about those capabilities.

A: We are not including SPIs with OMi. We are including pre-defined content (event type indicator mappings, health indicators, TBEC correlation rules) for the SPIs that we noted. If you have these SPIs deployed, then the time to value when OMi is deployed will be very quick. HP released a SPI for Virtualized Infrastructure monitoring earlier this year. The initial focus is on VMware, but we will be providing an update soon with more features. You can contact your HP Software Sales Representative to get more details of the specific functionality provided.

Q: What is the virtualization SPI? Is it the nWorks SPI?

A: No. HP released the Smart Plug-In for Virtualized Infrastructure early in 2009. This is an HP-developed and marketed product.

Q: nWorks is the "SPI" we were considering.

A: That is a different SPI and is based on a different architecture (agentless polling). It has no OMi content at present, and it will be the responsibility of Nworks / Veeam to provide this.


KPIs (Key Performance Indicators)

Q: What is a KPI?

A: KPI means Key Performance Indicator.

Q: Where do you define the KPIs?

A: OMi provides four KPIs to the BAC platform: Operations Performance, Operations Availability, Unresolved Events, and Unassigned Events. These are defined by OMi, not by users. What IS configurable is which Health Indicators (HIs) are assigned to impact either the Operations Performance or Operations Availability KPI for specific CI types. This is done using the Indicator Manager in OMi.

Q: If the difference is the KPIs, why is data not collected from PM? Instead, I see that the data is collected from OVPA & OV agents.

A: OMi is focused around event processing. Events (alerts) are 'collected' from OVPA and OV agents to enable operations staff to understand what needs to be 'fixed'. PM (Performance Manager) is one tool that can be used to assist in the analysis / diagnosis of performance problems. PM is actually integrated into the OMi user interface.


Topology-Based Event Correlation (TBEC)

Q: In the slide with "Carol" and "Bill", they applied their knowledge to (I guess) develop some rules? Is that work that still has to be done manually? What were they developing - KPIs?

A: No, not KPIs. The example is there to illustrate the (simple) process of creating very powerful correlation rules: TBEC rules are simple to create, but the correlation engine 'chains' them together to provide quite complex correlation logic which adapts to the topology that has been discovered. We (HP) are providing content (Event Type Indicators, Health Indicators, and TBEC rules as per "Carol and Bill") for a number of our existing Operations Manager Smart Plug-Ins with OMi, and we will continue to add additional content moving forwards. You would only need to undertake this process where HP does not provide out-of-the-box content with OMi.

Q: I have some questions regarding TBEC. Is there any experience regarding the performance? How many events can be handled by the correlation engine per second?

A: The engine is tuned for very high performance. It is basically the same engine that is used in NNMi for correlations.

Q: With topology synchronization with NNMi, do you have to have OMi licenses for every node in NNMi as well? I.e., if you are using topology synchronization with NNMi, will it only show the nodes from NNMi that have OM agents installed?

A: No. All CIs in the UCMDB are visible to OMi. No additional license costs are required for NNMi nodes which are added to the UCMDB.

Q: Which language is used for the correlation rules? And where are the rules defined - UCMDB?

A: TBEC is configured in the OMi Correlation Manager GUI; there is no programming language involved. The rules are based on topology (a view from the UCMDB) and on specific Health Indicators with specific HI values.

Q: Does OMi support the execution of validation routines when closing an alert/event that also closes other related items?

A: Not currently out of the box. There are several configurable settings which affect TBEC behavior (e.g. the correlation time window, automatic extension of time windows, etc.), but currently this is not one of them. We are considering additional options for the future.


OMi Features

Q: Scalability, high availability, cluster support? Estimated max seats before going distributed?

A: OMi supports the same cluster/HA features as supported by BAC. For example, you can have multiple gateway servers connected to a clustered Data Processing Server and a remote database server. In this case, OMi software is installed on each of these separate servers (gateways and DPS). In general, the "max seats before going distributed" (i.e. adding gateway servers) would be driven by the same considerations as documented for BAC itself. More information specific to OMi environments will be available over time as we have a chance to do further testing and characterization.

Q: Does OMi have a reports generator showing things like daily TBEC activity, etc.?

A: Not currently. However, the BAC reports (e.g. KPIs over Time) can be used to look at how the OMi KPIs on CIs are changing over time.

Comment: We feel that most of these features being discussed in OMi should have come as an upgrade to OMW. There are too many modules to buy and try to integrate ourselves. For example, we wanted a better version of OVOWeb to come as an upgrade in OMW 8.1. Too many products to buy just to manage our network.

A: OMi provides discrete and incremental value above and beyond what is provided in OMW or OMU. We are continuing to enhance both OMW and OMU (for example, the recent release of OMU 9.0), and customers who are happy with the capabilities of these platforms can continue to move forwards and take advantage of the enhancements that we are providing. There is no requirement to move to OMi.

Comment: We feel we are being charged for features that were supposed to be in products that we already purchased. We are not happy about the tactic of releasing new products to fix features that were advertised in prior software. As a consultant, even I get lost in the vast amount of monitoring tools being sold by HP.

A: OMi provides discrete and incremental value above and beyond what is provided in OMW or OMU. This functionality was never offered as part of OMW or OMU - it is new and unique to OMi. The reality is that it would have been extremely difficult, and time-consuming (slow to release), to provide the high-value capabilities of OMi within OMW or OMU. The strategy we have chosen is to base these new capabilities on a 'clean' build using contemporary technologies - but HP has specifically ensured that existing OM customers who wish to take advantage of these new capabilities can do so without having to disrupt their existing OM installation.

Q: I had some issues when trying to set up and run the synchronization tool and event forwarding. Who can I contact?

A: You should contact your normal HP support channel for assistance.


Other

Q: Is there an estimated timeline for detailed technical training on OMi?

A: We have just run a series of virtual instructor-led training sessions for our partners. HP Education Services will be releasing an OMi class in the near future.

Q: Where can I get an evaluation version of OMi?

A: You can request a DVD from the trial software web site. A download will be available at http://www.hp.com/go/omi soon.


 


 For HP Operations Center, Peter Spielvogel.


Get the latest updates on our Twitter feed @HPITOps http://twitter.com/HPITOps


Join the HP OpenView & Operations Management group on LinkedIn.



 

Free Webinar: HP Operations Manager i Software Deep Dive Presentation

My colleague and consolidated event management expert Jon Haworth is the guest speaker at an upcoming Vivit webinar on Tuesday, July 21. Vivit is the independent HP Software users community.



Jon will talk about using an operations bridge effectively and how the latest advanced correlation and visualization technology can help you reduce downtime. His presentation will address:
• What are the major differences between HP Operations Manager and HP Operations Manager i software?
• How does Topology Based Event Correlation (TBEC) work?
• How does HP OMi fit into my existing Operations Manager environment?


There will be plenty of time for Jon to answer your questions at the end of the session.


For Operations Center, Peter Spielvogel.

Consolidated Event Management Podcast

My colleague and consolidated event management guru Jon Haworth has recorded a podcast with Dennis Drogseth of EMA about the benefits of consolidated IT event management. During this 15-minute podcast, Dennis and Jon discuss:



  • Benefits of an integrated approach to IT Operations

  • Key groups, titles, organizations, roles, initiatives, etc. needed to drive an integrated approach to managing Operations

  • The key underlying technologies

  • Where event consolidation fits


Listen to the Consolidated Event Management podcast (registration required)



At Software Universe, many of my conversations with customers focused on how they can use Operations Manager to consolidate events from across their organization.


For Operations Center, Peter Spielvogel

OMi Webinar and Demo Now Available

Every time I speak to customers about consolidated event and performance management, they want to know HP’s vision. What does the end state look like? How do all the pieces fit together to save my company money? How does an Operations Bridge drive efficiencies? How does OMi extend my existing monitoring infrastructure? Now we have a recorded webinar that answers these questions.



In 25 minutes, Jon Haworth, one of the Product Marketing Managers for Operations Center, will explain how to:



  • increase the efficiency of managing IT Operations

  • cut costs while improving quality of business services

  • speed the time to problem resolution


In addition, Dave Trout shows a short demo of topology-based event correlation in action, including how to:



  • filter events and identify root causes

  • use system health indicators and KPIs to summarize availability and performance

  • visualize configuration items in the context of business services


See the OMi webinar now.


For Operations Center, Peter Spielvogel.

When is a NOC an Operations Bridge?

I've been pondering the recognition of the term "Operations Bridge" for some time now and decided I'd air some thoughts and see what people think.
 
The term "NOC" (Network Operations Center) has been floating around for years, it seems to originate in the telco world but has been adopted by many organizations to refer to some sort of centralized operations function.


But then a lot of organizations still use the term NOC to refer to the Network (only) operations center - the silo which owns and operates the network.
 
So there's the problem that I have with the term NOC: it's somewhat indistinct and means different things to different people.
 
ITIL V3 recognizes the term "Operations Bridge" as part of the "Service Operation" discipline:
"A physical location where IT Services and IT Infrastructure are monitored and managed."
 
My view is that this is a nice, clear definition of the 'modern NOC' - the place where ALL IT infrastructure monitoring comes together and is related to the services which depend on the infrastructure. It avoids confusion about whether we're talking about a "network only" monitoring silo or a fully consolidated event and performance management organization.
 
We're using the term Operations Bridge in our own outbound marketing materials. But here is the "rub"... We've done some surveys, and the term "Operations Bridge" is not universally recognized - i.e. people do not instantly recognize it or cannot explain what it is.
 
This is not everyone, of course, but it is true of a large proportion of the people that we tested the term with. I have to add that recognition levels are higher in Europe than in the US - maybe something to do with the broader adoption of ITIL.
 
Interestingly, as soon as you start to explain what an "Operations Bridge" is, people "get it" - you don't even need to finish the explanation. It just makes so much sense - and everyone understands the concept of a "bridge" as a central point of monitoring and control, either through some nautical knowledge or a passion for Star Trek.
 
So, I'm on a campaign to drive widespread adoption of the term "Operations Bridge" - and move away from the indistinct and sometimes confusing term NOC. Make NOC exactly what it states - a NETWORK Operations Center - and use Operations Bridge to describe a 21st-century consolidated IT infrastructure monitoring function.
 
What do you think? Please enter your response in the comment field below. You may respond anonymously, if you choose.


(A) Yes, NOC should be network only; "Operations Bridge" is the centralized monitoring point
(B) No, NOC is the right term
(C) It does not matter, both terms can be used
(D) Other (please elaborate)


For Operations Center, Jon Haworth

Driving down OpEx with technology

It's been a good number of months since I last traveled in Europe, but I just spent a couple of weeks hopping around on business. There were some interesting changes to the travel experience which got me thinking about parallels in the IT Operations world.
 
First was the check-in experience with Air France / KLM. Basically self-service check-in at a kiosk, with a bag drop. "Nothing new there!" I hear you say. Well what was new was that it was not optional - at least not as far as I could see. There were no check-in agents at desks - only a few floor-walkers minding the kiosks and folks on the bag drop. Obviously this represents a significant cost saving, but that is not my point.


We've all seen this stuff deployed in airports for a while, but airlines have been cautious about forcing customers to use it. That was not the case here. All of the regular economy class customers had to use the automated check-in. I guess the airline has overcome (or overridden) any fears about the effectiveness of the technology, or the danger of impacting the service provided to customers, in favor of some tangible reductions in OpEx.
 
At the other end of the trip I also saw changes in hotel check-out. Now, the US has been pretty good at speeding check-out by providing express check-out services. You know the deal: your bill is posted under your door at some horribly early hour of the morning, and your credit card is charged the amount shown unless you decide to go and pursue the regular check-out service. Of course the hotel also benefits because they can deploy fewer staff to service the check-out transactions. Europe has not adopted this approach. I'm not sure why, but I suspect differences in legislation regarding charging someone's credit card without them being present.
 
So the challenge for European hotels is how to maintain quality of service (a rapid check-out) while reducing the OpEx of having lots of staff present during the check-out rush hour in the mornings. A Pullman Hotel in Paris had applied technology to solve this problem. As I headed into the lobby to check out on Friday morning, I was ushered towards a bank of (you guessed it) kiosks. The experience was a carbon copy of the airport check-in - just different cards being provided by me and documents being printed by the kiosk. After I had been 'processed' I sat in the lobby waiting for some colleagues to join me. There were three check-out staff that I could see - one on regular desk duty, two floor-walking the kiosks. The Pullman is a big hotel.
 
Then I started thinking about how the airlines and the hotel were displaying behaviors which are very similar to what we are seeing in the IT Operations space. The drive is to reduce OpEx whilst maintaining service levels. The approach adopted is to use technology to automate activities which have required manual interactions.
The technology is being put into service in spite of any misgivings over its ability to be 100% effective. The companies are willing to take some calculated risks in order to get a demonstrable reduction in OpEx.
 
I see the same behaviors in IT Operations. Forward-looking companies are applying technology to automate wherever possible - automate event correlation, automate analysis and problem isolation, automate fixes, automate provisioning. The technology to do a lot of this has been around for years, but previous objections to its deployment - fears over the certainty that the technology will be 100% effective - are being pushed aside as sights are firmly set on reducing OpEx.
 
For Operations Center, Jon Haworth

Capacity Planning: A long-dead mystical art

.... well, maybe not.


Back in the mid-80s and early 90s, I made my living doing performance analysis and capacity planning for HP customers.


The roots of capacity planning as a discipline come from the mainframe days, when enormously expensive hardware hosting multiple applications, all competing for resources, made it essential to plan for new workloads or hardware changes.


Even with the HP minicomputers I worked with, when someone was considering spending hundreds of thousands of dollars on an upgrade, it was realistic to invest in a few days of consulting. We would build a model of the applications and the hardware and do some serious "what if" analysis to determine what configuration was actually required to do the job.
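For anyone who never saw this kind of modeling, here is a minimal sketch of the sort of "what if" arithmetic those studies rested on - a simple M/M/1 queueing approximation in Python, with all the numbers (and the single-resource assumption) invented for illustration:

```python
# Minimal "what if" capacity model using an M/M/1 queueing approximation.
# All figures are illustrative; a real study would model several
# resources (CPU, disk, memory) and multiple competing workloads.

def mm1_response_time(arrival_rate, service_time):
    """Average response time for an open, single-server queue."""
    utilization = arrival_rate * service_time
    if utilization >= 1.0:
        return float("inf")  # the server saturates; the queue grows without bound
    return service_time / (1.0 - utilization)

service_time = 0.050   # seconds of CPU per transaction (hypothetical)
current_rate = 12.0    # transactions per second today (hypothetical)

# What if the workload grows by 0%, 25%, 50%?
for growth in (1.0, 1.25, 1.5):
    rate = current_rate * growth
    rt = mm1_response_time(rate, service_time)
    print(f"load x{growth:.2f}: utilization {rate * service_time:.0%}, "
          f"response time {rt * 1000:.0f} ms")

# What if we also upgrade to a CPU that is twice as fast?
rt = mm1_response_time(current_rate * 1.5, service_time / 2)
print(f"upgraded CPU at x1.50 load: response time {rt * 1000:.0f} ms")
```

Even this toy shows why the analysis was worth paying for: at 50% growth the same hardware still "works", but response times quadruple - the kind of cliff that is invisible if you only look at average utilization.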


But times change. Hardware prices dropped significantly - particularly with the introduction of Intel-based "industry standard servers". Increasingly, applications were deployed in distributed configurations with one application per server.


The capacity planning problem became much 'simpler'. With one application per server - so no issues with applications interacting with each other - and cheap hardware, capacity planning took a back seat to the "throw hardware at it" approach. It became cheaper to add a CPU or more memory and see what happened, than it was to conduct a capacity planning exercise.


What goes around comes around. I'm seeing a couple of things happening which are changing attitudes towards performance management and capacity planning.


First is that most organizations are trying to "do more with what they have". There is an awareness that there is likely to be spare capacity somewhere in the environment - it's just a matter of finding it. So we're seeing a lot more demand for enterprise-wide performance data collection and reporting. It's worth spending a little money and some time to understand what server and network resources you have available. It's also the type of diligence that CxOs are expecting in the current economic climate - they expect staff to have exhausted all reasonable options before asking to purchase additional assets.
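As a toy illustration of that diligence, the sketch below scans collected peak-utilization figures and flags the servers with obvious headroom. The server names, numbers and threshold are all invented; in practice the input would come from an enterprise-wide performance data collector:

```python
# Toy spare-capacity scan over collected performance data.
# Server names, figures and the threshold are invented for illustration.

peak_cpu_utilization = {   # peak busy-hour CPU %, per server
    "web01": 85, "web02": 22, "app01": 64,
    "app02": 18, "db01": 71, "batch01": 9,
}

HEADROOM_THRESHOLD = 30    # servers peaking below this have spare capacity

candidates = sorted(
    (name, pct) for name, pct in peak_cpu_utilization.items()
    if pct < HEADROOM_THRESHOLD
)

print("Servers with spare capacity:")
for name, pct in candidates:
    print(f"  {name}: peak CPU {pct}% - candidate for additional workload")
```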


The second item is the return of the mainframe. I mean that figuratively, of course, but large virtual server hosts are "the new mainframe". The hardware costs can be substantial as organizations provision powerful, resilient platforms to host multiple virtual machines. And the challenge of having multiple workloads competing for resources is back - in this case each workload is a VM. Organizations want to make optimum use of these expensive VM hosts' resources, but they also want to ensure that service levels are maintained when combining VMs. And that requires performance data collectors that can gather capacity-planning data from virtualized platforms.
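The VM-stacking question often starts as back-of-the-envelope arithmetic like the sketch below: sum the measured peak demands of the candidate VMs and compare them against the host's capacity, less a safety margin. All the figures are invented for illustration:

```python
# Back-of-the-envelope check: will these VM workloads fit on one host
# while preserving headroom for service levels? All figures are invented.

HOST_CPU_GHZ = 2.4 * 16    # 16 cores at 2.4 GHz
HOST_MEM_GB = 256
SAFETY_MARGIN = 0.8        # plan to use no more than 80% of capacity

vm_demands = [             # (name, peak CPU GHz, peak memory GB)
    ("erp-app", 9.5, 48),
    ("erp-db", 12.0, 96),
    ("mail", 6.0, 32),
    ("intranet", 3.5, 24),
]

cpu_needed = sum(cpu for _, cpu, _ in vm_demands)
mem_needed = sum(mem for _, _, mem in vm_demands)
cpu_budget = HOST_CPU_GHZ * SAFETY_MARGIN
mem_budget = HOST_MEM_GB * SAFETY_MARGIN

print(f"CPU: {cpu_needed:.1f} GHz needed, {cpu_budget:.1f} GHz budgeted "
      f"({'fits' if cpu_needed <= cpu_budget else 'does NOT fit'})")
print(f"MEM: {mem_needed} GB needed, {mem_budget:.0f} GB budgeted "
      f"({'fits' if mem_needed <= mem_budget else 'does NOT fit'})")
```

(With these invented numbers the memory fits but the CPU does not - exactly the sort of result that sends you back to the performance data to ask whether those peaks actually coincide.)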
 
I have seen a number of customer requests recently where tools to support capacity planning activities - across enterprises and within virtualized environments - have been front and center.
 
Looks like the mystical art has risen from its grave.


For Operations Center, Jon Haworth

Automated Infrastructure Discovery - Extreme Makeover

Good Discovery Can Uncover Hidden Secrets
Infrastructure discovery has something of a bad reputation in some quarters. We've done some recent surveys of companies using a variety of vendors’ IT operations products. What's interesting is that, in our survey results, automated infrastructure discovery fared pretty badly - both in terms of the support it received within organizations and in terms of the success those organizations believed they had achieved.
 
There are a number of reasons underlying these survey results. Technology issues and organizational challenges were highlighted in our survey. But I believe that one of the main 'issues' discovery has is that people have lost sight of its basic value and the benefits it can bring. Organizations see 'wide reaching' discovery initiatives as complex to implement and maintain - and they do not see compelling short-term benefits.
 
I got to thinking about discovery and the path that it has taken over the last 15 or 20 years. I remember the excitement when HP released its first cut of Network Node Manager. It included discovery that showed people things about their networks that they just did not know. There were always surprises when we took NNM into new sites to demonstrate it. Apart from showing folks what was actually connected to the network, NNM also showed how the network was structured, the topology.
 
Visualization --> Association --> Correlation
And once people can see and visualize those two sets of information, they start to make associations about how events detected in the network relate to each other - they use the discovery information to optimize their ability to operate the network infrastructure.
 
So the next logical evolution for tools like NNM was to start building some of the analysis into the software as 'correlation'. For example, the ability to determine that the 51 "node down" events you just received are actually just one "router down" event and 50 symptoms generated by the nodes that are 'behind' the router in the network topology. Network operators could ignore the 'noise' and focus on the events that were likely causes of outages. Pretty simple stuff (in principle) but very effective at optimizing operational activities.
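To make the principle concrete, here is a toy sketch of that suppression logic in Python - the idea, not NNM's actual implementation. Given discovered topology that says which nodes sit 'behind' which router, a "router down" event lets us tag the flood of "node down" events as symptoms. The topology and events are invented:

```python
# Toy topology-based suppression: nodes behind a failed router generate
# "node down" symptoms; only the router event is the likely cause.
# The topology and the event stream are invented for illustration.

behind_router = {   # discovered topology: router -> nodes reached through it
    "router-7": {f"node-{i}" for i in range(1, 51)},
}

events = [("router-7", "router down")] + [
    (f"node-{i}", "node down") for i in range(1, 51)
]

downed_routers = {src for src, kind in events if kind == "router down"}

causes, symptoms = [], []
for src, kind in events:
    if any(src in behind_router.get(r, set()) for r in downed_routers):
        symptoms.append((src, kind))   # unreachable behind a failed router
    else:
        causes.append((src, kind))     # worth an operator's attention

for src, kind in causes:
    print(f"CAUSE: {src} - {kind}")
print(f"({len(symptoms)} symptom events suppressed as noise)")
```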
 
Scroll forward 15 years. Discovery technologies now extend across most aspects of infrastructure and the use cases are much more varied. Certainly inventory maintenance is a key motivator for many organizations - both software and hardware discovery play important roles in supporting asset tracking and license compliance activities. Not hugely exciting for most Operational Management teams.
 
Moving Towards Service Impact Analysis
Service impact analysis is a more significant capability for Operations Management teams and is a goal that many organizations are chasing. Use discovery to find all my infrastructure components - network devices, servers, application and database instances - and tie them together so I can see how my Business Services are using the infrastructure. Then, when I detect an event on a network device or database I can understand which Business Services might be impacted and I can prioritize my operational resources and activities. Some organizations are doing this quite successfully and getting significant benefits in streamlining their operational management activities and aligning them with the priorities of the business.
 
But there is one benefit of discovery which seems to have been left by the side of the road. The network discovery example I started with provides a good reference. Once you know what is 'out there' and how it is connected together, you can use that topology information to understand how failures in one part of the infrastructure cause 'ghost events' - symptom events - to be generated by infrastructure components which rely in some way on the errant component. When you get 5 events from a variety of components - storage, database, email server, network devices - then if you know how those components are 'connected', you can relate the events together and determine which are symptoms and which is the likely cause.
 
Optimizing the Operations Bridge
Now, to be fair, many organizations understand that this is important in optimizing their operational management activities. In our survey, we found that many companies deploy skilled people with extensive knowledge of the infrastructure into the first-level operations bridge to help make sense of the event stream - to work out which events to work on and which are dead ends. But it's expensive to do this - and not entirely effective. Operations still end up wasting effort by chasing symptoms before they deal with the actual cause event. Inevitably this increases mean time to repair, increases operational costs and degrades the quality of service delivered to the business.
 
So where is the automation? We added correlation to network monitoring solutions years ago to do exactly this, so why not do 'infrastructure-wide' correlation?
 
Well, it's a more complex problem to solve, of course. And there is also the problem that many (most?) organizations just do not have comprehensive discovery across all of their infrastructure. Or, if they do have good coverage, it's from a variety of tools, so the data is not in one place where all of the inter-component relationships can be analyzed.
 
Topology Based Event Correlation - Automate Human Judgment
This is exactly the problem which we've been solving with our Topology Based Event Correlation (TBEC) technology. Back to basics - although the developers would not thank me for saying that, as it's a complex technology. Take events from a variety of sources, do some clever stuff to map them to the components in the discovery database (discovered using a number of discrete tools), and then use the relationships between the discovered components to automatically do what human operators are trying to do manually - indicate the cause event.
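The gist - the principle, not the product's actual implementation - can be sketched in a few lines of Python: map each event to a configuration item (CI), then mark an event as a symptom if some other active event sits on a CI that its CI depends on. The CIs, relationships and events below are all invented:

```python
# Sketch of the TBEC principle (not the product's implementation):
# events mapped to discovered CIs, with dependency relationships used
# to separate the likely cause from its symptoms. All names are invented.

depends_on = {                  # CI -> CIs it relies on (from discovery)
    "email-service": {"mail-server"},
    "mail-server":   {"db-instance", "switch-12"},
    "db-instance":   {"san-array"},
    "san-array":     set(),
    "switch-12":     set(),
}

def all_dependencies(ci, graph):
    """Transitive closure of a CI's dependencies."""
    seen, stack = set(), list(graph.get(ci, ()))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph.get(dep, ()))
    return seen

events = {                      # event id -> CI the event was mapped to
    "E1": "email-service",      # "mail delivery slow"
    "E2": "mail-server",        # "queue backlog"
    "E3": "db-instance",        # "I/O errors"
    "E4": "san-array",          # "disk enclosure fault"
}

event_cis = set(events.values())
for eid, ci in sorted(events.items()):
    # symptom if another active event sits on a CI this one depends on
    upstream = all_dependencies(ci, depends_on) & (event_cis - {ci})
    verdict = f"symptom of {', '.join(sorted(upstream))}" if upstream else "LIKELY CAUSE"
    print(f"{eid} on {ci}: {verdict}")
```

Run that, and only the storage event is flagged as the likely cause - the same call a seasoned operator would make, but in milliseconds.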
 
Doing this automatically for network events made sense 15 years ago; doing it across the complexity of an entire infrastructure makes even more sense today. It eliminates false starts and wasted effort.
 
This is a 'quick win' for Operational Management teams. Improved efficiency, reduced operational costs, senior staff freed up to work on other activities… better value delivered to the business (and, of course, huge pay raises for the Operations Manager).
 
So what do you need to enable TBEC to help streamline your operations? Well, you need events from infrastructure monitoring tools - and most organizations have more than enough of those. But you also need infrastructure discovery information - the more the better.
 
Maybe infrastructure discovery needs a makeover.

 

For HP Operations Center, Jon Haworth


 

Consolidated IT Operations: Return of the Prodigal Son

Let's face it, the concept of bringing together all of your IT infrastructure monitoring into a single "NOC" or Operations Bridge has been around for years. Mainframe folks will tell you they were doing this stuff 30 years ago.

 

Unfortunately, in the distributed computer systems world, a lot of organizations have still not managed to successfully consolidate all of their IT infrastructure operations. I see a lot of companies who believe they have made good progress; often they've managed to pull together most of the server and application operations activities, and maybe minimized the number of monitoring tools that they use.

 

But when you dig below the surface, often there will be a separate network operations team, and maybe an application support team that owns a 'special' application. And of course the admins who are responsible for the rollout of the new virtualization technology - which just "cannot" be monitored by the normal operations tools and processes.

 

And that's the problem... Often there is resistance from a number of different angles to initiatives which try to pull end-to-end infrastructure monitoring into a single place. Legacy organizational resistance is probably the biggest challenge - silos have a tendency to be very difficult to 'flatten'.

 

Another common theme is that the technical influencers (architects, consultants, application specialists etc.) in the organization create FUD that the toolset used by the operations teams is not suitable for monitoring the new technology that they are rolling out. They need to use their own special monitoring solution or the project will fail. Because it's a new technology and everyone is scared of a failed rollout, management acquiesces and another little fragmented set of monitoring technology, organization and processes is born. Every new technology has potential for this - I've seen it happen with MS Windows, Linux, Active Directory, Citrix, VMware - the list is endless.

 

"So what?" I hear you say. "What's your point?" Well, I'm seeing a lot of organizations revisiting the whole topic of consolidating their IT operations and establishing a single Operations Bridge - and making some significant changes.

 

Why now? Simple - to reduce the operational expenditure associated with keeping the lights on. In the current economic climate, organizations are motivated 'top down' to drive cost out wherever they can. Initiatives that deliver cost reductions in the short term get executive sponsors. There is also much lower tolerance for the kinds of hurdles that used to be raised as objections - organizational silos get flattened, tool portfolios get rationalized.

 

It's not just about cutting cost of course. Simply reducing headcount would achieve that goal, but the chances are that the quality of IT service delivered to the business would suffer, and there would be direct impacts on the ability of the business to function.

 

Of course, the trick is to consolidate into an Operations Bridge and still deliver the same or higher quality IT services to the business, but at reduced cost. Often the economies of scale and the streamlined, consistent processes that an Operations Bridge enables will deliver significant benefits - and reduce OpEx.

 

This is where HP's Operations Center solutions have focused for the last 12 or 15 years. In my next post, I'll talk about where HP sees the next significant gains being made - where we are focusing so we can help our customers take their existing Operations Bridge and significantly increase its efficiency and effectiveness.

 

In the meantime, if you want to read a little more about the case for consolidated operations, take a look at this white paper "Working Smart in IT Operations - the case for consolidated operations".

 

For HP Operations Center, Jon Haworth.

 

 
