Business Service Management (BAC/BSM/APM/NNM)
More than monitoring everything, BSM provides the means to determine how IT impacts the bottom line. Its purpose and main benefit is to ensure that IT Operations can determine, both reactively and proactively, where to spend their time to best impact the business. This covers event management to solve immediate issues, resource allocation, and performance reporting based on data from applications, infrastructure, networks and third-party platforms. BSM also includes powerful analytics that give IT the means to prepare, predict and pinpoint, learning behavior and analyzing IT data forward and backward in time by applying Big Data analytics to IT Operations.

Fighting or friendly, Problem Isolation and OMi

by Michael Procopio


In the post Event Correlation OMi TBEC and Problem Isolation What's the Difference, my fellow blogger, Jon Haworth, discussed the differences between TBEC and Problem Isolation. To be consistent, I'll use the acronym PI for Problem Isolation and TBEC for OMi (Operations Manager i series) Topology Based Event Correlation.


Briefly, he mentioned that TBEC works “bottom up”, that is, starting from the infrastructure, with events. PI works “top down”, that is, starting from an end user experience problem, primarily with metric (time series) data.


Jon did an excellent job describing TBEC; I’ll do my best on PI because, like Jon, I have a conscience to settle.


Problem Isolation is a tool to:


1. automate the steps a troubleshooter would go through


2. run additional tests that might uncover the problem


3. look at all metric/performance data from the end user experience monitoring and all the infrastructure it depends on


4. find the infrastructure metric that most closely matches the end user problem, using behavior learning and regression analysis techniques (developed by HP Labs)


5. bring additional data such as events, help/service desk tickets and changes to the troubleshooter


6. allow the troubleshooter to execute run books to potentially solve the problem


Potentially the biggest difference in the underlying technology is that Problem Isolation does not require any correlation rules or thresholds to be set in order to run the regression analysis that points to the problem. Like TBEC, it does require that an application be modeled in a CMDB.


An example: presume a situation with a typical composite application - web server, application server and database. No infrastructure thresholds were violated; therefore, there are no infrastructure alerts. Again, as mentioned in the previous post, end user monitoring (EUM) is the backstop. EUM alerts on slow end user performance; now what?


Here is what Problem Isolation does:


1. determines which infrastructure elements (ITIL configuration items, or CIs) support the transaction


2. reruns the test(s) that caused the alert – this validates it is not a transient problem


3. runs any additional tests defined for the CIs


4. collects Service Level Agreement information


5. collects all available infrastructure performance metrics (web server, application server, database server and operating systems for each) and compares them to the EUM data using behavior and regression analysis



Problem Isolation screenshot showing the performance correlation between end user response and SQL Server database locks


-------------------------------------------------------------------------------------------


6. determines and displays the most probable suspect CI and alternates


7. displays run books available for all infrastructure CIs for the PI user to run directly from the tool


8. allows the PI user to attach all the information to a service ticket, either an existing one or a newly created one
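To make that sequence concrete, here is a rough sketch in Python of the workflow above. Every helper function, CI name and number is invented for illustration; the real product drives these steps from the CMDB, the EUM tests and run books, and its scoring uses HP Labs' behavior and regression analysis rather than the simple stand-in used here.

def supporting_cis(transaction):
    # Step 1: CIs that support the transaction (stubbed CMDB lookup).
    return ["web01", "app01", "db01"]

def rerun_eum_test(transaction):
    # Step 2: rerun the failing test to rule out a transient problem.
    return {"response_s": 9.4, "threshold_s": 4.0}

def run_additional_tests(ci):
    # Step 3: any additional tests defined for the CI (stubbed: all pass).
    return {"ping": "ok", "synthetic_login": "ok"}

def collect_metrics(ci):
    # Step 5: available performance metrics for the CI (invented samples).
    samples = {"web01": {"cpu_pct": [20, 22, 21]},
               "app01": {"heap_pct": [55, 57, 60]},
               "db01": {"lock_waits": [3, 60, 90]}}
    return samples[ci]

def suspect_score(values):
    # Stand-in for the step 5 analysis: score how far the latest sample
    # sits from the series average. The real scoring is far more subtle.
    avg = sum(values) / len(values)
    return abs(values[-1] - avg) / (avg or 1)

def isolate(transaction):
    result = rerun_eum_test(transaction)
    if result["response_s"] <= result["threshold_s"]:
        return []                       # transient problem, nothing to isolate
    findings = []
    for ci in supporting_cis(transaction):
        run_additional_tests(ci)
        for metric, values in collect_metrics(ci).items():
            findings.append((suspect_score(values), ci, metric))
    findings.sort(reverse=True)         # step 6: most probable suspect first
    return findings                     # step 8: attach it all to a ticket

print(isolate("checkout"))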


Another key differentiator between OMi/TBEC and PI is the target user. There is such a wide variance in how organizations work that it is hard to name the role, but let me give a brief description and I think you will be able to determine the title in your organization.


There are some folks in the organization whose job is to take a quick look at a situation (typically < 10 minutes; in one organization I interviewed, < 1 minute) and determine whether they have explicit instructions on what to do via scripts or run books. When they have no instructions for a situation, they pass it on to someone who has a bit more experience and does some free-form triage.


This person might be able to fix the problem or may have to pass it on to a subject matter expert, for example, to an Exchange admin if they believe it is an MS Exchange problem. It is this second person that Problem Isolation is targeted at. PI helps automate her job, performing in seconds what might otherwise take tens of minutes to hours. If it turns out she can't solve the problem, it automatically provides full documentation of all the information collected; that alone might take someone five minutes to write up.


OMi’s target is the operations bridge console user. Ops Bridge operators tend to be lower skilled and face hundreds, if not thousands, of events per hour. Jon described how OMi helps them work smarter.


TBEC and Problem Isolation both work to find the root cause of an incident, but in different ways. Much like a doctor might choose an MRI or a CAT scan depending on the situation, TBEC and Problem Isolation are complementary tools, each with unique capabilities.


Problem Isolation will not find problems in redundant infrastructure, which OMi will. Conversely, OMi can’t help with EUM problems when no events are triggered, where Problem Isolation can.


We know this can be a confusing area. We welcome your questions to help us do a better job of describing the difference. But these two are definitely friendly.


For Business Availability Center, Michael Procopio


Get the latest updates on our Twitter feed @HPITOps http://twitter.com/HPITOps


Join the HP Software group on LinkedIn and/or the Business Availability Center group on LinkedIn.


Related Items



  1. Advanced analytics reduces downtime costs - detection

  2. Advanced analytics reduces downtime costs – isolation

  3. Problem Isolation page

  4. Operations Manager i page

HP Software Universe – Mainstage Andy Isherwood


Andy Isherwood, VP, Support & Services, kicked off Mainstage.


There are four key areas shown in the picture above. This week HP announced its IT Financial Management (ITFM) offering. Andy likened ITFM to an ERP system for IT. Information Management magazine wrote an article on HP ITFM.


HP has had offerings in IT Performance Analytics and IT Resource Optimization for a while. HP Cloud Assure was announced in May 2009: HP Unveils “Cloud Assure” to Drive Business Adoption of Cloud Services.


Some key points from his opening remarks:



  1. Prepare for coming out of the recession while cutting costs and innovating.


  2. Best in class means being good at all four: aligning to the business, taking out costs, increasing efficiency and consolidating.


  3. JetBlue, Altec and T-Mobile were the winners of the HP Software Award of Excellence.


  4. As an example of the quick ROI companies can get, Altec achieved a 10% reduction in application downtime, 20% faster response time, a 15% increase in customer satisfaction and a 300% improvement in application transaction time in 6 months.


  5. Last year we were HP Software; this year we are HP Software and Solutions. This reflects the combination of HP Software with HP Consulting and Integration. The net result is increased delivery options: in addition to offering software for in-house use, HP now has EDS, SaaS and continues with its partners.


  6. HP's SaaS business is seven years old this year and has 650 customers.


You can read other coverage of HP Software Universe in the ITOpsBlog. There are a variety of Twitter accounts you can follow:


HPITOps – covers BSM, ITFM, ITSM, Operations and Network Management


HPSU09 – show logistics and other information


HPSoftwareCTO


informationCTO


HPSoftware


BTOCMO – HP BTO Chief Marketing Officer


as well as the Twitter hashtag #HPSU09


For HP BSM, Michael Procopio


 

Advanced analytics reduces downtime costs – isolation

by Michael Procopio, Product Manager, BAC 



In the world of advanced analytics, two areas of interest to IT management are detection of a problem and isolation of a problem. Previously I wrote Advanced analytics reduces downtime costs – detection; in this post I’ll cover isolation.


In the previous post, I covered how advanced analytics finds an anomaly, potentially before a threshold is crossed.


Problem Isolation is the process of determining which component in the infrastructure is causing the problem* or incident* that we found. We will presume we are monitoring the service that is having the issue.


If one had no management tools (amazingly, I have spoken to customers in this situation), the method of trying to find a problem is to log in to each system, router, switch and potentially application (e.g. Oracle), look at the items with whatever tools are available (e.g. Windows Perfmon) and hope you find it. If you are interested in advanced analytics, this is probably not your situation.


The more typical case is that you have multiple management tools: network, system, virtualization, database and perhaps others. So if you know the domain the problem exists in, you have a good place to start. I’ve listened to podcasts and read reports that bring up a few problems with this (if you know of any good IT podcasts, please send them along):



  • ~80% of problems are sent to the network team with only ~20% being network issues

  • ~60% of problems take >10 experts to resolve

  • ~80% of the time to restore service is spent isolating the problem


Here is an analogy I use with my non-IT friends to explain why this area is needed. You are monitoring the speed of a car going across the country (pick your favorite country). You are separately monitoring the infrastructure, all:



  • roads

  • bridges

  • ferries to take cars across the water


What you don’t know is where the car is (old car, no GPS). You are getting many alerts from the roads, bridges and ferries. Which one is affecting the car? Since you don’t know what road the car is on, you don’t know if any given alert is the one affecting your car.


This is where the CMDB comes into the isolation process. The CMDB has the route the car is taking or, in our case, the items in the IT infrastructure that make up the service that has the problem.


Part one of the isolation process is to restrict what we are looking at to the relevant IT items. This greatly reduces the computational power required. For example, one customer I recently visited told me he has 2000+ servers. If we can reduce that to a few app servers and a few database servers (isn’t SOA wonderful for us operations types), that is a factor of ~200 reduction.
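As a toy illustration of that scoping step (all names and counts are invented), the point is simply that the CMDB model turns "every metric we collect" into "the handful of CIs behind this service":

# Every server reports metrics, but only the CIs in the service's CMDB
# model are worth analyzing. All names below are invented for the example.
service_model_cis = {"web01", "web02", "app01", "app02", "app03",
                     "db01", "db02", "lb01", "cache01", "cache02"}
all_metric_sources = {f"srv{i:04d}" for i in range(1990)} | service_model_cis

in_scope = all_metric_sources & service_model_cis
print(f"analyzing {len(in_scope)} of {len(all_metric_sources)} sources, "
      f"a ~{len(all_metric_sources) // len(in_scope)}x reduction")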


Part two of the isolation is the heavy math from HP Labs, with more patent filings. It is a form of regression analysis, where application or end user response time is the dependent variable and all the infrastructure metrics are independent variables. In plain terms: if end user response gets worse, find the infrastructure metrics that get worse; when end user response gets better, find the metrics that get better. The more closely an infrastructure metric tracks the end user response, the more likely it is to be the cause.
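As a rough illustration of the idea (this is plain correlation on invented data, not HP Labs' actual algorithm), here is how one could score each infrastructure metric against the end user response series:

import numpy as np

# End user response time per interval: the dependent variable (invented data).
eum_response = [1.8, 2.0, 2.1, 4.5, 7.9, 8.3, 3.0, 2.2]

# Candidate infrastructure metrics: the independent variables (invented data).
infra_metrics = {
    "db01.lock_waits": [2, 3, 3, 40, 85, 92, 12, 4],
    "app01.heap_pct":  [50, 52, 54, 56, 58, 60, 62, 64],
    "web01.cpu_pct":   [20, 24, 21, 22, 20, 23, 22, 21],
}

# Score each metric by how closely it tracks the end user response:
# worse together and better together pushes the correlation toward 1.
scores = {name: float(np.corrcoef(eum_response, series)[0, 1])
          for name, series in infra_metrics.items()}

for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:18s} correlation {score:+.2f}")

On this made-up data the database lock metric tracks the end user response far more closely than the other metrics, so it comes out on top of the suspect list.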


Again, while the math is interesting, pictures work better for me.


 


The thick grey line is the end user response; the red-purple line is the most closely correlated metric, in this case a database metric. Just so you don’t have to strain your eyes, we provide a table like this (from a different problem) showing the weighted correlation scores.


 


Part three of the isolation is to include non-time-series data. In the screen capture below you see planned changes and incident details (think alerts) on the timeline. Unplanned changes can also be displayed. Changes are pulled from the CMDB and incidents can come from any management system that can send alerts. Since we know that most problems stem from changes, that is an important component. Finally, tickets from the help desk are included on the timeline, for the case where users are doing the monitoring.


 


Altogether, this automates a number of things the operations teams already do and adds some math to help isolate problems.


 


*Incident and problem are ITIL terms. There may be many incidents that are symptoms of an underlying problem.


 




 


Related Items



Since I asked for podcasts, here are some I listen to:


Advanced analytics reduces downtime costs - detection

by Michael Procopio 


In the world of advanced analytics, two areas of interest to IT management are detection of a problem and isolation of a problem. In this post I'll cover detection.


Problem detection, typically called anomaly detection in analytics circles, started in a very basic way: take a metric, say CPU utilization, set a threshold for it, and any time the threshold is crossed, we have an anomaly.
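In code, that first-generation check is essentially a one-liner; the threshold and the samples below are arbitrary, invented values:

cpu_pct = [35, 42, 38, 91, 40, 37]   # CPU utilization samples (invented)
THRESHOLD = 85                       # picked by hand, which is the problem

anomalies = [(i, v) for i, v in enumerate(cpu_pct) if v > THRESHOLD]
print(anomalies)                     # [(3, 91)]: a one-sample spike still alerts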


This, of course, has many problems:


  • How do I know where to set a threshold?

  • The right level may be different at different times.

  • If there is a one-sample spike above the threshold, is that really an anomaly I care about? (For some it is, but not for most, in my experience.)


The next step in setting thresholds was using the standard deviation (STD). Create a sleeve of upper and lower bounds that covers a large percentage of the situations measured (+/- 1 STD covers 68.2%) and use that. This has some of the same problems as above. However, let’s focus on the time period problem.


The next step is to set thresholds by time of day. With this added capability, I can set a reasonable threshold for the typical 10am and 2pm peak traffic periods separately, and alerts still come if there is unusual behavior at 8am. This quickly leads to “my Mondays are busier than most of my other days”. To avoid false alerts, this leads us to time of day and day of week, where we keep the standard deviation for each of the 168 hours of the week.
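Here is a small sketch of that hour-of-week scheme (the traffic model and numbers are invented): keep a mean and standard deviation for each of the 168 (day, hour) slots and flag values that fall outside the sleeve for their own slot.

import statistics
from collections import defaultdict

# Invented history: (day_of_week 0-6, hour_of_day 0-23, requests/sec), with
# busier 10am and 2pm peaks and busier Mondays, three "weeks" per slot.
history = [(dow, hour, 100 + 50 * (hour in (10, 14)) + 20 * (dow == 0) + noise)
           for dow in range(7) for hour in range(24) for noise in (-5, 0, 5)]

# One baseline per hour of the week: 7 days * 24 hours = 168 slots.
by_slot = defaultdict(list)
for dow, hour, value in history:
    by_slot[(dow, hour)].append(value)
baseline = {slot: (statistics.mean(v), statistics.stdev(v))
            for slot, v in by_slot.items()}

def is_anomaly(dow, hour, value, n_std=3):
    mean, std = baseline[(dow, hour)]
    return abs(value - mean) > n_std * std

# 165 req/s is normal for the Monday 10am peak but unusual for Wednesday 8am.
print(is_anomaly(0, 10, 165), is_anomaly(2, 8, 165))   # False True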


The next problem: it is the end-of-quarter booking and shipping madness, or it is Black Friday (the largest shopping day of the year), and we realize we need to add a seasonally adjusted set of thresholds as well. And something that seasonality can't take into account is macro events, such as a weak economy affecting purchasing.


Of course, none of these will take into account the spikes mentioned above. How to solve all these problems? Hmmm.


There is a completely different approach, for which HP Labs has a patent filing, which uses more sophisticated machine learning. Like the other approaches, it breaks time up into segments, which, in the paper Achieving Scalable Automated Diagnosis of Distributed Systems Performance Problems, are called epochs. Unlike the other approaches, it does not simply compare now to a predetermined set of threshold levels.


 


This method compares ‘now’ to recent epochs as well as previous learning and makes a determination of what behavior is good and bad. While a spike is bad, if the epoch is behaving well overall it is considered to have good behavior. There is a lot of math behind this, but I find looking at a picture much more obvious.
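Before the picture, here is a loose caricature of that idea in code (this is not the HP Labs algorithm, and all the numbers are invented): judge each epoch as a whole against what recent epochs looked like, so a single spike inside an otherwise normal epoch does not raise an alert.

import statistics

def epoch_level(samples):
    # Summarize an epoch by its median; a lone spike barely moves it.
    return statistics.median(samples)

def abnormal(current_epoch, recent_epochs, n_std=3):
    # Compare the whole current epoch to the behavior of recent epochs.
    levels = [epoch_level(e) for e in recent_epochs]
    mean, std = statistics.mean(levels), statistics.stdev(levels)
    return abs(epoch_level(current_epoch) - mean) > n_std * std

recent = [[10, 11, 9, 12, 10], [11, 12, 10, 11, 9], [9, 10, 11, 10, 12],
          [12, 11, 10, 9, 11], [10, 9, 12, 11, 10], [11, 10, 9, 12, 11]]
spiky_but_ok = [10, 11, 48, 10, 9]    # one spike, epoch still behaves well
degraded = [31, 35, 34, 38, 33]       # the whole epoch has shifted

print(abnormal(spiky_but_ok, recent), abnormal(degraded, recent))   # False True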


 


 


Notice the gray outline that looks like a city skyline. This is what the algorithm has determined is ‘normal’ or good behavior. The hatch lines on the right show where it found an anomaly.


These advanced algorithms are implemented in HP Problem Isolation. In the next post, I’ll discuss how analytics are used to find the source of the problem.




