Business Service Management (BAC/BSM/APM/NNM)
More than just monitoring everything, BSM provides the means to determine how IT impacts the bottom line. Its purpose and main benefit is to ensure that IT Operations can reactively and proactively determine where to spend their time to best impact the business. This covers event management to solve immediate issues, resource allocation, and reporting on the performance of applications, infrastructure, networks and third-party platforms. BSM includes powerful analytics that give IT the means to prepare, predict and pinpoint by learning behavior and analyzing IT data forward and backward in time, applying Big Data analytics to IT Operations.

Do you have an ITIL problem manager?

by Michael Procopio


What do Healthcare, Banking and Managed Service Providers have in common? Well, at least the ones I spoke to all had ITIL problem managers. I’ll define problem management, then give some examples.


What is the role of an ITIL problem manager? It is a role introduced in ITIL v3 as part of Continual Service Improvement. To understand the role, we need to back up and describe incident and problem management.




An incident is a service interruption. Incident management is the process of restoring normal service. Frequently this is done with a workaround, like killing and restarting a process or rebooting a system. Incident management is not concerned with finding the root cause of the interruption.



The goal of problem management is to prevent incidents from happening. What this means in practice is finding the root cause of incidents and fixing it, and maybe along the way fixing some others before they happen.


Let’s move on to a couple of examples.


1. Those darn log files. This came out in the first interview I did. What it showed me is that problem managers work closely with the folks working on incidents. The result of the incident investigation was that an application written in-house wasn’t checking for disk space before writing its log file. As a result, the volume ran out of space and the application stopped working. This was an easier find than many, but the problem manager put it in her monthly newsletter to developers and QA teams to prevent future problems.
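The missing guard is simple to add. Here is a minimal sketch of the kind of check the application lacked; the path and threshold are hypothetical, not the actual in-house code:

```python
import shutil

LOG_DIR = "/var/log/myapp"           # hypothetical log directory
MIN_FREE_BYTES = 50 * 1024 * 1024    # stop logging below 50 MB free

def append_log(line: str) -> None:
    """Append to the log only while the volume has headroom."""
    if shutil.disk_usage(LOG_DIR).free < MIN_FREE_BYTES:
        # Skip (or rotate) rather than fill the volume and take
        # the whole application down with it.
        return
    with open(f"{LOG_DIR}/app.log", "a") as f:
        f.write(line + "\n")
```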


2. One thing software cannot fix is hardware. You have a mission-critical application, and you end up getting hardware problems approximately every month for a few months. The problem manager, as part of his normal duties, constantly monitors incidents looking for patterns. Looking through incidents, he noticed this pattern and started investigating. In this case, after talking to a number of folks and not hitting any answer, he investigated the hardware records and found the hardware was over four years old, which exceeded their policy standard. Somehow the systems supporting this application missed their refresh. As you might guess, once the hardware was refreshed the problems stopped.
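This kind of miss is also easy to sweep for automatically. A minimal sketch, assuming asset records carry a deployment date and a four-year refresh policy; the field names and records are invented:

```python
from datetime import date, timedelta

REFRESH_POLICY = timedelta(days=4 * 365)   # four-year refresh standard

# Hypothetical asset records; a real sweep would query the asset database.
assets = [
    {"host": "app-srv-01", "deployed": date(2005, 3, 1)},
    {"host": "db-srv-07",  "deployed": date(2009, 6, 15)},
]

overdue = [a for a in assets
           if date.today() - a["deployed"] > REFRESH_POLICY]
for a in overdue:
    print(f'{a["host"]} missed its refresh (deployed {a["deployed"]})')
```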


3. Sometimes it’s what’s not happening. A global ecommerce capability stopped working at one location for eight minutes. This problem manager says he works in a subfield of epidemiology; I think of it more as forensics. This brings up a second lesson I learned: problem managers often orchestrate the process as much as they investigate themselves. After looking at a number of things that were happening, to no avail, he asked each team to look at log files for abnormally low activity. This took some convincing, but the network team found a router that was showing almost no activity relative to normal. I didn’t get all the details, but the source was that someone made a change to a routing table and eight minutes later changed it back. That change routed traffic around the site in question. The problem manager made an interesting observation: network folks often think what they do won’t affect anything, and since they rarely get feedback to the contrary, it reinforces that opinion.
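“Abnormally low” is straightforward to check for once you have a baseline. A minimal sketch that flags a metric sitting far below its historical mean; the data and threshold are illustrative:

```python
from statistics import mean, stdev

def is_abnormally_quiet(history, current, k=3.0):
    """Flag a metric sitting far below its historical baseline."""
    mu, sigma = mean(history), stdev(history)
    return current < mu - k * sigma

# Hypothetical per-minute packet counts for one router.
baseline = [980, 1010, 995, 1020, 990, 1005, 1000, 985]
print(is_abnormally_quiet(baseline, current=12))   # True: nearly silent
```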


I’ll wrap up this post with what we can do to help the problem manager. For workflow we have the HP Service Manager Problem Management module. This includes logging, categorization, prioritization, communication and progress tracking.


For investigation we have HP Problem Isolation. Its proactive module helps find anomalies in performance data that can indicate potential upcoming problems. It also brings together incident and change data to help determine the cause of the problem.


Related Items:


Evelyn Hubbert at Forrester wrote a report, “Problem Manager: A New IT Service Management Role, The Key To A Proactive IT Service Organization”.


Advanced analytics reduces downtime costs – detection


Advanced analytics reduces downtime costs – isolation



Fighting or friendly, Problem Isolation and OMi

by Michael Procopio


In the post Event Correlation OMi TBEC and Problem Isolation What's the Difference, my fellow blogger, Jon Haworth, discussed the differences between TBEC and Problem Isolation. To be consistent, I'll use the acronym PI for Problem Isolation and TBEC for OMi (Operations Manager i series) Topology Based Event Correlation.


Briefly, he mentioned that TBEC works “bottom up”, that is, starting from the infrastructure with events. PI works “top down”, that is, starting from an end user experience problem, primarily with metric (time series) data.


Jon did an excellent job describing TBEC; I’ll do my best on PI because, like Jon, I have a conscience to settle.


Problem Isolation is a tool to:


1. automate the steps a troubleshooter would go through


2. run additional tests that might uncover the problem


3. look at all metric/performance data from the end user experience monitoring and all the infrastructure it depends on


4. find the infrastructure metric that most closely matches the end user problem, using behavior learning and regression analysis techniques (developed by HP Labs)


5. bring additional data such as events, help/service desk tickets and changes to the troubleshooter


6. allow the troubleshooter to execute run books to potentially solve the problem


Potentially the biggest difference in the underlying technology is that Problem Isolation does not require any correlation rules or thresholds to be set in order to do the regression analysis that points to the problem. Like TBEC, it does require that an application be modeled in a CMDB.


An example: Presume a situation with a typical composite application - web server, application server and database. No infrastructure thresholds were violated; therefore, there are no infrastructure alerts. Again, as mentioned in the previous post, end user monitoring (EUM) is the backstop. EUM alerts on slow end user performance; now what?


Here is what Problem Isolation does:


1. determines which infrastructure elements (ITIL configuration items, or CIs) support the transaction


2. reruns the test(s) that caused the alert – this validates that it is not a transient problem


3. runs any additional tests defined for the CIs


4. collects Service Level Agreement information


5. collects all available infrastructure performance metrics (web server, application server, database server and the operating systems for each) and compares them to the EUM data using behavior and regression analysis (a toy sketch of this comparison appears after this list)



Problem Isolation screen showing the performance correlation between end user response and SQL Server database locks




6. determines and displays the most probable suspect CI and alternates


7. displays run books available for all infrastructure CIs for the PI user to run directly from the tool


8. allows the PI user to attach all the information to a service ticket, either an existing one or a newly created one
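To make step 5 concrete, here is a toy sketch of the ranking idea using plain correlation. The shipping product uses HP Labs' behavior learning and regression techniques; the metric names and numbers below are invented:

```python
import numpy as np

# End user response time over the problem window (hypothetical samples).
eum = np.array([210, 220, 480, 900, 870, 910, 300, 215], dtype=float)

# Candidate metrics from the CIs behind the transaction (also invented).
metrics = {
    "web_srv.cpu_pct": np.array([30, 31, 33, 35, 34, 33, 31, 30], dtype=float),
    "app_srv.heap_mb": np.array([512, 520, 530, 525, 540, 535, 520, 515], dtype=float),
    "db.lock_waits":   np.array([2, 3, 45, 120, 115, 122, 12, 3], dtype=float),
}

# Rank each metric by how closely it tracks the end user series.
scores = {name: abs(np.corrcoef(eum, series)[0, 1])
          for name, series in metrics.items()}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:17s} correlation={score:.2f}")
# db.lock_waits tracks the slowdown most closely -> most probable suspect CI
```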


Another key differentiator between OMi/TBEC and PI is the target user. There is such a wide variance in how organizations work that it is hard to name the role, but let me give a brief description, and I think you will be able to determine the title in your organization.


There are some folks in the organization whose job is to take a quick look (typically < 10 minutes; in one organization I interviewed, < 1 minute) at a situation and determine whether they have explicit instructions on what to do via scripts or run books. When they have no instructions for a situation, they pass it on to someone who has a bit more experience and does some free-form triage.


This person might be able to fix the problem or may have to pass it on to a subject matter expert; for example, a suspected MS Exchange problem goes to an Exchange admin. It is this second person that Problem Isolation is targeted at. It helps automate her job, reducing work that might take tens of minutes to hours and performing it in seconds. If it turns out she can’t solve the problem, it automatically provides full documentation of all the information collected. That alone might save someone five minutes of write-up.


OMi’s target is the operations bridge console user. Ops Bridge operators tend to be lower skilled and face hundreds if not thousands of events per hour. Jon described how OMi helps them work smarter.


TBEC and Problem Isolation both work to find the root cause of an incident, but in different ways. Much like a doctor might choose an MRI or a CAT scan depending on the situation, TBEC and Problem Isolation are complementary tools, each with unique capabilities.


Problem Isolation will not find problems in redundant infrastructure; OMi will. Conversely, OMi can’t help with EUM problems when no events are triggered; Problem Isolation can.


We know this can be a confusing area. We welcome your questions to help us do a better job of describing the difference. But these two are definitely friendly.


For Business Availability Center, Michael Procopio


Get the latest updates on our Twitter feed @HPITOps http://twitter.com/HPITOps


Join the HP Software group on LinkedIn and/or the Business Availability Center group on LinkedIn.


Related Items



  1. Advanced analytics reduces downtime costs – detection

  2. Advanced analytics reduces downtime costs – isolation

  3. Problem Isolation page

  4. Operations Manager i page

Advanced analytics reduces downtime costs – isolation

by Michael Procopio, Product Manager, BAC 



In the world of advanced analytics, two areas of interest to the IT management world are detection of a problem and isolation of a problem. Previously I wrote Advanced analytics reduces downtime costs – detection; in this post I’ll cover isolation.


In the previous post, I covered how advanced analytics finds an anomaly, potentially before a threshold is crossed.


Problem Isolation is the process of determining which component in the infrastructure is causing the problem* or incident* that we found. We will presume we are monitoring the service that is having the issue.


If one had no management tools (amazingly, I have spoken to customers in this situation), the method of trying to find a problem is to log in to each system, router, switch and potentially application (e.g., Oracle), look at the items with whatever tools are available (e.g., Windows Perfmon), and hope you find it. If you are interested in advanced analytics, this is probably not your situation.


The more typical case is that you have multiple management tools: network, system, virtualization, database and perhaps others. So if you know the domain the problem exists in, you have a good place to start. I’ve listened to podcasts and read reports that bring up a few problems with this (if you know of any good IT podcasts, please send them along):



  • ~80% of problems are sent to the network team with only ~20% being network issues

  • ~60% of problems take >10 experts to resolve

  • ~80% of the time to restore service is spent isolating the problem


Here is an analogy I use with my non-IT friends to explain why this area is needed. You are monitoring the speed of a car going across the country (pick your favorite country). You are separately monitoring all of the infrastructure:



  • roads

  • bridges

  • ferries to take cars across the water


What you don’t know is where the car is (old car, no GPS). You are getting many alerts from the roads, bridges and ferries. Which one is affecting the car? Since you don’t know what road the car is on, you don’t know whether any given alert is the one affecting your car.


This is where the CMDB comes into the isolation process. The CMDB has the route the car is taking or, in our case, the items in the IT infrastructure that make up the service that has the problem.
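In code terms, this CMDB lookup is just resolving a service to the set of CIs beneath it. A minimal sketch with an invented topology fragment:

```python
# Hypothetical CMDB fragment: each service maps to the CIs it depends on.
cmdb = {
    "checkout-service": ["web-01", "app-02", "db-03", "router-ny-1"],
    "search-service":   ["web-01", "app-04", "db-05"],
}

def supporting_cis(service: str) -> list[str]:
    """The 'route the car takes': the CIs behind one service."""
    return cmdb.get(service, [])

# Only alerts raised by these CIs can be the one affecting our 'car'.
print(supporting_cis("checkout-service"))
```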


Part one of the isolation process is to restrict what we are looking at to the relevant IT items. This greatly reduces the computational power required. For example, one customer I recently visited told me he has 2000+ servers. If we can reduce that to a few app servers and a few database servers (isn’t SOA wonderful for us operations types), that is a factor of ~200 reduction.


Part two of the isolation is the heavy math from HP Labs, with more patent filings. It is a form of regression analysis, where application or end user response time monitoring is the dependent variable and all the infrastructure metrics are independent variables. In plain terms: if end user response gets worse, find the infrastructure metrics that get worse; when end user response gets better, find the metrics that get better. The more closely an infrastructure metric tracks the end user response, the more likely it is to be the cause.
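A minimal sketch of that framing: regress end user response (the dependent variable) on one infrastructure metric (an independent variable) at a time and score the fit. This is plain least squares with numpy and invented data; the production analysis from HP Labs is considerably richer:

```python
import numpy as np

def fit_score(eum: np.ndarray, metric: np.ndarray) -> float:
    """R^2 of a one-variable linear regression: metric -> EUM response."""
    slope, intercept = np.polyfit(metric, eum, 1)
    residuals = eum - (slope * metric + intercept)
    return 1.0 - np.sum(residuals**2) / np.sum((eum - eum.mean())**2)

# Invented samples: end user response degrades as database lock waits climb.
eum      = np.array([1.0, 1.1, 3.0, 5.2, 5.0, 1.2])
db_locks = np.array([2.0, 3.0, 40.0, 80.0, 78.0, 4.0])
print(f"R^2 = {fit_score(eum, db_locks):.2f}")  # near 1.0 -> strong suspect
```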


Again, while the math is interesting, pictures work better for me.


[Graph: end user response overlaid with the most closely correlated infrastructure metric]


The thick grey line is the end user response; the red-purple line is the most closely correlated metric, in this case a database metric. Just so you don’t have to strain your eyes, we provide a table like this (from a different problem) showing the weighted correlation scores.


[Table: weighted correlation scores]


Isolation part 3 is to include non-time-series data. In the screen capture below you see planned changes and incident details (think alerts) on the timeline. Unplanned changes can also be displayed. Changes are pulled from the CMDB, and incidents can come from any management system that can send alerts. Since we know that most problems result from changes, that is an important component. Finally, tickets from the help desk are included on the timeline, for the case where users are doing the monitoring.


[Screen capture: timeline showing planned changes, incidents and help desk tickets]
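Overlaying the non-time-series data is essentially a window join: keep only the changes, incidents and tickets that fall in (or shortly before) the problem window. A minimal sketch with invented change records:

```python
from datetime import datetime, timedelta

problem_start = datetime(2010, 5, 4, 9, 12)   # hypothetical problem window
problem_end   = datetime(2010, 5, 4, 9, 40)
lookback      = timedelta(minutes=30)          # changes often precede symptoms

# Invented change records; real ones come from the CMDB.
changes = [
    {"id": "CHG-101", "when": datetime(2010, 5, 4, 9, 5),  "ci": "db-03"},
    {"id": "CHG-102", "when": datetime(2010, 5, 3, 22, 0), "ci": "web-01"},
]

on_timeline = [c for c in changes
               if problem_start - lookback <= c["when"] <= problem_end]
print(on_timeline)  # CHG-101 lands on the timeline; CHG-102 does not
```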


Altogether, this automates a number of things the operations teams already do and adds some math to help isolate problems.




*Incident and problem are ITIL terms. There may be many incidents that are symptoms of an underlying problem.



Related Items



Since I asked for podcasts, here are some I listen to:

