Infrastructure Management Software Blog

How to frustrate end users

 


I'm currently on the receiving end of some pretty dismal service quality from my broadband (ADSL) provider. I work from home much of the time so when my broadband is unavailable I really feel unproductive and disconnected. 


 


The issues have been ongoing for a while and relate to some problems with the equipment in the local exchange.


 


I'm not going to delve into great detail or "name and shame" but I did want to take the opportunity to take a step back and use my unhappy experience as a learning opportunity. I think there are a couple of lessons in the interactions that I've had with my ISP which provide great guidance on things to avoid if you don't want to frustrate your end-users as you strive to deliver IT services to your business and your customers.


 


The first thing to avoid is a lack of transparency across the teams who are involved in service delivery. When I call the technical help line to report a red light on my ADSL Router I expect someone to KNOW that there is a problem - and to be able to explain what is being done.


 


Often this has not been the case.


 


In many cases the help desk don't know that there is an issue. This frustrates me (I'm being used as the monitoring device) but also wastes a lot of time as they tend to follow a process of standard tests before they discover that there is a known problem at the broadband exchange.


 


The bottom line is that if your infrastructure monitoring can detect a service-impacting issue then that information should be shared with all the folks who can make use of it. You need to use the information to update your help desk / service management systems so that the folks who are front and center talking to customers appear informed and can provide the customer with reassurance that the IT organization has its act together.


 


I'm not suggesting that every infrastructure event needs to be visible to the help desk - but if it's got a high probability of affecting service delivery then it has value. And whatever is shared has GOT to be accurate. And the information needs to be updated with current status and estimated fix times so this can be relayed to customers to provide some reassurance that the technicians are dealing with the issue and some expectations on when service will be resumed.
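To make that filtering idea concrete, here is a minimal sketch in Python. It is not how Operations Manager or Service Manager implement their interfaces; the `InfraEvent` fields, the `service_impact_probability` score and the 0.7 threshold are all assumptions made up for illustration. It only shows the principle: forward the events that are likely to affect service, and always carry a current status and an estimated fix time with them.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class InfraEvent:
    """A monitoring event (hypothetical structure, for illustration only)."""
    source: str
    description: str
    service_impact_probability: float  # assumed score between 0.0 and 1.0
    status: str
    estimated_fix: Optional[datetime]


IMPACT_THRESHOLD = 0.7  # assumed cut-off; tune to your own environment


def helpdesk_notices(events: List[InfraEvent],
                     threshold: float = IMPACT_THRESHOLD) -> List[str]:
    """Return one help desk notice per event likely to affect service."""
    notices = []
    for ev in events:
        if ev.service_impact_probability < threshold:
            continue  # low-impact infrastructure noise stays with operations
        if ev.estimated_fix is not None:
            eta = ev.estimated_fix.strftime("%Y-%m-%d %H:%M")
        else:
            eta = "not yet known"
        notices.append(
            f"{ev.source}: {ev.description} | status: {ev.status} "
            f"| estimated fix: {eta}"
        )
    return notices
```

With something like this in place, the help desk agent who takes my call would see a notice for the exchange fault, with its status and fix time, and would not see a fan speed warning on an unrelated server.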


 


We provide interfaces for this stuff for our own monitoring solutions, such as Operations Manager or BAC, into HP Service Manager (and some third-party help desk packages) because we believe it's a vital part of how an incident management process connects to service management activities. This is essential stuff that you need to be able to do as part of what we describe as a Closed Loop Incident Process (CLIP).


 


The second 'thing' that drove me nuts happened mid-afternoon the day before yesterday. I called up because the broadband was down again and the response from the help desk was that the exchange was "having an upgrade".


 


Now two possibilities spring to mind here.


The first (driven by disbelief) is that this is incorrect information - either the status in the fault record is wrong (read earlier comments about sharing accurate information) or the help desk person is trying to fob me off - and that's another big no-no if you want happy customers.


 


The second possibility (also driven by disbelief) is that someone needs to take an ITIL class and understand the basics of configuration and change management as they relate to Service Delivery / Service Management. Taking a broadband exchange offline for 4 hours on a weekday afternoon to perform an upgrade appears, on the face of it, to be a little ill-considered.


 


If you want to understand more about how HP can help with Change and Configuration management then take a look at our Service Manager product. To be able to plan change effectively you need good, up-to-date configuration information regarding the CIs (Configuration Items) and how they relate to each other and support IT Services. That's not something you can maintain manually - at least not cost effectively - so some automated discovery is essential - and we can help there with our discovery (DDM) and UCMDB technologies.

Event Correlation: OMi TBEC and Problem Isolation - What's the difference (part 2 of 3)

If you have not done so already, you may want to start with part 1 in this series.
http://www.communities.hp.com/online/blogs/managementsoftware/archive/2009/09/25/event-correlation-omi-tbec-and-problem-isolation-what-s-the-difference-part-1-of-3.aspx


This is part 2 of 3 of my discussion of the event correlation technologies within OMi Topology Based Event Correlation (TBEC) and Problem Isolation. I'm going to focus on talking about how TBEC is used and how it helps IT Operations Management staff be more effective and efficient. My colleague Michael Procopio has discussed PI in more detail over in the BAC blog here: PI and OMi TBEC blog post 


If you think about an Operations Bridge (or "NOC"… but I've blogged my opinion of that term previously) then fundamentally its purpose is very simple.


 


The Ops Bridge is tasked with monitoring the IT Infrastructure (network, servers, applications, storage etc.) for events and resource exceptions which indicate a potential or actual threat to the delivery of the business services which rely on the IT infrastructure. The goal is to fix issues as quickly as possible in order to reduce the occurrence or duration of business service issues.


 


Event detection is an ongoing process, and the Ops Bridge will monitor events during all production periods, often 24x7 using shift-based teams.


 


Event monitoring is an inexact discipline. In many cases a single incident in the infrastructure will result in numerous events, only one of which actually relates to the cause of the incident. The other events are just symptoms.


 


The challenge for the Ops Bridge staff is to determine which events they need to investigate and to avoid chasing the symptom events. The operations team must prioritize their activities so that they invest their finite resources in dealing with causal events based on their potential business impact, and avoid wasting time in duplication of effort (chasing symptoms) or, even worse, in chasing symptoms down in a serial fashion before they finally investigate the actual causal event, as this increases the risk of extended downtime for business services.


 


TBEC helps the Operations Bridge in addressing these challenges. TBEC works 24x7, examining the event stream, relating it to the monitored infrastructure and the automatically discovered dependencies between the monitored components. TBEC works to provide a clear indication that specific events are related to each other (related to a single incident) and to identify which event is the causal event and which are symptoms.


 


Consider a disk free space issue on a SAN which is hosting an Oracle database. With comprehensive event monitoring in place, this will result in three events:



  • a disk space resource utilization alert

  • quickly followed by an Oracle database application error

  • and a further event which indicates that a WebSphere server which uses the Oracle database is unhappy


 


Separately, all three events seem ‘important’ – so considerable time could be wasted in duplicate effort as the Ops Bridge tries to investigate all three events. Even worse, with limited resources, it is quite possible that the Operations staff will chase the events ‘top down’ (serially) – look at WebSphere first, then Oracle, and finally the SAN – which extends the time to rectification and increases the duration (or likelihood) of a business outage.


 


TBEC will clearly show that the event indicating the disk space issue on the SAN is the causal event – and the other two events are symptoms.
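The principle behind that result can be sketched in a few lines of Python. To be clear, this is a simplified illustration and not how TBEC is actually implemented: the `depends_on` map stands in for the automatically discovered topology, the component names are made up, and the rule used (an event is a symptom if something its component depends on also has an event) is an assumption chosen to keep the example short.

```python
from typing import Dict, List, Set


def transitive_dependencies(component: str,
                            depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return everything a component relies on, directly or indirectly."""
    seen: Set[str] = set()
    stack = list(depends_on.get(component, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(depends_on.get(dep, []))
    return seen


def classify_events(event_components: List[str],
                    depends_on: Dict[str, List[str]]) -> Dict[str, str]:
    """Label each component with an event as 'cause' or 'symptom'."""
    affected = set(event_components)
    result = {}
    for comp in event_components:
        # If anything this component relies on is also reporting an event,
        # treat this component's event as a symptom of that deeper problem.
        upstream = transitive_dependencies(comp, depends_on) & affected
        result[comp] = "symptom" if upstream else "cause"
    return result


# Hypothetical topology for the example above.
depends_on = {"websphere": ["oracle"], "oracle": ["san"], "san": []}
events = ["websphere", "oracle", "san"]

print(classify_events(events, depends_on))
# {'websphere': 'symptom', 'oracle': 'symptom', 'san': 'cause'}
```

The point of the sketch is that the answer comes from the dependency information, not from the events themselves. Without the topology, all three events look equally important.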


 


In a perfect world the Ops Bridge can monitor everything, detect every possible event or compromised resource that might impact a business service and fix everything before a business service impact occurs.


 


The introduction of increasingly redundant and flexible infrastructure helps with this – redundant networks, clustered servers, RAID disk arrays, load balanced web servers etc. But it can also add complications, which I’ll illustrate later.


 


One of the challenges of event monitoring is that it simply can NOT detect everything that can impact business service delivery. For example, think about a complex business transaction, which traverses many components in the IT infrastructure. Monitoring of each of the components involved may indicate that they are heavily utilized – but not loaded to the point where an alert is generated.


 


However, the composite effect on the end-to-end response time of the business transaction may be such that response time is simply unacceptable. For a web-based ordering system where customers connect to a company’s infrastructure and place orders for products this can mean the difference between getting orders or the customer heading over to a competitor’s web site.
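A small worked example shows how this happens. The sketch below is in Python and every number in it is invented for illustration: four hypothetical components, each with its own alert threshold, and an assumed end-to-end target of 4 seconds. It also assumes the transaction visits the components one after another, so their response times simply add up.

```python
from typing import Dict, List


def component_alerts(latencies_ms: Dict[str, int],
                     thresholds_ms: Dict[str, int]) -> List[str]:
    """Return the components whose latency exceeds their own threshold."""
    return [name for name, ms in latencies_ms.items()
            if ms > thresholds_ms[name]]


def end_to_end_breach(latencies_ms: Dict[str, int], target_ms: int) -> bool:
    """True if the summed latency exceeds the end-to-end target."""
    return sum(latencies_ms.values()) > target_ms


# Made-up figures: every component is busy but under its alert threshold.
latencies = {"web": 900, "app": 1400, "db": 1800, "network": 400}
thresholds = {"web": 1000, "app": 1500, "db": 2000, "network": 500}
target = 4000  # assumed acceptable end-to-end response time in ms

print(component_alerts(latencies, thresholds))  # []
print(end_to_end_breach(latencies, target))     # True
```

No single component raises an alert, yet the customer waits 4.5 seconds against a 4 second target. Component-level event monitoring alone would report nothing wrong here.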


 


This is why End User Monitoring technologies are important. I'll talk about EUM in the next, and final, edition of this blog serial.




Read part 3 in the series.
http://www.communities.hp.com/online/blogs/managementsoftware/archive/2009/09/25/event-correlation-omi-tbec-and-problem-isolation-what-s-the-difference-part-3-of-3.aspx



For HP Operations Center,  Jon Haworth.

Innovation Week Part 6 - End-User Monitoring

When I hear about innovation, I think about a breakthrough product or a significant update in functionality. In marketing parlance, this means a new version number 8.x to 9.x, or maybe a major “dot” release 9.0 to 9.1. So, when my colleague Amy Feldman in Business Availability Center told me about big innovations in Real User Monitoring (RUM) version 8.02, I was obviously skeptical.


First, how does RUM relate to managing IT infrastructure? The VP of Operations is tasked with providing users an optimal level of performance and availability for the business services they consume. So, assessing these metrics from the end-user standpoint is critical for maintaining service level objectives.


HP offers two ways to do this: real user monitoring and synthetic user monitoring. The former monitors actual user performance while the latter simulates user sessions from different geographies. Pulling alerts from these tools into your event console gives you early warning of end-user experience issues. In an era of cost reduction, this shifts monitoring to relatively inexpensive automated systems from the most expensive monitoring tools known - your customers calling to report problems or “fixing” their negative experience by taking their business elsewhere.


So, what is new and exciting in RUM 8.02?



  • The HP RUM probe is now supported on Windows and Linux Redhat5 64-bit. 

  • RUM now supports detailed transaction analysis for 7 different protocols (MsSQL, LDAP, MySQL, IMAP, POP3, SMTP and FTP), giving increased insight into non-web-based applications.

  • Supports larger SiteScope deployments – increased monitor capacity per server (up to 16,000 monitors), simplified mass template deployment, easier way to manage SSL certificates, and other usability enhancements.

  • Expanded SiteScope monitor coverage – new platform support (NonStop, SolarisZones), added versions and new solution templates allow you to expand SiteScope monitoring into new environments.

  • Expanded support for Business Process Monitor (BPM) – now supports VuGen 9.5 and Windows Vista.

  • BAC dashboard integrations
    1. Netuitive integration: enables integrating Netuitive predictive alerts into the BAC dashboard
    2. iPhone integration: enables viewing real-time status of applications and business services, via dashboard, directly on the iPhone


RUM screen


Not bad for a “minor” version.


For Operations Center, Peter Spielvogel.


 

The opinions expressed above are the personal opinions of the authors, not of HP. By using this site, you accept the Terms of Use and Rules of Participation