Posted by Cheyne | Posted in Articles, Dashboards | Posted on 11-03-2011
Often our monitoring systems cause so much noise it’s hard to spot any trends in events. We could be witnessing the gradual downfall of a critical system but we’re too busy closing off alerts or deleting the notification emails to see the pattern of events and put the pieces together.
Service Modelling and traditional threshold-style monitoring are great for spotting immediate problems and discovering the depth of impact from an event, but they don’t help you see trends in activity.
Consider the following scenario. A 24/7 data center team rotates its roster nightly, so there’s always somebody watching the IT Dashboard, but it’s generally someone different from day to day. On Monday night a relatively low impact service fails. The monitoring picks it up and alerts the night worker, who responds, restarts the service, and it’s off and working again.
On Tuesday night a different night worker receives a notification that a low priority service has stopped. He knows this service isn’t critical, responds in the next 20 minutes or so, then closes the ticket. And so on.
These low priority incidents are handled, closed, then forgotten about. They are, after all, just low priority incidents.
What is being missed here is that each night this service is failing at exactly the same time. Maybe this service isn’t important after hours, but during the day it may be critical. These nightly outages could be an indicator that the service is operating at its maximum capacity. It may just be the added resource consumption of a backup job that’s tipping it over the edge, so what does that say about this service’s future capacity?
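As a rough illustration of spotting that kind of pattern, here is a minimal Python sketch. It assumes your closed incidents can be exported as (service, timestamp) pairs; the function name, the service names and the 30-minute slot size are all made up for the example.

```python
from collections import defaultdict
from datetime import datetime


def recurring_failure_slots(events, min_days=3, slot_minutes=30):
    """Find services that fail in the same time-of-day slot on several days.

    events: iterable of (service_name, datetime) pairs (hypothetical export).
    Returns {(service, "HH:MM" slot start): sorted list of dates}.
    """
    days_by_slot = defaultdict(set)
    for service, timestamp in events:
        minutes = timestamp.hour * 60 + timestamp.minute
        slot_start = (minutes // slot_minutes) * slot_minutes
        label = f"{slot_start // 60:02d}:{slot_start % 60:02d}"
        days_by_slot[(service, label)].add(timestamp.date())
    # Keep only the slots that were hit on enough distinct days.
    return {
        key: sorted(days)
        for key, days in days_by_slot.items()
        if len(days) >= min_days
    }


# Made-up sample data: one service failing shortly after 02:00 three nights running.
events = [
    ("print-spooler", datetime(2011, 3, 7, 2, 4)),
    ("print-spooler", datetime(2011, 3, 8, 2, 11)),
    ("print-spooler", datetime(2011, 3, 9, 2, 7)),
    ("mail-relay", datetime(2011, 3, 8, 14, 30)),
]
recurring = recurring_failure_slots(events)
```

Fixed slots are crude: failures at 01:59 and 02:01 land in different slots, so a real report would probably want overlapping windows. It is still enough to surface the "same time every night" case that nobody on a rotating roster gets to see.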
Now consider how many events your monitoring system receives each day, how many of those are alerts, and how many of those alerts are deemed “low priority”. With appropriate event trending, we can look back at these events and spot trends before they turn into something bigger. With more advanced trending we can combine multiple data sources to see how events such as standard change procedures might be contributing to minor incidents.
If we combine our incident and change records with our monitoring data and display these records on an event timeline, we get a very clear overall view of what’s happening in our environment. We can match up events that are automatically detected by our monitoring systems to manually entered records from our service management software.
A typical example of this would be a change record with zero impact to business stated in its details, yet our event timeline shows several minor incidents occurring immediately after the change window.
Yes, the change may not have directly impacted any services, but it may have caused a disk threshold to breach due to additional logging, or an increase in CPU consumption.
Without this trend analysis, these relatively low priority events may have just been passed off as day to day “BAU” problems, rather than showing a distinct relationship to the change record.
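The timeline makes this relationship visible by eye, but the same idea can be sketched in a few lines of Python. This is only an illustration: the record shapes, the IDs and the two-hour look-ahead window are assumptions, not something taken from any particular service management tool.

```python
from datetime import datetime, timedelta


def incidents_after_changes(changes, incidents, window=timedelta(hours=2)):
    """Map each change ID to incidents opened during or shortly after it.

    changes:   iterable of (change_id, window_start, window_end)
    incidents: iterable of (incident_id, opened_at)
    window:    how long after the change window still counts as "related"
    """
    related = {}
    for change_id, change_start, change_end in changes:
        hits = [
            incident_id
            for incident_id, opened_at in incidents
            if change_start <= opened_at <= change_end + window
        ]
        if hits:
            related[change_id] = hits
    return related


# Made-up records: a "zero impact" change followed by two minor incidents.
changes = [
    ("CHG-101", datetime(2011, 3, 8, 22, 0), datetime(2011, 3, 8, 23, 0)),
]
incidents = [
    ("INC-1", datetime(2011, 3, 8, 23, 20)),  # disk threshold breach
    ("INC-2", datetime(2011, 3, 9, 0, 45)),   # CPU alert
    ("INC-3", datetime(2011, 3, 9, 9, 0)),    # unrelated, next morning
]
related = incidents_after_changes(changes, incidents)
```

An incident that follows a change is not proof the change caused it. The point is just to flag the pairing so that someone looks at it.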
So, Let’s Get Trending
There are a few options out there for plotting events. Most of them are hard to find because they aren’t really seen as “IT Tools”. My favorite, and one I have implemented myself many times in my own monitoring dashboards, is the Simile Timeline.
Simile Timeline - http://www.simile-widgets.org/timeline/
On their website, the Simile Timeline is shown plotting historical events: the example plots the Kennedy assassination and the surrounding events relating to the shooting.
Navigating the events is simple. Drag the top band for fine time changes or the bottom band for larger ones, and click the dot icons on the timeline to view the event details.
It doesn’t take you long to look at this example timeline and realize the potential this widget has. Consider my previous example and imagine instead of these historical points in time, that these points were instead showing alerts from your monitoring system, and on top of that change and incident records from your service management software.
Imagine scrolling back in time and seeing trends of events occurring over the past month. You see direct relationships between change records and low priority incidents.
You see multiple low priority incidents in quick succession before a high priority incident occurs. Clearly a trend.
This is how data visualization should be!
This is how I have implemented this widget in the past, by combining multiple data sources and plotting the key events on the timeline.
Below are some screenshots from my previous timeline work.
In the above screen captures, you can see a fusion of change records, incidents and detected issues (disk capacity usage etc). In the first screenshot you can clearly see a series of service failures and disk capacity issues, immediately followed by 2 high priority incidents. What this does is get you thinking… “how are these related?”
By using different visual cues we can create useful information that can be deciphered by simply glancing over the icons, without needing to read them. In the timelines above, I have done the following.
- Bright red dot icons indicate OPEN very high priority incidents (P1 and P2).
- Dull red dot icons indicate CLOSED very high priority incidents (P1 and P2).
- Green dots indicate low priority incidents and monitoring events: standard disk thresholds, P3 events etc.
- Dull purple dots indicate very low priority events and incidents: P4 incidents and general monitoring alerts.
- The C icons indicate change records. The lines underneath the change records indicate the change window length.
- The lines underneath all the dots indicate the length of the event.
- The red marks on the bottom band indicate events. A large number of events shows up as a bigger red block, which is easy to spot from the overview.
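The colour scheme above boils down to a small lookup from record type, priority and status to a marker. A minimal Python sketch of that mapping follows. The icon file names and hex colours are placeholders I have invented for the example, not the ones from my dashboard.

```python
def marker_style(kind, priority=None, is_open=False):
    """Pick an icon and colour for a timeline record.

    kind:     "change", "incident" or "monitoring"
    priority: 1-4 for incidents, None for records without a priority
    is_open:  True while the incident is still open
    """
    if kind == "change":
        # Change records get the "C" icon rather than a dot.
        return {"icon": "change-c.png", "color": "#000000"}
    if priority in (1, 2):
        # Very high priority: bright red while open, dull red once closed.
        if is_open:
            return {"icon": "dot-bright-red.png", "color": "#ff0000"}
        return {"icon": "dot-dull-red.png", "color": "#993333"}
    if priority == 3:
        # Low priority incidents and standard threshold events.
        return {"icon": "dot-green.png", "color": "#33aa33"}
    # P4 incidents and general monitoring alerts.
    return {"icon": "dot-dull-purple.png", "color": "#886699"}


open_p1 = marker_style("incident", priority=1, is_open=True)
closed_p2 = marker_style("incident", priority=2, is_open=False)
disk_alert = marker_style("monitoring", priority=3)
```

Keeping this in one function means every data source feeding the timeline is styled the same way, which is what lets you read the display at a glance.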
So you can see how this relatively small block is able to visually represent events from multiple systems in a clearly readable format. All of the dots on screen can be clicked to show the specific event information (like in the 2nd screenshot).
Almost anyone could read this timeline and understand what has been happening. It correlates the important pieces of information into an easily digestible visual display and helps us understand the relationships between events.
So, How Do We Implement It
This all depends on your systems and what you wish to display. I can’t give you a piece of code to answer this for you: my implementation was built inside a Compuware Vantage dashboard, so it wouldn’t be useful to anyone else.
I can, however, give you the gist of how it’s done. The timeline tool reads event information in XML format (I believe it also does JSON, but I’ll be discussing XML).
The XML format basically maps out event elements, with times, dates, titles and description etc. You need to generate this XML file, and if you wish to include live data (such as from your monitoring system) you’ll want this XML file to be generated dynamically.
The easiest method is to use a server side scripting language like ASP, PHP, Ruby, .Net etc and to retrieve the data from the database, then output as XML. Simply point the Simile timeline to your XML generating script and off you go.
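To give a feel for the output side, here is a small generic sketch in Python using only the standard library. It is not my Vantage implementation. The element and attribute names (`data`, `event`, `start`, `end`, `title`, `isDuration`, `date-time-format`) are written as I recall them from the Simile examples, so treat them as assumptions and check the Simile documentation. The event dictionaries stand in for rows from your own database.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def timeline_xml(events):
    """Build a Simile-style XML event document from a list of dicts.

    Each event dict needs "start" and "title"; "end" and "description"
    are optional. Attribute names are assumed from the Simile examples.
    """
    data = ET.Element("data", {"date-time-format": "iso8601"})
    for event in events:
        attrs = {
            "start": event["start"].isoformat(),
            "title": event["title"],
        }
        if event.get("end") is not None:
            # Events with an end time are drawn as a line, not a single dot.
            attrs["end"] = event["end"].isoformat()
            attrs["isDuration"] = "true"
        node = ET.SubElement(data, "event", attrs)
        node.text = event.get("description", "")
    # ElementTree escapes special characters in titles and descriptions.
    return ET.tostring(data, encoding="unicode")


# Made-up rows standing in for a query against your monitoring database.
rows = [
    {
        "title": "CHG-101 patch window",
        "start": datetime(2011, 3, 8, 22, 0, tzinfo=timezone.utc),
        "end": datetime(2011, 3, 8, 23, 0, tzinfo=timezone.utc),
        "description": "Change record, stated impact: none",
    },
    {
        "title": "Disk threshold breached on /var & /tmp",
        "start": datetime(2011, 3, 8, 23, 20, tzinfo=timezone.utc),
    },
]
xml_text = timeline_xml(rows)
```

In a live setup the script would run this on each request and return `xml_text` with an XML content type, so the timeline always loads current data.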
Obviously there is a little bit of work involved, but it’s well worth it. I’m not going to try and explain how to do it, as there is plenty of information on the Simile website about how to implement the timeline.
With a little bit of work, you can have a very powerful tool leveraging your existing data.