Being able to put a nice round percentage figure on the availability of your critical services is very attractive for management and for overall SLA management.
Having all the latest and greatest monitoring tools and technology is great for keeping the lights on, and assuring uptime, but all this technology mumbo jumbo doesn’t translate into a business compatible language.
Often we have terabytes of data, thousands and thousands of active alerts and performance trending metrics, although we really only need a select few key pieces of information to keep the business happy, the main one being availability.
Availability can be tricky to obtain, and although the formula for actually calculating the figure isn’t a difficult one, the source data and its reliability is where the complexities lie.
All too often you see software companies plugging their products, claiming to provide “Complete Availability Management” functionality and “ITIL Compliance”, usually implying that by implementing their product you will be able to automatically produce business- and executive-ready reports detailing service availability, SLA target information and statistics. More often than not, it’s accompanied by a very slick sales pitch, an overly simplified product video and expensive stock photographs with clichéd pictures of businessmen in suits kicking back with their feet on the table reading a newspaper, implying that the mere implementation of their product is going to make your bosses happy and your life easier.
We’ve all seen it; however, the smarter ones out there (and chances are if you’re reading this, that’s you) know better.
It’s not completely false. All these slick products will do some form of availability reporting; however, it’s not the sort of reporting you’ll be handing over to your CEO anytime soon, at least not without a bit of manipulation.
You see, not all availability reporting is the same. The common misconception is that the IT Department is able to implement its complex monitoring platforms (SCOM, Patrol, BAC, Nagios, Emite Analytics, Vantage etc) and provide a single report at the end of the month for all areas of the company to digest.
It’s not quite that simple. Different datasets should be used to produce availability reports for different areas of the organization. What the business wants to see is generally not what IT wants to see.
IT Availability vs Business Availability
There is no single measure for availability, or at least not one that will translate from IT to Business. The formula for calculating these two availability figures is the same; however, the source data will vary. IT Availability and Business Availability are two terms I have coined myself in past job roles. They provide more relevant statistics tailored for both the business and IT sides of an organization. Let me explain.
Business Availability is the traditional availability; this is what most people think of when they ask for availability figures. Business Availability involves calculating the time a service has been down against a specified time period to produce a single percentage, typically to 2 decimal places, e.g. 99.95%.
This figure does not take into account periods at which the service was degraded, or if the service was in breach of a threshold before failure, it is simply an indicator of the overall uptime for a service, regardless of its state or health.
This is generally the figure the business is interested in, and it is the figure that Service Level Agreements are written against. It is common to see SLA contracts stating 99.5% for Business Availability, which is an allowance of around 3.6 hours (216 minutes) of downtime in a 30-day month.
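As a quick sanity check on what a given SLA target actually permits, the allowance can be worked out directly. This is only a sketch; the function name and the 30-day default are assumptions made for illustration.

```python
def allowed_downtime_minutes(sla_percent: float, days_in_month: int = 30) -> float:
    """Minutes of downtime a monthly availability target permits."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (100.0 - sla_percent) / 100.0


# A 99.5% target over a 30-day month (43,200 minutes)
print(allowed_downtime_minutes(99.5))  # 216.0
```

Shorter or longer months shift the allowance slightly, which is worth remembering when comparing one month's figure against another.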
Business Availability can generally be calculated without the use of infrastructure monitoring tools, and more commonly calculated using Service Management metrics, such as high priority incident records.
IT Availability is a more proactive availability figure. It is more commonly generated from infrastructure monitoring tools, and takes into account degraded service quality and metrics that regularly cross thresholds, even when there is no full service outage.
Let me explain. Consider a service that is made up of 1 server (we’ll use 1 for the sake of simplicity). For this server you have some base level monitoring for logical disks, free memory, CPU usage and page file usage. You have thresholds configured for these base metrics. These may be:
- Logical Disk: 80% – 94% = Warning | 95% – 100% = Red Alert
- Total CPU Usage: 0% – 98% = OK | 99% – 100% = Red Alert
- Memory Free Space: 400MB and under = Red Alert
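The example thresholds above could be expressed along these lines. This is a minimal sketch: the function names and state labels are hypothetical, values below the warning band are assumed to be OK, and the gap between 94% and 95% disk usage is treated as a warning.

```python
def disk_state(used_percent: float) -> str:
    """Logical Disk: 80% to 94% = warning, 95% to 100% = red alert."""
    if used_percent >= 95:
        return "red"
    if used_percent >= 80:
        return "warning"
    return "ok"  # assumption: anything under 80% is healthy


def cpu_state(used_percent: float) -> str:
    """Total CPU Usage: 0% to 98% = ok, 99% to 100% = red alert."""
    return "red" if used_percent >= 99 else "ok"


def memory_state(free_mb: float) -> str:
    """Memory Free Space: 400MB and under = red alert."""
    return "red" if free_mb <= 400 else "ok"


print(disk_state(96), cpu_state(50), memory_state(2048))  # red ok ok
```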
Now consider the following model.
This is a typical model found in your infrastructure monitoring systems. The metrics are related to the server, the server is related to the business or IT service, and the service will most likely be related up again to the corresponding business unit or function etc.
Here you can see the CPU Usage counter has crossed a threshold. This triggers an alert which flows up the tree to the server, and further up again to the business service. (This model isn’t taking into account any specific weightings, and is working on a ‘worst child’ style relationship, where a parent takes the state of its worst child.)
The monitoring software will begin recording an outage against the business service. Being a CPU Usage alert, it most likely means service quality has been degraded, but the service is still operational.
Chances are that users are experiencing slower response times. However, as the service is not down, there may be no incidents recorded against the service, or at least none with any recorded outages (P3 or P4 maybe). Hence, if this were a Business Availability report, the service would still be 100% available. This is where the term “IT Availability” comes in.
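The ‘worst child’ relationship can be sketched roughly as follows. The class, the state names and the severity ordering are assumptions for illustration, not the data model of any particular monitoring product.

```python
from dataclasses import dataclass, field
from typing import List

# Assumed ordering: a higher number is a worse state
SEVERITY = {"ok": 0, "warning": 1, "red": 2}


@dataclass
class Node:
    name: str
    state: str = "ok"  # the node's own state; mainly meaningful for leaf metrics
    children: List["Node"] = field(default_factory=list)

    def rolled_up_state(self) -> str:
        """Return the worst state among this node and all of its descendants."""
        states = [self.state] + [child.rolled_up_state() for child in self.children]
        return max(states, key=lambda s: SEVERITY[s])


# Metrics -> server -> business service, with the CPU counter in breach
cpu = Node("Total CPU Usage", state="red")
disk = Node("Logical Disk")
memory = Node("Memory Free Space")
server = Node("Server 1", children=[disk, cpu, memory])
service = Node("Business Service", children=[server])

print(service.rolled_up_state())  # red
```

A weighted model would replace the `max` with some scoring rule, which is exactly the kind of tuning the simple worst-child approach leaves out.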
IT Availability is impacted from this point as the monitoring has detected there is an issue and has flagged an alert, the alert has then flowed up the tree to the business service object.
The monitoring software would have begun recording downtime from the moment the business service went red. The IT dept will get the alerts regarding the CPU usage and respond accordingly.
You can imagine that, with a troublesome server, this may happen several times before a resolution can be put in place, such as new hardware or load balancing. It may take days before the server is restored to a normal state, and each breach triggers an outage on the parent business service.
Now, how do you think that is going to impact your business agreed SLA of 99.5% availability?
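To get a feel for the answer, here is a rough sketch with made-up numbers: six 50-minute CPU breaches in a 30-day month, each recorded by the monitoring tool as downtime against the service. The helper name and the sample data are hypothetical. Overlapping alert windows are merged so that a period where several metrics are red at once is only counted once.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Interval = Tuple[datetime, datetime]


def merged_downtime_minutes(alerts: List[Interval]) -> float:
    """Total minutes covered by at least one alert window (overlaps counted once)."""
    total = timedelta()
    current = None  # (start, end) of the red period currently being built
    for start, end in sorted(alerts):
        if current is None:
            current = (start, end)
        elif start <= current[1]:
            # Overlaps or touches the current red period: extend it
            current = (current[0], max(current[1], end))
        else:
            total += current[1] - current[0]
            current = (start, end)
    if current is not None:
        total += current[1] - current[0]
    return total.total_seconds() / 60


# Hypothetical data: a 50-minute CPU breach on each of the first six days of April
breaches = [
    (datetime(2024, 4, day, 9, 0), datetime(2024, 4, day, 9, 50))
    for day in range(1, 7)
]

it_downtime = merged_downtime_minutes(breaches)
it_availability = 100.0 - (it_downtime / (30 * 24 * 60)) * 100.0

print(it_downtime)                # 300.0
print(round(it_availability, 2))  # 99.31
print(it_availability < 99.5)     # True: the 99.5% target is missed
```

With these assumed figures, the target is missed without a single real outage having occurred.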
This is why it’s a bad idea to use infrastructure monitoring availability statistics with service level agreements or business availability reports. Often people will reconcile all of these threshold breaches by hand, flagging which alerts actually caused an outage to the service and adjusting the outage window on those that did to reflect the actual outage period. However, this is extremely manual, and may affect your IT Availability metrics.
The availability stats straight out of your monitoring tool are your “IT Availability”. They will be nowhere near matching the targets set in your business-defined SLAs, and will most likely show figures around the 94% mark for an average service with the occasional problem.
IT Staff can use these figures to get an overall look at how a service is performing, and roughly how many issues are occurring without losing the details that are lost in a business availability report.
At a quick glance, an IT Availability figure of 90% tells the reader that there have been issues through the month, whereas a Business Availability figure may show 100% and completely hide the fact that breaches have occurred.
A service may have a 100% business availability, but have an 80% IT Availability. This could be due to a disk counter threshold sitting at 96% for a full week of the month. IT Staff may have determined the disk was not going to fill any more, and was not at risk of bringing the service down, therefore did not react to the alert immediately.
The 80% figure at the end of the month, however, will show that there was an outstanding issue for a large portion of time. Trending back over longer periods may show whether this service has recurring issues each month, and can highlight the need for future infrastructure upgrades, even when the service itself has never been down and the business is still seeing 100% service availability.
Calculating availability is fairly simple. The formula for calculating the monthly availability of a service is as follows:
100 – [(Outage Mins For Month / Total Mins In Month) x 100]
This formula applies to both IT and Business Availability; however, IT Availability will obviously have larger numbers in the ‘Outage Mins For Month’ part.
We use minutes instead of hours so that the final figure is more granular and can be expressed to 2 decimal places, such as 99.52% instead of just 99%.
For example, suppose a service has 175 minutes of downtime during a 30-day month:
- 30 days = 43,200 Minutes
- (175 / 43200) x 100 = 0.405 (Rounded to 3 decimal places)
- 100 – 0.405 = 99.595
- Availability = 99.59% (taken from the unrounded 99.5949, hence 99.59% rather than 99.60%)
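The same worked example can be expressed as a small function. This is only a sketch; the function and parameter names are assumptions.

```python
def availability_percent(outage_minutes: float, days_in_month: int) -> float:
    """100 - [(outage minutes for month / total minutes in month) x 100]."""
    total_minutes = days_in_month * 24 * 60
    return 100.0 - (outage_minutes / total_minutes) * 100.0


# 175 minutes of downtime in a 30-day month (43,200 minutes)
print(f"{availability_percent(175, 30):.2f}%")  # 99.59%
```

Rounding only at the final formatting step avoids the small drift that comes from rounding intermediate values.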
The Actual Downtime Figures For A Service – Using Incident Records
Getting the actual outage times for a service to compile a business availability report, and not just the recorded threshold breaches can be tricky. In most organizations, there will be some form of ITIL implementation, to record and grade incidents that occur. Typically, a priority 1 and 2 incident indicates a full service outage, or at least a large portion of the service down. Priority 3 and 4 is reserved for slight service degradation and general faults that aren’t causing an outage.
If your company does not implement ITIL, that’s fine, but you’ll most likely have some mechanism for handling issues that occur and recording the outcomes. This can be used in the same way. For this article though, I will be using ITIL terminology (i.e. Incidents).
It’s good practice to implement “Outage Start” and “Outage Finish” fields in your incident records, and to force the fields to be filled in before a priority 1 or 2 incident can be closed. Typically these would be entered by the company incident manager or some designated party rather than leaving it up to any random engineer; this way you can be assured of data quality and consistency.
What you end up with is clean-cut records at the end of the month that you can query against to get the cumulative outage times. These outage times will have been manually entered by the incident manager, who has been involved with the incident from the start and is aware of the overall outage time and impact to the business, as opposed to, say, the responding engineer, who may have only known about the outage from the point at which they were engaged.
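As a rough illustration of querying such records, the sketch below sums outage minutes for one service from priority 1 and 2 incidents opened in a given month. The `Incident` structure, its field names and the sample data are all hypothetical stand-ins for whatever your service management tool exports.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass
class Incident:
    service: str
    priority: int
    opened: datetime
    outage_start: Optional[datetime] = None   # the "Outage Start" field
    outage_finish: Optional[datetime] = None  # the "Outage Finish" field


def outage_minutes(incidents: Iterable[Incident], service: str,
                   year: int, month: int) -> float:
    """Cumulative outage minutes for P1/P2 incidents opened in the given month."""
    total = 0.0
    for inc in incidents:
        if inc.service != service or inc.priority > 2:
            continue  # P3/P4 are treated as degradation, not outage
        if (inc.opened.year, inc.opened.month) != (year, month):
            continue
        if inc.outage_start is None or inc.outage_finish is None:
            continue  # assumption: records missing outage times are skipped
        total += (inc.outage_finish - inc.outage_start).total_seconds() / 60
    return total


# Hypothetical sample records
records = [
    Incident("Payroll", 1, datetime(2024, 4, 3, 8, 5),
             datetime(2024, 4, 3, 8, 0), datetime(2024, 4, 3, 10, 0)),
    Incident("Payroll", 2, datetime(2024, 4, 17, 14, 0),
             datetime(2024, 4, 17, 13, 50), datetime(2024, 4, 17, 14, 45)),
    Incident("Payroll", 3, datetime(2024, 4, 20, 9, 0)),  # slow responses, no outage
]

total = outage_minutes(records, "Payroll", 2024, 4)
print(total)  # 175.0
```

In practice you may prefer to reject P1/P2 records with missing outage fields rather than skip them, since a silent gap would overstate availability.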
Ok, Let’s Get Real, Show Me Some Code
Alright, so you’ve got these figures. You’ve got the real outage times per month attached to your incident records.
Your incident records will contain the following key bits of information
- The service affected
- Outage start
- Outage end
- Opened time
Assuming the standard scenario, we’re looking to generate availability stats for the last calendar month. We’ll use these fields to query for all incidents in the past month for a specific service, then perform a DateDiff comparison and apply our availability formula to get the final percentage figure.
The following example could be used, provided you adjust the field and table names to match your table structure. The syntax is T-SQL (SQL Server), but the query doesn’t rely on any particular product’s schema. Provided you have a table somewhere with the information listed above, you should be able to adapt this query to your dataset.
-- Create Variables, And Calculate The Number Of Days In Last Calendar Month
DECLARE @DAYSCOUNT INT, @MONTHMINS FLOAT
-- First Day Of The Current Month Minus One Day = Last Day Of Last Month
SET @DAYSCOUNT = datepart(dd, dateadd(dd, -1, cast(cast(year(getdate()) as varchar) + '-' + cast(month(getdate()) as varchar) + '-01' as datetime)))
-- Calculate The Number Of Minutes In The Month
SET @MONTHMINS = @DAYSCOUNT * 24 * 60
-- Select The ServiceName And Calculate The Availability Percentage Based On The Table 'INCIDENTS_TABLE'
SELECT ServiceName,
    CAST(100 - (SUM(DATEDIFF(MINUTE, Outage_Start, Outage_End)) / @MONTHMINS) * 100 AS DECIMAL(5,2)) AS Availability
FROM INCIDENTS_TABLE
-- Filter Records To Only Those Opened Last Month
WHERE DATEDIFF(MONTH, Opened_Time, GETDATE()) = 1
-- Filter To Your Specific Service
AND ServiceName = 'Some Service Name'
-- Group By ServiceName, As It Is Selected Alongside The SUM Aggregate
GROUP BY ServiceName
The above query would output the following: