An Approach to Availability Management, Part 2

Posted by Scott Braden
on January 30, 2006
Category: ITIL Implementation

It’s been another incredibly busy couple of weeks for the Real World ITIL team. It’s a good moment though to record a few more thoughts on our current project, so we’ll continue our story about Availability Management (AM) for this entry. If you’re just joining us, you might wish to read the first entry in this series, because we’ll be referring to the project phases described there.

Before we get started, though, let’s pause to say thanks to Robin Yearsley for his response to one of our recent entries. We’re happy to direct our readers to his new search engine at http://www.ITServiceToday.com, and also to congratulate Robin’s Dr. ITIL website on its first anniversary! May cyberspace be long graced with its ‘onlineness’.

And, as always, we invite thoughtful commentary from all of our readers. Just click on the Comment link below. We’ll be happy to respond to your thoughts.

So why are we focusing on AM in particular? The company-in-question operates financial trading systems as an essential part of its business. In this kind of business, every minute of unplanned downtime might cost millions. You can imagine that the availability of their production systems is a very serious matter indeed.

In fact, before we continue our discussion any further, we must note that the systems at the company-in-question have an excellent record of availability. We’ve seen it in action first-hand: their production environment without question performs extremely well, and the staff maintaining it is exceedingly competent. This is important to keep in mind.

Therefore, throughout this multi-part case study, we are not discussing how to correct a problematic situation but rather how to continuously improve operations in an environment where the stakes for effective infrastructure service delivery are very high.

Getting back to our story, this concern over availability obviously applies to the realtime transaction applications used by traders during the business day. But (perhaps less obviously) it also applies to a number of other supporting technologies, such as the systems that reconcile the company’s books at night and the networks that transport transactions to clearing houses, etc.

As we’ve talked with our client’s staff, however, it’s been interesting to note that, while no one in the company would disagree in principle with our statements in the paragraph above, people in different positions have different perspectives on what our focus should be when attempting to ensure adequate availability of production systems.

For example, it hasn’t necessarily been clear to staff-level server engineers why process improvements are needed in order to improve availability. After all, they’ve invested a great deal of money, thought and effort in designing a fault-tolerant infrastructure.

On the other hand, management (who know this fact) wonder why they must still sometimes invest thought and effort in dealing with the cumulative business consequences of whatever availability issues may have occurred, despite the fact that their applications are running on expensive fault-tolerant platforms. Is there nothing else that can be done to improve uptime?

We often speak of the delivery of infrastructure services as consisting of three components: people, process and technology. When considering how to make availability improvements, we realize that the people part (human error) can never be completely eliminated, no matter how competent the staff. As for technology, if your company has already invested in fault-tolerant platforms, then there is limited opportunity for improvement there, at least not without unduly increasing capital cost. This, of course, leaves process improvement as the best way to improve availability without spending more money.

So here’s where the ITIL framework enters the story.

As we go on, keep in mind the fact that the applications group here submits nearly 3,000 change requests each week. This high rate-of-change is driven by a need to maintain a competitive edge in business applications as well as several other factors such as regulatory changes that may affect applications and data storage.

Given this rate-of-change for production, the company has found it challenging to arrange dedicated windows for performing preventative maintenance for the purpose of supporting existing availability standards. Therefore one of our highest priorities in Phase 1 (taking advantage of the fact that we are also re-engineering the Change Management processes at the same time) is to ensure through appropriate negotiations that the ‘windows’ of planned downtime defined for application changes also allow formally-defined times for activities that serve to improve availability.

We must also ensure that these two types of windows, once defined, are aligned with the uptime ‘promises’ made to customers by Service Level Management through the Service Catalog. This alignment will help to correct a tacit, longstanding, ‘unwritten rule’ between end-users and the infrastructure group that all systems will be made available 24 x 7 x 365 regardless of criticality to the business. This informal cultural understanding will be replaced by a formal, businesslike policy aimed at providing suitable systems availability in a cost-effective manner.
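To make the idea of alignment a little more concrete, here is a minimal Python sketch of the arithmetic involved. Everything in it is an assumption for illustration: the `ServiceCommitment` fields, the function names and the example figures are hypothetical and are not taken from the client’s Service Catalog. It also assumes that planned windows are scheduled outside the agreed service hours and so do not count against the availability target, which is a common convention but one that each organization must settle for itself.

```python
from dataclasses import dataclass

MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080 minutes in a calendar week


@dataclass(frozen=True)
class ServiceCommitment:
    """Hypothetical Service Catalog entry; field names are illustrative only."""
    name: str
    agreed_service_minutes_per_week: int  # time the service is promised to be up
    availability_target: float            # e.g. 0.999 for "three nines"


def allowed_unplanned_downtime(commitment: ServiceCommitment) -> float:
    """Minutes of unplanned downtime per week that still meet the target."""
    return commitment.agreed_service_minutes_per_week * (1 - commitment.availability_target)


def window_budget(commitment: ServiceCommitment) -> int:
    """Minutes per week that fall outside the agreed service hours."""
    return MINUTES_PER_WEEK - commitment.agreed_service_minutes_per_week


def windows_fit(commitment: ServiceCommitment,
                change_window_minutes: int,
                maintenance_window_minutes: int) -> bool:
    """True if both kinds of planned window fit outside the agreed service hours."""
    planned = change_window_minutes + maintenance_window_minutes
    return planned <= window_budget(commitment)


# Assumed example: a system promised around the clock leaves no room for windows.
always_on = ServiceCommitment("always-on system (assumed)", MINUTES_PER_WEEK, 0.999)

# Assumed example: a nightly batch system promised 20 hours a day, 6 days a week.
batch = ServiceCommitment("nightly batch (assumed)", 6 * 20 * 60, 0.995)

if __name__ == "__main__":
    for service in (always_on, batch):
        print(service.name,
              "| unplanned budget (min/week):", round(allowed_unplanned_downtime(service), 2),
              "| window budget (min/week):", window_budget(service),
              "| 4h change + 2h maintenance fits:", windows_fit(service, 240, 120))
```

The point of the sketch is the first example: under a blanket 24 x 7 x 365 promise the budget for planned windows is zero, so any change or maintenance window breaks the promise by definition. Differentiating the agreed service hours by business criticality is what creates room for both kinds of window.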

Another high priority in Phase 1 has been to formally establish an authoritative Availability Plan, in the form of a ‘living’ online document, whose Table of Contents will look something like (in first draft, subject to change) the following:

I. Introduction & Executive Summary

II. Availability Management Mission
…. a. AM Goals & Objectives
…. b. Availability Management Board charter
…. c. Availability Architecture Board charter
…. d. Availability Manager job description
…. e. AM workflow maps

III. Improvement Activities
…. a. General Maintenance Processes & Improvement Plans
…. b. Specific Availability Improvement Initiatives

IV. AM Guidelines
…. a. A record of ‘lessons learned’
…. b. Maintenance schedule definitions
…. c. Maintenance window work protocols
…. d. Security-related standards & guidance
…. e. Fault analysis methods & procedures

V. Interface Protocols to Other ITIL Areas
…. a. Service Level Management
…. b. Service Continuity Management
…. c. Financial Management
…. d. Incident Management
…. e. Problem Management
…. f. Capacity Management
…. g. Change Management

Once the Availability Plan has been written, we will of course have to advertise its publication and teach people how (and why) to use it. An Availability Manager, a full-time, mid-level manager whom the organization intends to appoint under the executive function for IT Infrastructure, will oversee this. Availability Management, after all, is really one part of the headquarters function of IT: something that needs to apply to all systems and processes.

Has your organization implemented an Availability Plan? Do you agree or disagree with the Table of Contents above? Share your thoughts by clicking on the Comment link below.

Well, that’s all the time we have for blogging this week. We hope this information has been useful and invite you to share your thoughts on Availability Management, too. Next time, we’ll wrap up this discussion and then move on to new topics.

Until next time, thanks for reading Real World ITIL!

Regards,
Scott (your moderator)

© 2005 - 2008 Evergreen Systems, Inc, a provider of ITIL consulting and other IT process improvement services for Fortune 500 clientele. All rights reserved.