It’s been another incredibly busy couple of weeks for the Real World ITIL
team. It’s a good moment though to record a few more thoughts on our
current project, so we’ll continue our story about Availability
Management (AM) for this entry. If you’re just joining us, you might
wish to read the first entry in this series, because we’ll be referring to the project phases described there.
Before we get started, though, let’s pause to say thanks to Robin Yearsley for his response to one of our recent entries. We’re happy to direct our readers to his new search engine at http://www.ITServiceToday.com, and also to congratulate Robin’s Dr. ITIL website on its first anniversary! May cyberspace be long graced with its ‘onlineness’.
And, as always, we invite thoughtful commentary from all of our readers. Just click on the Comment link below. We’ll be happy to respond to your thoughts.
So why are we focusing on AM in particular? The company-in-question
operates financial trading systems as an essential part of its
business. In this kind of business, every minute of unplanned downtime
might cost millions. You can imagine that the availability of their
production systems is a very serious matter indeed.
In fact, before we continue our discussion any further, we must note
that the systems at the company-in-question have an excellent record of
availability. We’ve seen it in action first-hand: their production
environment without question performs extremely well, and the staff
maintaining it is exceedingly competent. This is important to keep in
mind.
Therefore, throughout this multi-part case study, we are not
discussing how to correct a problematic situation but rather how to
continuously improve operations in an environment where the stakes for
effective infrastructure service delivery are very high.
Getting back to our story, this concern over availability obviously
applies to the realtime transaction applications used by traders during
the business day. But (perhaps less obviously) it also applies to a
number of other supporting technologies, such as the systems that
reconcile the company’s books at night and the networks that transport
transactions to clearing houses, etc.
As we’ve talked with our client’s staff, however, it’s been
interesting to note that, while no one in the company would disagree in
principle with our statements in the paragraph above, people in
different positions have different perspectives on what our focus
should be when attempting to ensure adequate availability of production
systems.
For example, it hasn’t necessarily been clear to staff-level server
engineers why process improvements are needed in order to improve
availability. After all, they’ve invested a great deal of money,
thought and effort in designing a fault-tolerant infrastructure.
On the other hand, management (who know this fact) wonder why they
must still sometimes invest thought and effort in dealing with the
cumulative business consequences of whatever availability issues may
have occurred, despite the fact that their applications are running on
expensive fault-tolerant platforms. Is there nothing else that can be
done to improve uptime?
We often speak of the delivery of infrastructure services as
consisting of three components: people, process and technology. When
considering how to make availability improvements, we realize that the
people part (human error) can never be completely eliminated, no
matter how competent the staff. As for technology, if your company has
already invested in fault-tolerant platforms, then there is limited
opportunity for improvement there - certainly so without unduly
increasing capital cost. This, of course, leaves process
improvement as the best way to improve availability without spending
more money.
So here’s where the ITIL framework enters the story.
As we go on, keep in mind the fact that the applications group here
submits nearly 3,000 change requests each week. This high
rate-of-change is driven by a need to maintain a competitive edge in
business applications as well as several other factors such as
regulatory changes that may affect applications and data storage.
Given this rate-of-change for production, the company has found it
challenging to arrange dedicated windows for performing preventative
maintenance for the purpose of supporting existing availability
standards. Therefore one of our highest priorities in Phase 1 (taking
advantage of the fact that we are also re-engineering the Change
Management processes at the same time) is to ensure through appropriate
negotiations that the ‘windows’ of planned downtime defined for
application changes also allow formally-defined times for activities
that serve to improve availability.
We must also ensure that these two types of windows, once defined, are aligned with the
uptime ‘promises’ made to customers by Service Level Management through
the Service Catalog. This alignment will help to correct a tacit,
longstanding, ‘unwritten rule’ between end-users and the infrastructure
group that all
systems will be made available 24 x 7 x 365 regardless of criticality
to the business. This informal cultural understanding will be replaced
by a formal, businesslike policy aimed at providing suitable systems
availability in a cost-effective manner.
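To make the alignment between planned windows and uptime promises a little more concrete, here is a minimal Python sketch using the common ITIL-style availability calculation: agreed service time minus unplanned downtime, divided by agreed service time. The function names, the window lengths and the downtime figure are all hypothetical illustrations, not numbers from our client.

```python
def agreed_service_hours(total_hours, planned_window_hours):
    """Hours the service is promised to be up: total hours minus planned windows."""
    return total_hours - sum(planned_window_hours)


def availability_percent(agreed_hours, unplanned_downtime_hours):
    """Availability as (agreed service time - downtime) / agreed service time."""
    if agreed_hours <= 0:
        raise ValueError("agreed service hours must be positive")
    return 100.0 * (agreed_hours - unplanned_downtime_hours) / agreed_hours


# Hypothetical week: 168 hours, with one 4-hour window for application
# changes and one 2-hour window for preventative maintenance.
WEEK_HOURS = 168
windows = [4, 2]

agreed = agreed_service_hours(WEEK_HOURS, windows)
pct = availability_percent(agreed, unplanned_downtime_hours=0.5)
print(f"agreed service hours: {agreed}, availability: {pct:.2f}%")
```

The point of the sketch is that planned windows reduce the agreed service time rather than counting against availability, which is why they need to be written into the Service Catalog instead of being left to an unwritten 24 x 7 x 365 expectation.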
Another high priority in Phase 1 has been to formally establish an
authoritative Availability Plan, in the form of a ‘living’ online
document, whose Table of Contents will look something like (in first
draft, subject to change) the following:
I. Introduction & Executive Summary
II. Availability Management Mission
…. a. AM Goals & Objectives
…. b. Availability Management Board charter
…. c. Availability Architecture Board charter
…. d. Availability Manager job description
…. e. AM workflow maps
III. Improvement Activities
…. a. General Maintenance Processes & Improvement Plans
…. b. Specific Availability Improvement Initiatives
IV. AM Guidelines
…. a. A record of ‘lessons learned’
…. b. Maintenance schedule definitions
…. c. Maintenance window work protocols
…. d. Security-related standards & guidance
…. e. Fault analysis methods & procedures
V. Interface Protocols to Other ITIL Areas
…. a. Service Level Management
…. b. Service Continuity Management
…. c. Financial Management
…. d. Incident Management
…. e. Problem Management
…. f. Capacity Management
…. g. Change Management
Once the Availability Plan has been written, we will of course have
to advertise its publication and teach people how (and why) to use it.
An Availability Manager, a full-time, mid-level manager whom the
organization intends to appoint under the executive function for IT
Infrastructure, will oversee this. Availability Management, after all,
is really one part of the headquarters function of IT - something that
needs to apply to all systems and processes.
Has your organization implemented an Availability Plan? Do
you agree or disagree with the Table of Contents above? Share your
thoughts by clicking on the Comment link below.
Well, that’s all the time we have for blogging this week. We hope
this information has been useful and invite you to share your thoughts
on Availability Management, too. Next time, we’ll wrap up this
discussion and then move on to new topics.
Until next time, thanks for reading Real World ITIL!
Regards,
Scott (your moderator)