ITIL v3 Foundation Certification Notes: Service Operation [2]
[ITIL® v3 Foundation Notes] Two further processes of the Service Operation phase for the ITIL® v3 Foundation Certification exam are covered here: the Incident Management Process and the Problem Management Process. The purpose, objectives, and scope of the processes and their importance in the Service Operation lifecycle stage are addressed. Definitions of some generic concepts are also discussed: incident, impact, urgency and priority, problem, workaround, known error and known error database (KEDB).
Article Highlights
Requests vs Incidents vs Problems
- Requests are not incidents as no service has been impacted
- Incident Management Process and Problem Management Process are two of the most important processes in ITIL® and are often the first ones to be implemented
- Incident Management – fix faults as quickly as possible to resume service; incidents will NEVER become problems
- Problem Management – find the root cause to prevent faults from happening again, to improve overall quality and free up resources needed to deal with repeated incidents
Incident Management Process
- [definition] An incident is defined as an unplanned interruption to an IT service, a reduction in the quality of an IT service, or a failure of a CI (configuration item) that has not yet impacted an IT service.
- Incident Management is responsible for progressing all incidents from reporting to closing – usually the responsibility of service desk.
- Purpose
- to restore normal service operation (defined in SLA) as soon as possible and minimize impact to business operations
- Objectives
- ensure all incidents are responded to (logged, managed, resolved and reported) efficiently with standard procedures according to business priority
- improve customer satisfaction
- Scope
- handle all incidents (event which disrupts, or which could disrupt, a service), either by service desk reports or event management tool alerts
- Concepts and definitions
- Timescales – time is of the essence, so the time taken needs to be logged and reviewed for improvement
- Incident Models – incident templates with the necessary steps to resolve common incidents, allow faster resolution (stored in the SKMS)
- Major Incidents – define what constitutes a major incident and follow pre-defined procedures, need to inform users on the progress
- Incident Status – the current status of the incident
- Open – identified and logged
- Assigned – sent to a support team
- Allocated or In Progress – a support technician has been allocated
- On Hold – cannot contact the user
- Resolved – completed the work but not confirmed by the customer or awaiting automatic closure
- Closed – accepted by the user
- Expanded Incident Lifecycle – used by the service design availability management process and within CSI, breaks down each step of an incident to examine its impact more closely
- Impact – a measure of the effect of an incident, problem, or change on business processes. Impact is often based on how service levels will be affected. Impact and urgency are used to assign priority
- Urgency – a measure of how long it will be until an incident, problem or change has a significant impact on the business
- Priority – a category used to identify the relative importance of an incident, problem or change, based on impact and urgency. High priority (Priority 1) is given to an incident with high impact and high urgency.
- Lifecycle of Incidents
- Incident Identification – realize an incident before the user notices / reports with event management (a reactive process)
- Incident Logging – log ALL incidents for service-level management reporting and problem management
- unique reference number
- incident category, impact, urgency and priority, symptoms, steps to resolution and known errors
- time from logging to closure
- how to identify
- Incident Categorization – use a simple categorization for effective implementation
- Incident Prioritization – consider business impact and urgency, to be completed in a pre-agreed time depending on the priority, may change during the lifecycle
- Initial Diagnosis – the service desk to diagnose the fault and try to resolve it with the known error database (by problem management), incident models or other tools (incident matching)
- Incident Escalation – the incidents are owned by service desk (need to track till closure)
- functional escalation – service desk unable to solve the incident within a given time
- hierarchic escalation – inform management of major incidents / incidents not progressing based on SLA target time
- Investigation and Diagnosis – try to find out what has happened and how to resolve
- Resolution and Recovery – test potential resolutions to ensure the incident has been solved without causing adverse consequences
- Incident Closure – contact the user to verify the resolution, review the categorization and finish the documentation. Closed incidents may be re-opened if the incident resurfaces. Any appropriate function can close the incident.
- Interfaces with other stages
- [Service Design] Service Level Management, Information Security Management, Capacity Management, Availability Management
- [Service Transition] Change Management, Service Asset and Configuration Management – to identify impact of incidents
- [Service Operation] Problem Management, Access Management – security breaches / unauthorized access
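The way priority is derived from impact and urgency can be sketched in a few lines of Python. This is only an illustration: the three-level scale, the 1 to 5 priority range and the names Level and priority are assumptions made for the example, not definitions from ITIL®. The only rule taken from the notes above is that high impact plus high urgency gives Priority 1.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-level scale used for both impact and urgency."""
    HIGH = 1
    MEDIUM = 2
    LOW = 3


def priority(impact: Level, urgency: Level) -> int:
    """Return a priority code from 1 (highest) to 5 (lowest).

    Assumed matrix: each step down in impact or urgency lowers the
    priority by one, so HIGH/HIGH is 1 and LOW/LOW is 5.
    """
    return int(impact) + int(urgency) - 1


# Priority may change during the incident lifecycle, so it is simply
# recomputed whenever impact or urgency is reassessed.
initial = priority(Level.MEDIUM, Level.LOW)      # logged as a minor fault
reassessed = priority(Level.HIGH, Level.HIGH)    # later found to block a key service
print(initial, reassessed)
```

A real organization would agree its own matrix and target resolution times per priority with the business, typically as part of the SLA.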
Problem Management Process
- [definition] A problem is defined as an underlying cause of one or more incidents. The cause is not usually known at the time a problem record is created, and the problem management process is responsible for further investigation.
- [definition] A known error is a problem that has a documented root cause and a workaround. Known errors are created and managed throughout their lifecycle by problem management. Known errors may also be identified by development or suppliers.
- [definition] A workaround is a way of reducing or eliminating the impact of an incident or problem for which a full resolution is not yet available. Workarounds for known errors are documented in known error records. The problem will remain open in this case as the problem is not fully resolved.
- Problem Management is the process that investigates the root cause of incidents and implements a permanent solution / workaround to prevent them from happening again
- Not visible to the users / business
- Incidents will not become problems, they must be handled separately
- Although incident and problem management are separate processes, they are closely related and will typically use the same tools, and may use similar categorization, impact and priority coding systems. This will ensure effective communication when dealing with related incidents and problems.
- The time to resolve a problem cannot be defined in the SLA
- Purpose
- to document, investigate, and remove causes of incidents
- to provide workarounds
- Objectives
- prevent problems from happening
- eliminate recurring incidents
- minimize impact of incidents that cannot be prevented
- Scope
- diagnose the root cause of incidents
- take steps to eliminate them (with other processes, in particular change management process)
- document problems, workarounds and resolutions (maintain the known error database) for more effective handling of similar incidents
- Output
- Known errors (and entry to KEDB)
- Workarounds
- Resolutions (may include RFCs)
- Concepts and definitions
- Reactive and Proactive Activities – triggered by incident reports (reactive) / analysis of incident trends (proactive)
- Problem Models – handle problems that have not and will not be resolved (e.g. the cost of a permanent resolution is too high) by some pre-defined workaround
- Lifecycle of Problem Management
- Detecting Problems – identify problems in reactive / proactive ways
- Logging Problems – log in the problem record (link to the incidents)
- Categorizing Problems – same categorization as incident management
- Prioritizing Problems – depends on impact and urgency
- Investigating and Diagnosing Problems – uses CMS and KEDB
- [in some cases] Identifying a Workaround – provides the workaround to service desk for resolving the incident and reassesses the priority
- Raising a Known Error Record – recorded for future reference after the root cause has been identified and a workaround/solution found
- Problem Resolution – implement the solution through change management (as emergency change)
- Problem Closure – a permanent solution has been tested and implemented so that the problem will not occur again (user confirmation NOT needed)
- Major Problem Review – lessons learned for proactive problem detection
- Interfaces with other stages
- [Service Strategy] Financial Management for IT Services – to determine whether solution is financially justified
- [Service Design] Availability Management, Capacity Management, IT Service Continuity Management, Service Level Management – problem management supplies the information for solving problems handled by these processes
- [Service Transition] Change Management, Service Asset and Configuration Management – to identify impact of problems, Release and Deployment Management – implement the change, Knowledge Management – KEDB
- [Continual Service Improvement] The Seven-Step Improvement Process – actions are entered into CSI register
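How a known error record supports incident matching at the service desk can be sketched as below. This is a minimal illustration, not how any particular ITSM tool works: the class names, the fields, the symptom-keyword matching and the sample record "PRB-001" are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class KnownError:
    """Hypothetical known error record: a problem with a documented
    root cause and workaround."""
    problem_id: str
    root_cause: str
    workaround: str
    symptoms: set[str] = field(default_factory=set)


class KnownErrorDatabase:
    """Toy in-memory KEDB maintained by problem management."""

    def __init__(self) -> None:
        self._records: list[KnownError] = []

    def raise_known_error(self, record: KnownError) -> None:
        # Called once a root cause and a workaround have been documented.
        self._records.append(record)

    def match(self, incident_symptoms: set[str]) -> list[KnownError]:
        """Return known errors sharing at least one symptom with the
        incident, ordered by number of shared symptoms (most first)."""
        hits = [r for r in self._records if r.symptoms & incident_symptoms]
        return sorted(
            hits,
            key=lambda r: len(r.symptoms & incident_symptoms),
            reverse=True,
        )


kedb = KnownErrorDatabase()
kedb.raise_known_error(
    KnownError(
        problem_id="PRB-001",  # made-up identifier
        root_cause="mail gateway runs out of disk space",
        workaround="route mail through the backup gateway",
        symptoms={"email", "timeout"},
    )
)

# Initial diagnosis: the service desk looks up the symptoms of a new incident.
matches = kedb.match({"email", "timeout", "laptop"})
for known_error in matches:
    print(known_error.problem_id, "->", known_error.workaround)
```

The problem behind a matched known error stays open until a permanent resolution has been implemented; the workaround only lets the service desk resolve the related incidents faster.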
Conclusion: ITIL® v3 Foundation Service Operation
This ITIL® v3 Foundation study note touches upon the definition, purpose, objectives and scope of two important processes of Service Operation, namely the Incident Management Process and the Problem Management Process. These two processes also work with processes in other stages of the service lifecycle to provide high quality IT services. Key ITIL® concepts are examined, including: incident, impact, urgency, priority, problem, workaround, known error, known error database (KEDB).
Next in our series of study notes, we will cover the Event Management Process, Request Fulfillment Process and Access Management Process of the Service Operation stage in ITIL®.