Systems failure

Systems failure is a significant business risk

Our growing reliance on technology means it is increasingly common for organisations to suffer systems failure, causing significant disruption to their operations. Recent examples of how companies have been impacted by systems failure include:

An airline faced a computer outage caused by a fire affecting its data centre at the group’s headquarters. Around 2,300 flights were cancelled and hundreds of thousands of passengers were delayed.
A bank experienced a systems failure in which 600,000 customer payments and direct debits went missing. The failure was caused by the bank’s IT infrastructure struggling to deal with traffic volumes.
A large metals producer saw systems fail after a ransomware attack, forcing it to revert to manual operations on some processes. The estimated cost of the incident for the company was $50 million.

Such incidents have the potential to cause severe financial, operational and reputational problems for the organisation, including high costs associated with managing the fallout.

Consumer reactions and regulatory responses to a system failure can result in lost customer revenues and substantial reputational damage. News of the failure can spread extremely quickly on social media, and at the same time regulators’ expectations are increasing. As a result, effective and rapid situational response and crisis management need to be a strategic priority.

Extraordinary challenges

Rapidly and effectively responding to a significant systems failure can be challenging. Some of the common obstacles include:

Lack of assigned responsibility and accountability regarding who owns different elements of the crisis response.
Inability to execute an integrated, cross-functional response.
Lack of experienced, knowledgeable and available experts who can immediately assist with the issue.
Inability to rapidly ramp up resource in call centre teams to respond to the surge in both customer and supplier inquiries.
Difficulty obtaining the data and information needed to organise the response, for example to answer requests for information from the regulator.
Inability to accurately track and manage the costs or scope of the failure, due to system and data challenges.

These challenges highlight the complexity of managing the fallout of a significant systems failure and of rapidly mobilising an effective remediation operation.

Four pillars of effective systems failure response

Taking definitive action within the first 48 hours is critical. In our experience, many companies are unprepared and lose time during this vital period. They focus on organising a response team and obtaining information to enable decisions to be made, whilst the story continues to progress unchecked. As a result, the issue continues to grow. Below, we identify the four pillars of successful systems failure response.

Resilience planning and process design

The remediation team should review and benchmark the current IT service management processes, architectural management approach, service development method and technologies that contribute to the IT resilience capability.

The team should also perform end-to-end IT service risk mapping, conducting a series of deep dives to develop risk maps that illustrate where and why technology resilience risks are concentrated in the IT estate.

Information management

Companies should be able to:

assess the scope of a failure.
track and document remediation activity.
respond quickly and accurately to requests for data from the regulator or other external stakeholders.
track all remediation costs and KPIs.

This requires robust information and project management platforms.
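As a purely illustrative sketch of what tracking remediation activity and costs could look like in practice, the Python snippet below logs activities and totals their cost per workstream. The class names, fields and workstream labels are assumptions made for this example; they do not describe any particular platform.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import date


@dataclass
class RemediationActivity:
    """One logged remediation action (all field names are illustrative)."""
    day: date
    workstream: str   # hypothetical label, e.g. "call centre"
    description: str
    cost: float       # cost in the organisation's reporting currency


@dataclass
class RemediationLog:
    """In-memory log of remediation activity for a single incident."""
    activities: list = field(default_factory=list)

    def record(self, activity):
        """Append one activity to the log."""
        self.activities.append(activity)

    def total_cost(self):
        """Sum the cost of every logged activity."""
        return sum(a.cost for a in self.activities)

    def cost_by_workstream(self):
        """Total cost per workstream, as a plain dict."""
        totals = defaultdict(float)
        for a in self.activities:
            totals[a.workstream] += a.cost
        return dict(totals)

    def activities_between(self, start, end):
        """Activities dated from `start` to `end` inclusive, e.g. for a regulator request."""
        return [a for a in self.activities if start <= a.day <= end]
```

A real platform would add persistence, access control and an audit trail; the point here is only that scope, activity and cost questions become simple queries once each action is recorded in a consistent structure.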

Understanding the technical estate

A rapid understanding of the end-to-end technical estate is essential. In particular, the response team needs to develop a view of how the technologies in place operate and interact, so that any interdependencies are identified and further systems risks are mitigated during an incident.
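One way to make interdependencies explicit is to record them as a dependency map and then ask which services are affected when a given component fails. The Python sketch below assumes a small, entirely hypothetical estate; the component names and the `DEPENDS_ON` structure are invented for illustration.

```python
from collections import deque

# Hypothetical estate: each component maps to the components it depends on.
DEPENDS_ON = {
    "online-banking": ["payments-api", "auth-service"],
    "payments-api": ["middleware", "core-db"],
    "auth-service": ["core-db"],
    "middleware": ["dc-lan"],
    "core-db": ["dc-lan"],
    "dc-lan": [],
}


def invert(depends_on):
    """Build the reverse map: component -> components that depend on it directly."""
    dependents = {name: [] for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(name)
    return dependents


def impacted_by(failed, depends_on):
    """Return every component that directly or transitively depends on `failed`."""
    dependents = invert(depends_on)
    seen = set()
    queue = deque([failed])
    # Breadth-first walk up the dependency chain from the failed component.
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


if __name__ == "__main__":
    print(sorted(impacted_by("core-db", DEPENDS_ON)))
```

In this made-up estate, a failure of the data centre LAN reaches every other component, which is the kind of concentration of risk that a mapping exercise is intended to surface.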

Stakeholder communications

Companies need to communicate effectively with regulatory agencies and other key stakeholders, including employees, customers, suppliers, insurance carriers, investors, and the board. The response team must determine what should be communicated, how and when.

How PwC can help

Rapid systems review

We can quickly understand the end-to-end technology landscape relevant to the incident; for example:

Impacted applications and their dependencies, e.g. middleware.
Data centres and infrastructure hosting environments.
Third-party cloud providers.
Network dependencies, including data centre LAN and WAN links.

Crisis response

We can set up and run a crisis ‘war room’ to lead your crisis response, or shore up your existing strategic and operational capacity when needed. We can help prepare or review your crisis response strategy, governance, crisis communications and stakeholder management plans. We can mobilise within hours to provide operational, regulatory and legal support, as well as technical analysis.

Data capture

When a systems failure occurs, a successful response is dependent on your ability to access all relevant data. Identifying structured data sources in a complex environment calls for particular technical capabilities and technology tools. Our specialist team are experts in capturing data, from ERP systems to trading and point of sale systems.

Data analytics

When a crisis hits, understanding large and complex datasets can be difficult, yet a successful outcome is dependent on your ability to analyse them and draw meaningful conclusions. Our skilled data scientists are experts in analytics and visual representation of data. We make your complex data easy to use and understand.

When to get in touch

Systems failure can be extremely complex and highly disruptive. PwC can rapidly scale up your response to limit the financial, regulatory and reputational impact. To find out more, please get in touch with our dedicated team.

How PwC has supported clients in crisis

International aerospace company

Nature of failure: IT outage that significantly disrupted the company’s operations.

How we helped: PwC was engaged to undertake a rapid post-incident technical and operational review to help isolate the root cause of the incident and provide a set of recommendations to help prevent similar outages.

We provided a chronological incident report mapping the end-to-end technology landscape and assessing the current delivery model, including the functions provided by third-party providers.

A major government organisation

Nature of failure: Significant IT outage that caused systems to be inoperable for ten days.

How we helped: PwC was engaged to investigate the incident. We found that there were multiple single points of failure in the hosting technology, which the client understood to be resilient. We also found that key service and supplier management processes were not sufficient to manage system resilience. We then helped the client through the remediation programme, from both a technology and process perspective.