TSB Lessons Learned

In April 2018, TSB attempted one of the largest and most challenging IT migrations: moving from a Lloyds Banking Group banking platform to a platform operated by SABIS, an IT provider owned by its new parent company, Sabadell. The migration resulted in one of the largest operational failures in financial services in the last decade, and TSB did not fully return to business as usual until 232 days after the main migration event.

The incident cost TSB £330 million, covering customer compensation, fraud and operational losses, additional resource and advisory costs, and waived overdraft fees and interest charges. In 2019, Slaughter and May published its independent review of the incident, an exercise that reportedly cost TSB a further £25 million.

At the end of 2022, the FCA and PRA announced fines of £29.8 million and £18.9 million respectively, bringing the total cost of the incident, including the £330 million of losses and the £25 million review, to over £400 million. Alongside this, TSB lost 80,000 customers in 2018, up 62.5% on the year before.

The FCA concluded that TSB had breached two of the FCA Principles: Principle 2, because the firm failed to exercise due skill, care and diligence in managing the outsourcing arrangements with, and services provided by, SABIS (TSB’s IT provider) appropriately and effectively; and Principle 3, because the firm failed to take reasonable care to organise and control the Migration Programme responsibly and effectively, or to implement adequate risk management systems.

The incident is cited by the regulators as one of the key drivers behind the Operational Resilience policies released by the PRA, FCA and Bank of England in 2021, which have resulted in large regulatory programmes for regulated financial institutions. In December 2022, the PRA and FCA released 100+ page reports to accompany the confirmation of the fines imposed on TSB.

So what learnings can other organisations take from the incident?

Overview for executives & boards

The TSB Board and Bank Executive Committee (BEC) members had the structures and processes in place to conduct their governance, oversight, monitoring and assurance duties. However, several overlapping and compounding failures of governance and culture jointly contributed to the failure of the main migration event (MME).

These are summarised below and examined further in the next section:

Pre-migration

Discussion and Challenge
There appears to have been insufficient discussion and challenge from the Board and BEC about the risks and dependencies associated with an ambitious data and technology migration to a new platform.

Pressure on Timings and Announcements
In September 2017, it was decided to delay and re-plan the migration, but nine days later, before the re-planning was complete, it was announced that the migration would take place in Q1 2018. This may have added implicit pressure to deliver to the new timescale, despite the Board’s previously expressed view that TSB would only migrate when ready.

Third-Party Assurance
The insufficiency of the Board’s challenge and discussion was exacerbated by a lack of adequate risk management and assurance over the capability, capacity and readiness of critical third parties (SABIS) and fourth parties to deliver the technology migration. Certain limited or qualified assurances were not drawn to the attention of the TSB Board.

Testing Lessons Identified Not Learned
Several lessons were identified following the initial migration programme setbacks, leading to the definition of 15 Guiding Principles to guide and test the re-plan. However, these were not implemented in full, and the decision to diverge from them was not escalated to a suitable governance level.

Test Activities – Arbitrary vs. Planned Activity
Test plans were based on a sequential set of activities that should have produced a layered understanding of any technology issues. However, as the Integrated Master Plan (IMP) and the Defender Plan fell behind schedule, the test approach (including critical aspects of non-functional testing) was modified arbitrarily. Several test activities were run in parallel to suit the impending timelines, with fundamental changes made to their scope and timing. A further lack of rigour arose from such decisions being made outside formal governance forums, which introduced key risks relating to the ‘Active-Active’ configuration of the data centres.

Risk and Programme Management Inadequacies
The IMP consistently fell behind schedule, and the underlying reasons for and risks of the delays were not acknowledged through to the point of the re-plan (known as the ‘Defender Plan’). This led to decisions that moved away from the key guiding principles of the programme. It also ties back to governance and culture at Board level, where a lack of sufficient challenge once again meant there was no in-depth understanding of the risks, the rationale for the delays, or whether the proposed plans were realistically achievable.

Post-migration

Incident Management
Whilst an incident management model was in place between TSB and SABIS, there was no joint testing of incident management at a BEC level. In preparation for the MME, the BEC completed 3 incident management exercises, but these only simulated a 48-hour disruption and did not offer the opportunity to explore the challenges of mitigating the impact of a multi-week incident.

Remediation vs. Treatment
The initial focus was on identifying and remediating technical issues, rather than on treating customer impacts. It took four days following the migration to set up a customer war room and overhaul the customer communications strategy.

Capacity of Workarounds
The planned workarounds and additional capacity for telephony and complaints were supposed to come from other teams within TSB. The aggregate impact of a ‘multiple organ’ failure scenario had not been considered, with these teams acting as the ‘plan B’ for multiple teams and services. This significantly reduced the overall capacity of the organisation to respond effectively.

Vulnerable Customers
There was a failure to identify and categorise vulnerable customers as part of business as usual, or to develop dedicated treatment strategies for these customers that could be invoked during an incident. Such efforts would have helped to reduce the number of tabloid front-page headlines focused on severe impacts to a minority of the customer base.

So, what can organisations learn from the incident?

  1. Whilst there will always be an inherent tension between budgets, operations, security and resilience, organisations should foster a culture of transparency and collaboration, where employees are encouraged to raise concerns and protected from the consequences of doing so. Put the customer at the centre of your strategy and encourage employees to challenge each other when decisions, both tactical and strategic, cannot be traced back to a customer benefit.
  2. Implement governance structures that facilitate robust challenge. Diversify Boards by recruiting Non-Executive Directors who align with your strategy, fill experience gaps and can ask the challenging questions. Identify resilience metrics that drive the right behaviours, and do not allow risk to be lost or censored as it is summarised for senior audiences. This is central to the due care, skill and diligence expected of company directors and officers.
  3. In line with the sentiment of the Operational Resilience Policies, organisations should expect and anticipate that incidents will happen; it is not a case of if, but when. This requires organisations to shift their mindset to consider not just risk but also resilience. Rather than focusing on reducing the likelihood of incidents and disruption, they should also invest in being able to respond and recover quickly.
  4. A failure of imagination when identifying the worst-case scenarios for your organisation may be costly. Many firms, when developing scenarios, do not assume operational outages longer than 24-48 hours, because they have no example of an internal incident that has exceeded this period. Look outside your organisation, to your competitors and to other sectors, for inspiration. Cyber-attacks that grind organisations to a halt for weeks, even months, do happen and are no longer unthinkable. The effects of climate change and heightened temperatures have caused data centre outages even for global Cloud providers. The lasting impact of COVID-19 lockdowns continues to cause significant supply chain challenges for all sectors. Economic headwinds, coupled with the impacts of Brexit and the pandemic, have put many small independent organisations out of business.
  5. Many organisations tend to focus on technical recovery during incidents, ultimately aiming to identify and rectify the root cause. Whilst this is a critical part of responding to an incident, organisations should also have colleagues focused on mitigating the customer impact, with an eye to impact tolerances and intolerable harm, especially in incidents that are complex and likely to cause extended disruption. This requires organisations to proactively identify the key data they would use to determine customer impact and to make this data readily available in an incident (see the illustrative sketch after this list). Proactive communications and treatment strategies are key to managing customer sentiment and reaction.
  6. Collaborative testing with critical suppliers provides a safe space to understand roles and responsibilities, reduce the number of assumptions made by both parties and identify gaps in response and recovery strategies. Traditional supplier assurance processes, focused on Business Continuity and Disaster Recovery tests, do not provide sufficient confidence in a supplier’s resilience, especially when that supplier provides multiple services to different parts of your organisation.
  7. Understanding critical assets should not stop at those critical for business as usual. Identify the third parties, technology and people critical to response and recovery. If they are not involved in the day-to-day running of your operations, ensure that precious time isn’t wasted during an incident getting individuals vetted, establishing (and testing) remote access for third parties, and putting contracts, including roles and responsibilities, in place.
  8. Embed scenario testing into existing Change processes and lifecycles. Functional, non-functional and user acceptance testing are critical and arguably the only way to validate performance, resilience, and capability. Whilst test environments can be both complex and costly, relying solely on a compliance-based approach focused on control testing in place of end-to-end testing will leave you exposed to unanticipated disruptions.
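
To make the idea in point 5 concrete, the sketch below shows one way an incident team could track a live incident against a pre-agreed impact tolerance for an important business service. It is a minimal, illustrative example only: the service name, thresholds and figures are hypothetical assumptions, not taken from TSB, SABIS or any regulatory publication, and a real implementation would draw on live complaints, telephony and channel telemetry.

```python
# Minimal, illustrative sketch (hypothetical names and thresholds throughout):
# compare an in-flight incident against an impact tolerance for an important
# business service, so customer impact is monitored alongside technical recovery.
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ImpactTolerance:
    service: str                  # important business service, e.g. "Online Banking"
    max_outage: timedelta         # maximum tolerable duration of disruption
    max_customers_affected: int   # maximum tolerable number of impacted customers


@dataclass
class IncidentSnapshot:
    service: str
    started_at: datetime
    customers_affected: int       # fed from complaints, telephony and channel data


def tolerance_breaches(tolerance: ImpactTolerance,
                       snapshot: IncidentSnapshot,
                       now: datetime) -> list[str]:
    """Return the tolerance dimensions currently breached by the incident."""
    breaches = []
    if now - snapshot.started_at > tolerance.max_outage:
        breaches.append("duration")
    if snapshot.customers_affected > tolerance.max_customers_affected:
        breaches.append("customer impact")
    return breaches


if __name__ == "__main__":
    # Hypothetical figures for illustration only.
    tolerance = ImpactTolerance("Online Banking", timedelta(hours=12), 50_000)
    snapshot = IncidentSnapshot("Online Banking",
                                started_at=datetime(2024, 1, 1, 6, 0),
                                customers_affected=120_000)
    print(tolerance_breaches(tolerance, snapshot, now=datetime(2024, 1, 1, 21, 0)))
    # -> ['duration', 'customer impact']
```

The value is not in the code itself but in the discipline it represents: agreeing the thresholds and the supporting data feeds before an incident, so that breaches of tolerance are visible to decision-makers in hours rather than days.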

Want to speak to us?

If you would like to discuss a cyber or resilience problem with a member of the team, then please get in touch however you feel most comfortable. We would love to help you and your business prepare to bounce back stronger.