Banking Outages


#1

With TSB experiencing online and mobile banking outages since Friday after migrating away from the systems of Lloyds Banking Group, I am interested to hear from Starling: what would happen if AWS had a major outage, and how would they handle it?


#2

Who is AWS, Joe?


#3

Amazon Web Services


#4

@JayTay beat me to it


#5

And there was I thinking it meant Abyss Web Server or Authenticated Web Server.


#6

Well, their reliability is above 99%, so at most they would be down for hours and not days.

All banks will have downtime, but because the legacy banks use legacy systems the downtime is usually a lot worse or, in this case, a complete disaster.
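
On the 99% figure: as a rough sketch of the arithmetic, an availability percentage can be turned into a yearly downtime allowance like this. The function name and the percentages tried are purely illustrative and are not taken from any AWS SLA.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def max_downtime_hours(availability_percent, period_hours=HOURS_PER_YEAR):
    """Return the downtime (in hours) permitted over a period at a given availability."""
    return period_hours * (1 - availability_percent / 100)


# Illustrative availability levels only (assumptions, not quoted SLA figures).
for pct in (99.0, 99.9, 99.99):
    hours = max_downtime_hours(pct)
    print(f"{pct}% availability allows about {hours:.2f} hours of downtime per year")
```

Exactly 99% works out at roughly 87.6 hours a year, which is over three and a half days, so "hours and not days" only holds if the real figure is well above 99%.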


#7

Having worked on a number of “legacy” systems I have to disagree. The big RBS outage in 2012 was a software update gone wrong and the current TSB issue was due to the migration to Sabadell’s own software platform. The Lloyds outage early last year was down to DDoS attacks.

AWS is not immune to outages either, and of course Monzo had problems late last year, so it’s not out of the question that Starling could fall victim to any one of the above!


#8

If anyone is curious, this is the Service Level Agreement:

https://aws.amazon.com/compute/sla/


#9

You might be interested in this, if you haven’t seen it already.


#10

It really depends on which AWS service you are on about. In general, over the last 5 years the two major outages on AWS were on ECS, and both were down to human error. Minor instances never tend to last for long, but that’s the nature of cloud services.

Financial services generally use EC2 and EMR, amongst others, which have never had a major outage in Europe in AWS history. They have had issues in North America, but none have been major outages. Amazon have specific services and processes in place for failure injection for the financial sector.

Capital One has been using AWS for years; it’s not had a major outage in all that time in Europe, although again in the USA it has. There is another big provider that uses AWS, but I can’t remember them off the top of my head.


#11

Are you suggesting a backup on Azure, or another alternative?


#12

Actually, that is a good idea.

Lloyds Bank, for example, have two physically separate data centres, as I understand it.

At the business launch event I asked about this and was advised that they have multiple AWS instances in multiple locations. I would think, though, that to create true redundancy you should have a live AWS environment and a backup in either Azure or Google Cloud Platform. That would be the closest approximation to multiple data centres.

Just my $0.02 and not a criticism :slight_smile:


#13

Multi-cloud should be the way to go. Multiplay did a very interesting talk (I believe this one?) about how they use multiple cloud providers as well as bare metal together to reduce costs and provide high availability.

I would be pretty keen for something as important as my bank to be thinking about these sorts of solutions.


#14

Hi @Joe_Merriman,

Good question, and sorry for my delay in responding. I’m really pleased you posted one of Greg’s talks, as it’s definitely better to hear it from the horse’s mouth, so to speak.

In a brief response to your original question, AWS have experienced occasional outages (as @JayTay says). We run all our services in multiple availability zones (data centres) simultaneously, each handling a share of the traffic. If one building were to experience issues, the others would automatically take the load. This includes the databases, which will fail over automatically.

It would be rare to have a full region outage (all 3 data centres). If it were to occur, it would affect us, but we would be able to restore service from another region. We also already run some services outside of AWS (in Google Cloud Platform), and over time more services will be distributed across multiple hosting providers, so the impact and recovery time will be reduced.
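
For anyone who finds it easier to read as code, the availability-zone behaviour described above can be sketched very roughly as follows. This is only a toy model under stated assumptions: the zone names, the even traffic split and the `traffic_shares` helper are all hypothetical and say nothing about how the real load balancing or database failover is implemented.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Zone:
    """A hypothetical availability zone (one data centre)."""
    name: str
    healthy: bool = True


def traffic_shares(zones: List[Zone]) -> Dict[str, float]:
    """Split traffic evenly across healthy zones; unhealthy zones receive none."""
    healthy = [z for z in zones if z.healthy]
    if not healthy:
        # Every zone is down: the "full region outage" case, where service
        # would have to be restored from another region instead.
        raise RuntimeError("no healthy zones in this region")
    share = 1 / len(healthy)
    return {z.name: (share if z.healthy else 0.0) for z in zones}


# Zone names are made up for illustration.
zones = [Zone("zone-a"), Zone("zone-b"), Zone("zone-c")]
print(traffic_shares(zones))   # all three zones share the traffic

zones[0].healthy = False       # simulate one building having issues
print(traffic_shares(zones))   # the remaining two zones take the load
```

The point of the sketch is simply that losing one zone changes the split rather than stopping service; only losing every zone at once forces a recovery elsewhere.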

I’m adding @steve and @sam to answer any more questions on this front.

Sarah


#15

Thank you @sarah.guha :+1:t2: If @sam and @steve want to geek :nerd_face: us out more, please do :slight_smile:


#16

Is everything under the starlingbank.com domain name in terms of infrastructure and customer-facing services?

A potential SPOF is if Cloudflare goes down. The No-IP debacle caused by the Premier League, the recent Route 53 BGP hijack and other DNS provider outages mean that anyone relying on a single provider for authoritative DNS hasn’t learnt from those that have had (avoidable) major outages.

Registry:

; <<>> DiG 9.9.5-9+deb8u15-Debian <<>> -tns starlingbank.com @d.gtld-servers.net
...
starlingbank.com.       172800  IN      NS      lily.ns.cloudflare.com.
starlingbank.com.       172800  IN      NS      thomas.ns.cloudflare.com.

Authoritative:

; <<>> DiG 9.9.5-9+deb8u15-Debian <<>> -tns starlingbank.com
...
;; ANSWER SECTION:
starlingbank.com.       86400   IN      NS      lily.ns.cloudflare.com.
starlingbank.com.       86400   IN      NS      thomas.ns.cloudflare.com.

With a 48-hour TTL (172800 seconds) for the NS records at the registry, if the app relies on your domain name being resolvable, there is the potential for a third party to cause a 48-hour outage for many customers (even assuming you add a secondary provider the instant the outage occurs).

Not to mention your e-mail system failing with inbound e-mail bouncing because the MX record can’t be looked up and outbound e-mail bouncing because the DKIM/DMARC/SPF records can’t be looked up.
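
As a small sketch of the check behind the points above, the registry answer can be parsed to see how many distinct DNS providers serve the zone and how long the NS records may be cached. The record lines are copied from the dig output in this post; the helper names and the "last two labels" rule for guessing the provider are my own assumptions and would be wrong for suffixes such as co.uk.

```python
# NS lines copied from the registry answer quoted in this post.
DIG_OUTPUT = """\
starlingbank.com.       172800  IN      NS      lily.ns.cloudflare.com.
starlingbank.com.       172800  IN      NS      thomas.ns.cloudflare.com.
"""


def parse_ns_records(text):
    """Return (ttl_seconds, nameserver) pairs for every NS line in dig-style output."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        # Expected layout: name, TTL, class, type, target.
        if len(fields) == 5 and fields[2] == "IN" and fields[3] == "NS":
            records.append((int(fields[1]), fields[4].rstrip(".").lower()))
    return records


def provider_domain(nameserver):
    """Naive provider guess: the last two labels of the nameserver host name."""
    return ".".join(nameserver.split(".")[-2:])


records = parse_ns_records(DIG_OUTPUT)
providers = {provider_domain(ns) for _, ns in records}
max_ttl_hours = max(ttl for ttl, _ in records) / 3600

print(f"providers: {sorted(providers)}")
print(f"longest NS TTL: {max_ttl_hours:.0f} hours")
if len(providers) == 1:
    print("all authoritative name servers belong to a single provider")
```

A second provider would show up here as a second entry in `providers`, which is the situation I am arguing for.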


#17

I have literally no idea what you just said!! :thinking:


#18

Maybe join the developer Slack channel. Just a thought.


#19

Isn’t Cloudflare one of the biggest providers of managed DNS services in the world? I’m sure the two nameservers are globally diverse.

Nice digging though :slight_smile:


#20

Cloudflare are one of the largest DNS providers in the world, so a complete outage of both name servers would be unlikely. If the lily NS were to go down, the thomas NS would take the load of performing DNS lookups.

If the proxy service were to go down, Starling could simply disable it while retaining the Cloudflare NS. For any real impact on service, both of those NS would need to be knocked out, which is highly unlikely.