Key Takeaways from the AWS Outage

Today’s major AWS outage, centered on DNS resolution and an underlying internal subsystem responsible for monitoring the health of EC2 network load balancers, took many major services down and directly impacted end users. Remediation was delayed because support case creation through the AWS Support Center and the Support API was also affected. Failures like this remind us that in cloud-native software, the difference between spinning up a “working” system and architecting an enterprise solution comes down to the pillars of resiliency, availability, operational excellence and security. Cloud-native systems are immensely powerful, but they can only be leveraged effectively if you build and architect with intent.

Some of my key takeaways from this event:

  1. Avoid single points of failure
    • This issue impacted a single AWS region, so workloads confined to that region had nowhere to fail over.
  2. “Hope for the best, prepare for the worst”
    • Moving to the cloud does NOT mean you can stop worrying about availability. Don’t assume that because something runs in the cloud it will always “just” work.
  3. Utilize failovers, backups and quick communication
    • Even something as simple as alerting users goes a long way toward improving the user experience. The AWS Health Dashboard is one example: it brings transparency to service health and its history in near real time, so users stay informed and can take immediate action.
  4. Have clear incident playbooks for remediation and regularly practice DR events by conducting game days
    • At a minimum, you should be creating multi-AZ (availability zone) solutions so that your applications don’t rely on a single data center as a point of failure.
    • A multi-AZ solution might not have prevented application failure in this case, since the entire region failed, but it would still help: your application could take advantage of AZs as they come back online and therefore achieve a lower MTTR (mean time to recovery).
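
To make the failover idea concrete, here is a minimal Python sketch of picking the first healthy endpoint from an ordered list. Everything in it is an assumption for illustration: the endpoint URLs, the `pick_endpoint` helper, and the `fake_probe` health check are hypothetical, and the probe only simulates a DNS failure instead of making a real network call. Treat it as a sketch of the pattern, not a production implementation.

```python
from typing import Callable, Iterable, Optional


def pick_endpoint(
    endpoints: Iterable[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first endpoint whose health probe succeeds, else None."""
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # A probe that raises (timeout, DNS failure) counts as unhealthy.
            continue
    return None


# Hypothetical endpoints, ordered by preference: primary region first.
ENDPOINTS = [
    "https://api.region-a.example.com",
    "https://api.region-b.example.com",
]


def fake_probe(endpoint: str) -> bool:
    # Stand-in for a real HTTP health check (assumption):
    # simulates DNS resolution failing in the primary region.
    if "region-a" in endpoint:
        raise OSError("simulated DNS resolution failure")
    return True


chosen = pick_endpoint(ENDPOINTS, fake_probe)
print(chosen)
```

In a real system the probe would likely be an HTTP request with a short timeout, and a `None` result would be the trigger for alerting users and starting the incident playbook.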

Software engineering and architecture are not just the lines of code we write, but the strength of the systems we build.