Temporary Dashboard Unavailability
Postmortem

Outage Postmortem – January 20, 2017

Overview

Today, we experienced an outage from 4:15 am to 8:47 am PST. Heap dashboards were unavailable for all customers, heap.identify calls failed, and a small portion of event data was lost. We are still conducting a thorough investigation, but preliminary findings suggest the loss affected less than 2% of event data.

The immediate cause was a central database server that became unavailable after its primary disk filled up. This database stores the information that powers Heap dashboards, so application servers became unavailable as well. It also stores information about your end users that is tracked by Heap and used by the client-side heap.identify and heap.addUserProperties APIs, as well as the server-side track and addUserProperties API calls. As a consequence, calls to those APIs failed as well. Additionally, the large number of failing API calls caused follow-on instability in our web-facing data collection services, causing them to drop a small portion of event data, too.

Technical Details

Over the past two weeks, we have been rolling out encrypted backups to all customers to further enhance our data security.

On Tuesday, January 17th, we rolled out this change to the central database server affected in today’s outage. This central database has a different configuration than our other database servers. Specifically, its database process runs as a different unix user. Our configuration management code did not take this into account, so it deployed a key to this machine that was not readable by the user running the database process.
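
The underlying class of bug is that a file can be present on a machine while still being unreadable by the process that needs it. As an illustration only (the key path and database user below are hypothetical, not taken from this report), a configuration management run could verify readability explicitly with a check along these lines:

    import grp
    import os
    import pwd
    import stat
    import sys

    # Hypothetical values for illustration; the real key path and database
    # user on the affected machine are not described in this report.
    KEY_PATH = "/etc/heap/backup-encryption.key"
    DB_USER = "postgres"

    def user_can_read(path, username):
        """Return True if `username` can read `path` via the owner, group, or other bits."""
        st = os.stat(path)
        user = pwd.getpwnam(username)

        if st.st_uid == user.pw_uid:
            return bool(st.st_mode & stat.S_IRUSR)

        # The user's primary group plus any supplementary groups.
        group_ids = {user.pw_gid}
        group_ids.update(g.gr_gid for g in grp.getgrall() if username in g.gr_mem)
        if st.st_gid in group_ids:
            return bool(st.st_mode & stat.S_IRGRP)

        return bool(st.st_mode & stat.S_IROTH)

    if __name__ == "__main__":
        if not user_can_read(KEY_PATH, DB_USER):
            print("%s is not readable by %s" % (KEY_PATH, DB_USER), file=sys.stderr)
            sys.exit(1)
        print("%s is readable by %s" % (KEY_PATH, DB_USER))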

This key is used in two different contexts: 1) encrypting base database backups, which run every night, and 2) encrypting write-ahead-log segments, which are archived continuously throughout the day. (This configuration gives us the ability to restore the database to any particular point in time after the initial base backup.) We manually tested that a base backup ran smoothly on this machine, but not the process that encrypts and archives write-ahead-log segments. In fact, that process was failing silently: instead of encrypting write-ahead-log segments and archiving them to S3, it was leaving them on disk. After about 72 hours of this, the machine’s disk filled completely and the database shut down.
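
The base backup plus write-ahead-log scheme described above matches the continuous archiving model that PostgreSQL exposes through its archive_command hook; the report does not name the database software, so treat the details below as an assumption. In that model, the database runs the archive command once per completed segment, and a non-zero exit causes the segment to be kept on local disk and retried later, which is exactly how a persistent failure such as an unreadable encryption key can fill a disk. A minimal sketch of such an archive script, with hypothetical key and bucket names:

    #!/usr/bin/env python3
    """Illustrative archive script: encrypt one write-ahead-log segment and upload it to S3.

    In PostgreSQL this would be wired up roughly as:
        archive_command = '/usr/local/bin/archive_wal.py %p %f'
    A non-zero exit tells the database the segment was NOT archived, so it keeps
    the file locally and retries, consuming disk space until the failure is fixed.
    """
    import os
    import subprocess
    import sys
    import tempfile

    # Hypothetical values for illustration only.
    GPG_RECIPIENT = "backups@example.com"
    S3_BUCKET = "s3://example-wal-archive"

    def archive(segment_path, segment_name):
        with tempfile.TemporaryDirectory() as tmp:
            encrypted = os.path.join(tmp, segment_name + ".gpg")
            # Encrypt the segment; this is the step that fails if the key
            # material on the machine is unusable or unreadable.
            subprocess.run(
                ["gpg", "--batch", "--yes", "--recipient", GPG_RECIPIENT,
                 "--output", encrypted, "--encrypt", segment_path],
                check=True,
            )
            # Upload the encrypted segment to S3.
            subprocess.run(
                ["aws", "s3", "cp", encrypted, S3_BUCKET + "/" + segment_name + ".gpg"],
                check=True,
            )

    if __name__ == "__main__":
        try:
            archive(sys.argv[1], sys.argv[2])
        except (subprocess.CalledProcessError, OSError) as exc:
            # Exiting non-zero makes the database retain the segment and retry.
            print("archiving failed: %s" % exc, file=sys.stderr)
            sys.exit(1)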

Follow-Up Actions

  • We have fixed the permissions issue that caused write-ahead-log segments to build up on the affected database machine.
  • We have configured disk space alerts on this machine and on one other machine that was missing them. In the future, we will be paged well before the disk on this machine fills up (a sketch of this kind of check appears after this list).
  • Longer term, we have a project in flight that will make the Heap dashboard robust to failures at this layer. Had this project been finished, Heap dashboards would have remained available and fully functional for the duration of this outage. An engineer is currently working on this, and we expect to finish it during Q1.
  • Also longer term, we have a project in flight that will make data collection robust to failures at this layer. Had this project been finished, data collection, including heap.identify calls and the server-side API, would not have been affected at all by this outage. This is a more complicated project that will take somewhat longer to finish.
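
The disk space alerting mentioned in the second item above can be as simple as a periodic check that pages once usage crosses a threshold. The sketch below uses only Python’s standard library; the mount point, threshold, and paging behavior are hypothetical stand-ins for whatever monitoring pipeline is actually in place:

    import shutil
    import sys

    # Hypothetical values; the real mount point, threshold, and paging
    # integration are not described in this report.
    MOUNT_POINT = "/var/lib/postgresql"
    ALERT_THRESHOLD = 0.80  # page once the filesystem is 80% full

    def disk_usage_fraction(path):
        """Fraction of the filesystem containing `path` that is currently used."""
        usage = shutil.disk_usage(path)
        return usage.used / usage.total

    def main():
        used = disk_usage_fraction(MOUNT_POINT)
        if used >= ALERT_THRESHOLD:
            # In a real deployment this would page the on-call engineer via a
            # monitoring or paging API; here we just report and exit non-zero.
            print("%s is %.0f%% full" % (MOUNT_POINT, used * 100), file=sys.stderr)
            return 1
        print("%s is %.0f%% full; below threshold" % (MOUNT_POINT, used * 100))
        return 0

    if __name__ == "__main__":
        sys.exit(main())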

In addition to the technical follow-up, there were process failures that made this outage significantly worse. We have made two process changes in response:

  • Heap’s infrastructure team includes members who work around the globe. Our on-call rotation is organized so that alerts that fire overnight in the US initially go to engineers who are already awake, before waking up engineers in US time zones. Last week, we reorganized this rotation, and one consequence was that US engineers became more likely to be woken up if an engineer in another time zone was paged and didn’t know what to do. We did not properly communicate to US engineers that they should prepare for this possibility, and as a result two of them were not reachable this morning. This expectation has since been made much clearer and has been added to our onboarding materials.
  • Until this Monday, we had a tool that made it easy for engineers in any time zone to page each other directly if they were on call and didn’t know what to do. Unfortunately, that tool went offline on Monday because it was linked to the Slack account of a departing employee. We will have it up and running again ASAP.
Posted Jan 20, 2017 - 18:33 PST

Resolved
This incident has been resolved.
Posted Jan 20, 2017 - 12:04 PST
Monitoring
The dashboard is now available for all users. Our infrastructure team is monitoring.
Posted Jan 20, 2017 - 08:56 PST
Investigating
Some of our application servers are experiencing unexpected downtime. Our engineering team is investigating.
Posted Jan 20, 2017 - 05:37 PST