Today, we experienced an outage from 4:15 am to 8:47 am PST. Heap dashboards were unavailable for all customers; heap.identify calls failed; a small portion of event data was lost. We are still conducting a thorough investigation, but preliminary findings suggest that less than 2% of event data was lost.
The immediate cause was a central database server that became unavailable after its primary disk filled up. This database stores the information that powers Heap dashboards, so application servers became unavailable as well. It also stores information about your end users that is tracked by Heap and used by the client-side heap.identify and heap.addUserProperties APIs, as well as by the server-side track and addUserProperties API calls. As a consequence, calls to those APIs failed as well. Additionally, the large volume of failing API calls caused follow-on instability in our web-facing data collection services, causing them to drop a small portion of event data, too.
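For context, a server-side track call looks roughly like the sketch below. The endpoint and payload shape here are illustrative assumptions, not details from this post; during the outage, calls like this failed because the user data behind them lives in the affected database.

```python
# Illustrative sketch of a server-side track call. The endpoint, field
# names, and values are assumptions for illustration only.
import requests

resp = requests.post(
    "https://heapanalytics.com/api/track",  # assumed endpoint
    json={
        "app_id": "YOUR_APP_ID",            # hypothetical placeholder
        "identity": "user@example.com",
        "event": "Purchase",
        "properties": {"amount": 10},
    },
    timeout=5,
)
resp.raise_for_status()  # raised for the duration of the outage
```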
Over the past two weeks, we have been rolling out encrypted backups to all customers to further enhance our data security.
On Tuesday, January 17th, we rolled out this change to the central database server affected in today’s outage. This central database is configured differently from our other database servers: specifically, its database process runs as a different unix user. Our configuration management code did not take this into account, so it deployed a key to this machine that was not readable by the user running the database process.
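A minimal sketch of how a bug like this can look, assuming a hypothetical deploy step that hardcodes the usual database user (all names here are invented for illustration):

```python
# Hypothetical config-management sketch: the key's owner is hardcoded,
# but on this one machine the database process runs as a different
# unix user, so the deployed key ends up unreadable by the database.
import os
import pwd
import shutil

DB_USER = "postgres"  # assumption baked into the deploy code

def deploy_backup_key(src: str, dest: str) -> None:
    """Copy the encryption key into place and lock down its permissions."""
    shutil.copy(src, dest)
    uid = pwd.getpwnam(DB_USER).pw_uid  # wrong user on this host
    os.chown(dest, uid, -1)             # key owned by the wrong account
    os.chmod(dest, 0o600)               # ...and readable only by that account
```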
This key is used in two different contexts: 1) encrypting base database backups, which happen every night, and 2) encrypting write-ahead-log segments, which are archived throughout the day. (This configuration gives us the ability to restore the database to any particular point in time after the initial base backup.) We manually tested that a base backup ran smoothly on this machine, but not the process that encrypts and archives the write-ahead-log segments. In fact, it was failing silently: instead of encrypting write-ahead-log segments and archiving them to S3, it was leaving them on disk. After about 72 hours of this, the machine’s disk filled completely and the database shut down.
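For illustration, an archive script of the kind described might look like the sketch below, assuming GPG encryption and an upload via the aws CLI (the post does not name the actual tooling; the key id and bucket are invented). The relevant Postgres behavior is that when archive_command exits nonzero, Postgres keeps the segment on disk and retries later, which is exactly how unarchived segments accumulate until the disk fills.

```python
# Sketch of a WAL archive script, assuming GPG + the aws CLI.
# If the deployed key is unreadable, gpg exits nonzero, this script
# exits nonzero, and Postgres silently retains the segment on disk.
import subprocess
import sys

def archive_wal(wal_path: str, wal_name: str) -> int:
    """Encrypt one WAL segment and ship it to S3; names are hypothetical."""
    encrypted = f"/tmp/{wal_name}.gpg"
    try:
        # Encrypt with the deployed key; fails if the key is unreadable.
        subprocess.run(
            ["gpg", "--batch", "--yes", "--encrypt",
             "--recipient", "backups@example.com",  # hypothetical key id
             "--output", encrypted, wal_path],
            check=True,
        )
        subprocess.run(
            ["aws", "s3", "cp", encrypted,
             f"s3://example-wal-archive/{wal_name}.gpg"],  # hypothetical bucket
            check=True,
        )
    except subprocess.CalledProcessError:
        # Nonzero exit tells Postgres archiving failed; it keeps the
        # segment on disk and retries, so failures pile up silently.
        return 1
    return 0

if __name__ == "__main__":
    # Postgres would invoke this via archive_command, e.g.
    #   archive_command = 'archive_wal.py %p %f'
    sys.exit(archive_wal(sys.argv[1], sys.argv[2]))
```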
A longer-term follow-up is to remove this central database as a dependency of heap.identify calls and the server-side API. This is a complicated project that will take a bit longer to finish.

In addition to technical follow-ups, there were process failures that made this outage significantly worse. There have been two process follow-ups: