Delayed Heap SQL syncs
Incident Report for Heap
Postmortem

Last week we experienced Redshift sync delays across all customers. This was due to three separate, unrelated issues occurring in short succession.

The first issue occurred last Sunday into Monday. We are in the process of moving access between our various cloud services from key-based credentials to instance role-based access. Last weekend, we migrated our Redshift export servers to use role-based access when uploading data to S3. When the switchover was executed, the role assigned to our Redshift servers did not have the requisite permissions to write to the relevant S3 buckets. This caused all uploads to fail and syncs to become backlogged for all of our customers. This was not detected in our test suite, as our Redshift tests do not cover authentication between our various cloud services.
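
As an illustration, a pre-cutover check along the following lines would have surfaced the missing permission before any customer syncs were affected. This is a minimal sketch in TypeScript using the AWS SDK for JavaScript; the bucket name and object key are placeholders rather than our actual configuration.

```typescript
// Pre-cutover check: confirm the instance role attached to this server can
// actually write to the export bucket. Bucket and key are placeholders.
import * as AWS from 'aws-sdk';

const s3 = new AWS.S3(); // credentials resolve from the instance role via the default chain

async function verifyExportBucketAccess(bucket: string): Promise<void> {
  try {
    await s3
      .putObject({
        Bucket: bucket,
        Key: 'access-checks/redshift-export-canary',
        Body: 'ok',
      })
      .promise();
    console.log(`Write access to ${bucket} confirmed`);
  } catch (err) {
    // An AccessDenied error here means the role is missing s3:PutObject on
    // this bucket, and the switchover should not proceed.
    throw new Error(`Instance role cannot write to ${bucket}: ${err}`);
  }
}

verifyExportBucketAccess('example-redshift-export-bucket').catch((err) => {
  console.error(err);
  process.exit(1);
});
```

Because the SDK's default credential chain picks up instance-role credentials automatically, running a check like this from the migrated export servers exercises exactly the role the cutover switched to.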

The second issue occurred last Tuesday. To improve query performance for our customers, we are making some significant changes to our data storage schema that require rebuilding a large fraction of our indexes. Some of the queries we run during Redshift syncs share a database connection pool with our index-building workers. Last Tuesday, the increased volume of index builds in that pool caused queries to queue up. The wait time for these queries reached several minutes, and since we run many of them on each sync (on the order of thousands for some customers), syncs took significantly longer than normal and a backlog built up. To address the issue and prevent it from recurring, we moved the sync queries into a separate connection pool.
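
To illustrate the fix, the sketch below separates the two workloads into their own pools using node-postgres; the pool sizes and timeout are hypothetical, not our production values.

```typescript
// Separate connection pools so long-running index builds cannot starve the
// queries issued during Redshift syncs. Sizes and timeouts are illustrative.
import { Pool } from 'pg';

// Pool dedicated to queries run as part of Redshift syncs.
const syncPool = new Pool({
  max: 20,
  connectionTimeoutMillis: 5000, // fail fast instead of queueing for minutes
});

// Smaller, isolated pool for index-rebuild workers.
const indexBuildPool = new Pool({
  max: 4,
});

export async function runSyncQuery(sql: string, params: unknown[] = []) {
  // Sync queries only ever contend with other sync queries.
  return syncPool.query(sql, params);
}

export async function runIndexBuild(sql: string) {
  // Index builds draw from their own pool and cannot block syncs.
  return indexBuildPool.query(sql);
}
```

Keeping the pools separate bounds the blast radius: a surge of index builds can exhaust only its own connections, and sync queries fail fast in their own pool instead of queueing behind long-running index work.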

The third issue occurred last Thursday. We upgraded the version of node.js that we use across our infrastructure. The new version of node was incompatible with the data encoding libraries we use on our Redshift servers and caused segmentation faults on a significant fraction of sync attempts. This caused our system to fail to recognize completed syncs, which prevented any new syncs from proceeding. The packaging repository we used at the time didn't easily allow for library downgrades, so it took our engineering team several hours to change packaging systems and revert to the previous version of node.js.
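
One lightweight safeguard against this class of failure (a hedged sketch, not necessarily what we ultimately shipped) is a startup guard that refuses to boot on a node version that hasn't been validated against the encoding libraries; the version range below is illustrative.

```typescript
// Startup guard: refuse to boot on a node version outside the range our
// encoding libraries have been validated against. The range is illustrative.
import * as semver from 'semver';

const SUPPORTED_NODE_RANGE = '>=6.9.0 <7.0.0'; // hypothetical tested range

if (!semver.satisfies(process.version, SUPPORTED_NODE_RANGE)) {
  console.error(
    `node ${process.version} has not been validated against our encoding ` +
      `libraries (expected ${SUPPORTED_NODE_RANGE}); refusing to start.`
  );
  process.exit(1);
}
```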

We understand that these delays are unacceptable, so we're making the following changes to prevent these issues going forward:

  • Adding new integration tests that cover access between our servers, S3, and Redshift, as well as our compression/encoding libraries (a sketch of one such test follows this list).
  • Modifying our package installation system to allow easy downgrades to previous versions of node.
  • Improving our monitoring and alerting systems to better detect systemic Redshift sync issues affecting our entire customer base.
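
For the first item above, here is a sketch of the kind of integration test we mean for the compression/encoding path: round-trip a sample payload and assert it decodes exactly, so that a runtime or library upgrade that breaks encoding fails in CI rather than in production. Node's built-in zlib stands in for our actual encoding libraries, and the sample payload is made up.

```typescript
// Example integration test: round-trip a sample payload through the same
// kind of compression/encoding path used during Redshift exports. A broken
// runtime or library upgrade should fail this test in CI before deploy.
import * as assert from 'assert';
import * as zlib from 'zlib';

function encode(rows: object[]): Buffer {
  return zlib.gzipSync(Buffer.from(JSON.stringify(rows), 'utf8'));
}

function decode(blob: Buffer): object[] {
  return JSON.parse(zlib.gunzipSync(blob).toString('utf8'));
}

const sampleRows = [{ user_id: 1, event: 'pageview', time: '2017-07-13T08:44:00Z' }];

assert.deepStrictEqual(decode(encode(sampleRows)), sampleRows);
console.log('encoding round-trip OK on node', process.version);
```
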
Posted Jul 21, 2017 - 14:32 PDT

Resolved
This incident has been resolved.
Posted Jul 13, 2017 - 20:55 PDT
Investigating
Heap SQL syncs are experiencing delays for some customers. Our data engineering team is investigating, and we'll post an update when syncs return to their expected cadence.
Posted Jul 13, 2017 - 08:44 PDT