Elevated Endpoint Errors
Incident Report for Heap
Postmortem

Summary

Between October 5th and October 8th, we experienced periodic elevated error rates on our data collection endpoint. As a result, you may notice a small percentage of events missing from your Heap account during this window. Please refer to the Details section below for the extent of errors over this time period.

Our engineering team is in the process of addressing the root cause of this issue and we will update our status page with a more detailed postmortem when a complete solution is in place.

We understand that you rely on Heap to capture all your customer data and we sincerely apologize for any issues this may cause you. As mentioned, once the issue is fully resolved we'll publish a more thorough response that includes the measures being put in place to prevent this type of issue in the future. In the short term, please don't hesitate to reach out to support@heapanalytics.com if you have any questions about how this incident may have affected your dataset or if there's anything else we can help out with!

Details

Between 2017-10-05 and 2017-10-08 (UTC), we experienced periodic elevated error rates in our data collection endpoint. The immediate cause appeared to be that our data collection services were running out of CPU, which is highly uncharacteristic of these services. We've since put a temporary patch in place, which has reduced the error rate to almost zero, though small spikes in errors are still occurring. We're currently working on a permanent solution to address any outstanding errors and prevent this problem in the future.

The specific error rates detailed below should be interpreted as an upper bound in the aggregate. In practice, the proportion of missing data is often lower due to retry logic in place for many requests: a failed request is often retried successfully, and a single request may fail multiple times in the process of retrying, so each of those failures is counted as an error even though only one event is at stake.
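To illustrate why the per-request error rate is an upper bound on data loss, here is a minimal sketch in Python. It assumes that each attempt fails independently with the same probability and that an event is lost only if every attempt fails. Both the function name and the retry counts are hypothetical; this report does not state how many times clients retry, and failures during a sustained CPU shortage are unlikely to be fully independent, so the real loss probably lies somewhere between the retried and un-retried figures.

```python
def effective_loss_rate(per_attempt_error_rate: float, max_retries: int) -> float:
    """Fraction of requests lost after all retries are exhausted.

    Assumes every attempt fails independently with the same probability
    (an illustrative simplification, not a measured property of the incident).
    """
    if not 0.0 <= per_attempt_error_rate <= 1.0:
        raise ValueError("per_attempt_error_rate must be between 0 and 1")
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    # A request is lost only if the first attempt and every retry all fail.
    return per_attempt_error_rate ** (max_retries + 1)


if __name__ == "__main__":
    # Error rates taken from the report; retry counts are hypothetical.
    windows = [("Thu 2017-10-05", 0.057), ("Fri 2017-10-06", 0.033)]
    for label, rate in windows:
        for retries in (0, 1, 2):
            loss = effective_loss_rate(rate, retries)
            print(f"{label}: {retries} retries -> {loss:.4%} of requests lost")
```

With zero retries the loss equals the reported error rate; each additional retry multiplies it by the error rate again under the independence assumption.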

Thursday 10/05 - Friday 10/06

Over these two days, we observed periodic and sustained collector errors. Approximately 5.7% of requests between 16:20 and 22:00 on Thursday 2017-10-05, and 3.3% of requests between 13:30 and 21:00 on Friday 2017-10-06, failed with 5xx status codes.

Saturday 10/07 - Sunday 10/08

Over the weekend, we observed spikes in 5xx statuses every 30-60 minutes. Approximately 1% of requests failed on Saturday, mostly between 19:05 and 19:20. These spikes were successfully mitigated by manually triggering rolling restarts of collection services, which were subsequently automated. This reduced the share of requests returning 5xx statuses to under 0.005% over the 12 hours from 03:00 to 15:00 on 2017-10-09.
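For readers unfamiliar with the term, a rolling restart replaces instances a few at a time so that the rest keep serving traffic. The following Python sketch shows the general pattern only; it is not a description of how Heap's restarts were implemented. The `restart` and `is_healthy` callables, the instance names, and all parameters are hypothetical placeholders.

```python
import time
from typing import Callable, List, Sequence


def rolling_restart(
    instances: Sequence[str],
    restart: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    batch_size: int = 1,
    poll_interval: float = 0.0,
    max_polls: int = 10,
) -> List[str]:
    """Restart instances one batch at a time, waiting for each batch to recover.

    Returns the instances restarted, in order. Raises RuntimeError if an
    instance does not report healthy within max_polls checks, which stops
    the rollout before any further instances are taken down.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    restarted: List[str] = []
    for start in range(0, len(instances), batch_size):
        batch = list(instances[start:start + batch_size])
        for instance in batch:
            restart(instance)
        for instance in batch:
            for _ in range(max_polls):
                if is_healthy(instance):
                    break
                time.sleep(poll_interval)
            else:
                raise RuntimeError(f"{instance} did not become healthy")
        restarted.extend(batch)
    return restarted


if __name__ == "__main__":
    # Stand-in callables; a real setup would call an orchestration API.
    order = rolling_restart(
        ["collector-1", "collector-2", "collector-3"],
        restart=lambda name: print(f"restarting {name}"),
        is_healthy=lambda name: True,
    )
    print(order)
```

Keeping the batch size small relative to the fleet limits how much capacity is offline at any moment, at the cost of a slower rollout.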

Sunday 10/08 - Today

Small spikes in error rates have continued to occur regularly over the past week. In aggregate, these errors represent less than 0.005% of total requests to Heap. Our engineering team is still in the process of addressing the root cause of this issue and we plan to update this postmortem as future changes are made.

Posted Oct 17, 2017 - 08:38 PDT

Resolved
All clear!
Posted Oct 08, 2017 - 15:13 PDT
Investigating
Our collector endpoints are experiencing elevated error rates and latency. Our engineering team is investigating.
Posted Oct 08, 2017 - 11:37 PDT