Temporary Data Collection Downtime

Incident Report for Heap

Postmortem

On Wednesday afternoon, Heap experienced severe downtime for approximately 6 hours. Between 11:45am and 6:08pm PDT, the Heap dashboard was unavailable and collector endpoints had poor availability: all identify requests were unsuccessful and approximately 54% of event track calls were dropped. Because of this, your dashboard will be missing data for October 28, 2015.

This downtime is totally unacceptable to us, and is the worst availability event Heap has had. Our team is working hard to prevent incidents like this in the future. This post details what happened, and what we're doing to fix it.

At 9:44am, our infrastructure monitoring alerted our on-call engineer that a core database backing our identify API and dashboard was running low on resources. He immediately started a maintenance job to free up resources. However, by 10:15am it was clear that the cleanup would not finish before the database stopped processing queries. At that point, the issue was escalated to our entire systems team. We launched replacement hardware as a contingency and attempted to reduce load on the unhealthy database. At 11:46am, the database went offline and entered emergency maintenance mode. Because we have account information collocated with identify information, the dashboard immediately went offline. Meanwhile, our collector cluster handled this severed connection poorly: servers began cycling in and out of the load balancer, resulting in dropped track calls.

At 5:50pm, the original maintenance job completed, and we re-enabled the dashboard. Our collector cluster recovered automatically, and by 6:08pm, everything had returned to normal.

We’re actively working to improve the robustness of this component of our infrastructure, including adding the ability to quickly fail over to hot spares. We're also working on isolating our collector architecture so that it will continue collecting 100% of event data even if our identify backend is degraded.
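As a rough illustration of the isolation described above, the sketch below keeps the track path independent of the identify backend and queues identify calls while that backend is down. It is not Heap's actual implementation: `Collector`, `InMemoryIdentifyStore`, `IdentifyBackendDown`, and `max_pending` are hypothetical names, and the in-memory list stands in for whatever durable event log a real collector would use.

```python
import collections


class IdentifyBackendDown(Exception):
    """Raised by the (hypothetical) identify store when it is unreachable."""


class InMemoryIdentifyStore:
    """Stand-in for the identify database; toggle `available` to simulate an outage."""

    def __init__(self):
        self.available = True
        self.saved = {}

    def save(self, user_id, props):
        if not self.available:
            raise IdentifyBackendDown()
        self.saved[user_id] = props


class Collector:
    """Accepts track events regardless of the identify backend's health."""

    def __init__(self, identify_store, max_pending=10000):
        self.identify_store = identify_store
        self.events = []  # assumption: stand-in for a durable event log
        # Bounded queue: once full, the oldest pending identify is discarded.
        self.pending_identifies = collections.deque(maxlen=max_pending)

    def track(self, event):
        # The track path never touches the identify backend.
        self.events.append(event)
        return True

    def identify(self, user_id, props):
        # Try the backend; on failure, queue the call instead of failing the server.
        try:
            self.identify_store.save(user_id, props)
            return True
        except IdentifyBackendDown:
            self.pending_identifies.append((user_id, props))
            return False

    def replay_pending(self):
        # Replay queued identifies in order; stop at the first failure.
        replayed = 0
        while self.pending_identifies:
            user_id, props = self.pending_identifies[0]
            try:
                self.identify_store.save(user_id, props)
            except IdentifyBackendDown:
                break
            self.pending_identifies.popleft()
            replayed += 1
        return replayed
```

In this sketch, an outage of the identify store delays identify calls but does not affect `track`, and `replay_pending` can be called once the store is healthy again.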

If you have any questions about this, please reach out to support@heapanalytics.com.

Posted Oct 29, 2015 - 20:32 PDT

Resolved

Collectors are now fully operational. No historical data has been lost, but some data from between 11:45am and 6:08pm PT on 2015-10-28 will be missing in your dashboard. Our systems team is investigating the extent of the missing data and will post a postmortem with details tomorrow.

Posted Oct 28, 2015 - 18:13 PDT

Update

The dashboard is now available again for all users. Collector endpoint error rates are beginning to stabilize. We will provide another update once error rates are back to 0.

Posted Oct 28, 2015 - 17:56 PDT

Monitoring

Collector endpoints are currently failing approximately 50% of requests, and the dashboard will be unavailable for at least the next 2 hours.

Our systems team is all hands on deck performing emergency maintenance to restore collector and dashboard access.

Posted Oct 28, 2015 - 14:01 PDT

Update

Data collection has returned to normal. We're still investigating the underlying issue. Data was not collected, or was collected only sporadically, between 12:50pm and 1:00pm PDT.

Posted Oct 28, 2015 - 13:01 PDT

Update

Our data collection endpoints are currently returning 500 errors, and data is not being collected. We're investigating the issue.

Posted Oct 28, 2015 - 12:54 PDT

Investigating

The dashboard is currently unavailable for a majority of users. Our systems team is investigating.

Posted Oct 28, 2015 - 11:57 PDT