On Wednesday afternoon, Heap experienced severe downtime for approximately 6 hours. Between 11:45am and 6:08pm PDT, the Heap dashboard was unavailable and our collector endpoints had poor availability: all identify requests failed and approximately 54% of event track calls were dropped. As a result, your dashboard will be missing data for October 28, 2015.
This downtime is totally unacceptable to us, and is the worst availability event Heap has had. Our team is working hard to prevent incidents like this in the future. This post details what happened, and what we're doing to fix it.
At 9:44am, our infrastructure monitoring alerted our on-call engineer that a core database backing our identify API and dashboard was running low on resources. He immediately started a maintenance job to free up resources. However, by 10:15am it was clear that the cleanup would not finish before the database stopped processing queries. At that point, the issue was escalated to our entire systems team. We launched replacement hardware as a contingency and attempted to reduce load on the unhealthy database. At 11:46am, the database went offline and entered emergency maintenance mode. Because our account information is colocated with identify information, the dashboard immediately went offline. Meanwhile, our collector cluster handled the severed connection poorly: servers began cycling in and out of the load balancer, resulting in dropped track calls.
At 5:50pm, the original maintenance job completed, and we re-enabled the dashboard. Our collector cluster recovered automatically, and by 6:08pm, everything had returned to normal.
We’re actively working to improve the robustness of this component of our infrastructure, including adding the ability to quickly fail over to hot spares. We're also working on isolating our collector architecture so that it continues collecting 100% of event data even if our identify backend is degraded.
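To make the isolation idea concrete, here is a minimal Python sketch of a collector whose track path never touches the identify backend. It is an illustration under stated assumptions, not Heap's actual implementation: the `Collector` class, the `IdentifyBackendDown` exception, the in-memory event list, and the bounded replay buffer are all hypothetical names and choices made for this example.

```python
from collections import deque


class IdentifyBackendDown(Exception):
    """Raised by the (hypothetical) identify backend when it is unavailable."""


class Collector:
    """Toy collector: track calls never depend on the identify backend."""

    def __init__(self, identify_backend, max_buffered=10_000):
        # identify_backend is any callable(user_id, props) that may raise
        # IdentifyBackendDown. This interface is an assumption of the sketch.
        self.identify_backend = identify_backend
        self.events = []  # stand-in for a durable event log
        # Bounded buffer of identify calls awaiting replay; when full,
        # the oldest entry is discarded to make room for the newest.
        self.pending = deque(maxlen=max_buffered)

    def track(self, user_id, event):
        # Only appends to the event log; no identify lookup on this path.
        self.events.append((user_id, event))
        return True

    def identify(self, user_id, props):
        try:
            self.identify_backend(user_id, props)
            return True
        except IdentifyBackendDown:
            # Buffer the call instead of failing the whole server.
            self.pending.append((user_id, props))
            return False

    def replay_pending(self):
        """Retry buffered identify calls in order; stop at the first failure."""
        replayed = 0
        while self.pending:
            user_id, props = self.pending[0]
            try:
                self.identify_backend(user_id, props)
            except IdentifyBackendDown:
                break
            self.pending.popleft()
            replayed += 1
        return replayed
```

In this sketch, a backend outage only delays identify calls: `track` keeps succeeding, and `replay_pending` can be run once the backend is healthy again. A production system would need a durable queue rather than in-memory structures.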
If you have any questions about this, please reach out to support@heapanalytics.com.