Temporary Server Downtime

Incident Report for Heap

Postmortem

On July 22nd, 2015, the Heap collector and dashboard were unavailable for approximately 30 minutes, and we lost all event data submitted during that period (between 2:56pm and 3:32pm PDT). Websites and apps using Heap were not affected by this downtime, so your users should not have been impacted.

This data loss is absolutely unacceptable to us, and we're changing our policies to prevent it from happening again. For background: our event collector cluster is isolated from the rest of our stack and can continue to collect data even if other servers have failed. However, the code change that caused this outage overloaded our servers to the point that they were completely unresponsive to administrative tools and required a manual restart by the on-call engineer.

We're changing our deploy policies to require staged deploys for all changes affecting our collectors. We've also set up logging at our load balancer, which will allow us to recover collector data even if our entire cluster is down.
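
As a rough illustration of how load balancer logs can support this kind of recovery, here is a minimal Python sketch. It is not a description of Heap's actual tooling: the log line format (an ISO 8601 timestamp, an HTTP method, and a URL), the `/h` path, and the `recover_events` name are all assumptions made for the example. It filters logged requests to an outage window and extracts their query parameters so they could be replayed into a collector later.

```python
from datetime import datetime, timezone
from urllib.parse import urlsplit, parse_qs


def recover_events(log_lines, start, end):
    """Return (timestamp, params) pairs for logged requests inside [start, end].

    Assumes a hypothetical log format: "<ISO-8601 timestamp> <METHOD> <url>".
    """
    recovered = []
    for line in log_lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip lines that do not match the assumed format
        raw_ts, _method, url = parts
        try:
            ts = datetime.fromisoformat(raw_ts)
        except ValueError:
            continue  # skip lines with an unparseable timestamp
        if ts.tzinfo is None:
            continue  # skip timestamps without an explicit UTC offset
        if not (start <= ts <= end):
            continue  # request falls outside the outage window
        # Keep the first value of each query parameter as the event payload.
        query = urlsplit(url).query
        params = {key: values[0] for key, values in parse_qs(query).items()}
        recovered.append((ts, params))
    return recovered


# Example window: 2:56pm to 3:32pm PDT on July 22nd, 2015, expressed in UTC.
OUTAGE_START = datetime(2015, 7, 22, 21, 56, tzinfo=timezone.utc)
OUTAGE_END = datetime(2015, 7, 22, 22, 32, tzinfo=timezone.utc)
```

Calling `recover_events(lines, OUTAGE_START, OUTAGE_END)` on the logged lines returns only the requests received during the window, which could then be fed back through the normal ingestion path once the cluster is healthy.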

If you have any questions about this, please reach out to support@heapanalytics.com.

Posted Jul 23, 2015 - 15:48 PDT

Resolved

We're back! We've pushed a fix to our servers, and our dashboard and API endpoints are all functional again.

Data points between 9:51pm and 10:31pm UTC may have been dropped.

Posted Jul 22, 2015 - 15:35 PDT

Investigating

A recent update to our servers resulted in elevated memory usage in our app and collector endpoints, causing some of our dashboard and API endpoints to issue 500 responses.

We're actively looking into the issue now and hope to have a fix live within the hour.

Posted Jul 22, 2015 - 15:03 PDT