Collectors and Heap App experiencing intermittent errors
Incident Report for Heap
Postmortem

This postmortem is an update to a previous incident postmortem available here.

Summary

Between October 5th and October 23rd, Heap experienced periodic elevated error rates on our data collection endpoint. These errors manifested as 504 responses to tracking clients for Heap’s auto-capture SDKs as well as to server-side requests to Heap’s track endpoint. The root cause of this issue has since been identified and a fix has been deployed to production as of 05:02 UTC on 10/23.

We have confirmed that the underlying issue has been addressed and no abnormal number of failed requests has occurred since this resolution. Please refer to the information below for the technical details of this problem and its resolution, as well as the steps we’re taking to ensure this type of problem doesn’t happen in the future.

As a result of these intermittent issues, you may notice a small percentage of events (0.26% in aggregate) missing in your Heap account during this time period. Please refer to the details below for more specificity on the extent of errors over this time period and the impact on your Heap data set.

We understand that you rely on Heap to capture all your customer data and we sincerely apologize for any issues this may have caused you. We treat data collection outages with utmost priority as we strive for a complete and trustworthy data set. Please don't hesitate to reach out to support@heapanalytics.com if you have any questions about how this incident may have affected your dataset or if there's anything else we can help out with!

Technical Details

As part of our ingestion infrastructure, we run Node.js data collection processes behind nginx, which in turn sits behind a load balancer. When this issue began, the initial symptom was short-lived spikes of 504 errors returned from nginx to the load balancer, which would subside when our data collection processes were restarted. While examining the data collection processes, we noticed that during the spikes, CPU usage, event loop saturation, and time spent in garbage collection were abnormally high. While profiling the process, we determined that approximately 50% of CPU time was being spent in async.retry, and 20% in garbage collection. We use async.retry from version 1.5.1 of the async npm library to reliably enqueue messages to Apache Kafka, with a retry limit of 100k.
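For context, the sketch below shows the shape of this pattern: a message enqueued to Kafka through async.retry with a very high retry limit. It is illustrative only; the names enqueueToKafka and producer.send and the interval value are assumptions, not Heap’s actual code.

```javascript
// A minimal sketch of enqueueing a message to Kafka via async.retry with a
// very high retry limit, as described above. `enqueueToKafka`, `producer.send`,
// and the interval value are illustrative stand-ins, not Heap's actual code.
const async = require('async'); // async@1.5.1 at the time of the incident

function enqueueToKafka(producer, payloads, callback) {
  async.retry(
    { times: 100000, interval: 100 }, // retry limit of 100k, as described above
    (retryCallback) => producer.send(payloads, retryCallback),
    callback
  );
}
```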

In examining the source code of async.retry, we noticed that the implementation in version 1.5.1 creates an array of functions during invocation: one for each attempt of the function and one for each retry interval. With our high retry limit, large arrays of functions were created each time we enqueued a message to Kafka. This increased heap usage dramatically, leading to longer garbage collections, a buildup of requests during garbage collection pauses, and requests timing out.
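The snippet below is a simplified illustration of that allocation behavior, not the library’s exact source: it only shows how a high retry limit turns into a large up-front array of closures on every invocation.

```javascript
// A simplified illustration (not the library's exact source) of why a high
// retry limit is expensive in async@1.5.1: each call to retry() builds the
// full list of wrapper functions up front, one closure per attempt plus one
// per retry interval, before the first attempt even runs.
function buildRetryAttempts(times, interval, task) {
  const attempts = [];
  for (let remaining = times; remaining > 0; remaining--) {
    attempts.push((next) => task(next)); // one closure per attempt
    if (remaining > 1 && interval > 0) {
      attempts.push((next) => setTimeout(next, interval)); // one per interval
    }
  }
  return attempts; // the library then runs these in series, stopping on success
}

// With a 100k retry limit, roughly 200k closures are allocated per message:
console.log(buildRetryAttempts(100000, 100, (cb) => cb()).length); // 199999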

This issue only recently presented itself due to changes to our TLS connections to ensure encryption throughout our entire network. As a result of these changes, communication with Kafka is slightly slower, leading to more time spent in async.retry. This causes objects created during the invocation of async.retry to be promoted to old space, which is garbage collected infrequently, leading to longer garbage collection times.
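For readers who want to observe this effect in their own Node.js processes, the sketch below samples old-space usage with the built-in v8 module. This is a generic diagnostic suggestion, not part of Heap’s described tooling.

```javascript
// A generic diagnostic sketch (an assumption about tooling, not part of
// Heap's described fix): sampling old-space usage with Node's built-in v8
// module. Objects that stay reachable while a slow Kafka round trip
// completes are promoted into old_space, which is collected less often and
// at greater cost.
const v8 = require('v8');

const sampler = setInterval(() => {
  const oldSpace = v8
    .getHeapSpaceStatistics()
    .find((space) => space.space_name === 'old_space');
  console.log(
    'old_space used: %d MB of %d MB',
    Math.round(oldSpace.space_used_size / 1e6),
    Math.round(oldSpace.space_size / 1e6)
  );
}, 10000);

sampler.unref(); // don't keep the process alive just for this sampler
```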

Once we were clear on the implications of using high retry limits with this particular version of async.retry, we deployed a change to lower the retry limit. We have since seen approximately a 50% decrease in median time spent per garbage collection run, a significant decrease in event loop saturation, and much more evenly distributed CPU usage across functions. Consequently, request timeouts manifesting as 504s have also returned to normal levels.

Moving forward, we are adding better monitoring to our data collection processes, building better tooling for diagnosing and fixing these kinds of issues, and improving the robustness of our data collection infrastructure. We will also be upgrading to async 2.0, which has an improved retry implementation that handles our use case much more gracefully.
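As a rough sketch of what that upgrade path could look like (the names and the specific retry and backoff values here are illustrative assumptions, not Heap’s actual configuration), async 2.x keeps the same retry() call shape but evaluates retries lazily and accepts an interval function:

```javascript
// A hedged sketch of the upgrade path: async@2.x keeps the same retry()
// call shape but evaluates retries lazily (no pre-built array of attempts)
// and accepts an interval function, so a backoff policy can replace a very
// large fixed retry count. Names and values are illustrative only.
const async = require('async'); // async@2.x

function enqueueToKafka(producer, payloads, callback) {
  async.retry(
    {
      times: 10, // far lower than the previous 100k limit
      interval: (retryCount) => 50 * Math.pow(2, retryCount) // exponential backoff
    },
    (retryCallback) => producer.send(payloads, retryCallback),
    callback
  );
}
```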

Error Rates and Data Loss

As previously discussed, this issue surfaced errors for data collection requests in small spikes; the time periods in which these spikes occurred and the error rates experienced are listed below. The error rates below are a proxy for data loss but should not be taken as exact estimates. In practice, the proportion of missing data is often lower due to retry logic in place for many requests: failed requests are often retried successfully, or may fail multiple times in the process of retrying. Conversely, Heap’s client-side tracking may stop sending requests when non-200 status codes are returned. Taking this into consideration, we recommend that you contact us at support@heapanalytics.com if you have any specific questions regarding how this incident affected your organization or if you notice any abnormalities.

Overall (10/05 16:00 - 10/23 05:02)

Over the entire period of intermittent errors, 0.26% of requests failed with 5xx status codes.

10/05 16:20-22:00

During this time period, we experienced one large spike in errors followed by many smaller spikes with an aggregate error rate of 5.7%.

10/06 13:30-21:00

During this time period, we experienced several small spikes in errors with an aggregate error rate of 3.3%.

10/07 19:05-19:20

During this time period, we experienced one large spike in errors with an aggregate error rate of 1%.

10/19 17:46-21:07

During this time period, we experienced several medium spikes in errors with an aggregate error rate of 4.7%.

10/20 18:24-18:53

During this time period, we experienced several medium spikes in errors followed by a large sustained period with an aggregate error rate of 11.2%.

Posted Oct 25, 2017 - 09:09 PDT

Resolved
We've identified and addressed the root cause of recent endpoint instability. A postmortem will be released in the next few days.
Posted Oct 24, 2017 - 12:46 PDT
Investigating
Our data collection endpoints and app servers are experiencing elevated error rates. We're investigating the issue.
Posted Oct 19, 2017 - 13:26 PDT