
API and Login Performance

Incident window: February 13, 2016 08:30 UTC to February 22, 2016 12:45 UTC

Impacted Cloud services:

Impact Level: High

We have located and fixed the issue causing the widespread publish failures over the last two weeks. Our infrastructure uses a log parsing server to consume, filter, and hold near-realtime logs. Some time ago we moved our infrastructure components to an asynchronous model for posting logs to this service. Response times on the synchronous calls are generally quite low, but on occasion the number and size of log records pushed to the service would introduce latency in that API; that jitter was the main reason for moving to an asynchronous model.

In debugging the publish issue, we ultimately discovered that one code path was still posting directly to the logging service and, as noted above, at random intervals the synchronous API used for that post could exhibit unexpected latency. The result was an intermittent ‘pause’ of up to 30 seconds when calling the log ingestion API. The internal timeout was 10 seconds, so this latency caused publish activity to halt. As a workaround, we extended the timeout to 15, and then 20, seconds. Those changes eased, but did not eliminate, the failures seen by customers. Because this was happening in only one code path, it took some time to ‘divide-by-two’ down to the internal function call causing the excessive wait. In addition, 99% of calls in both our testing environment and production did not show the latency, which made the anomaly challenging to locate.

On Wednesday, February 17 at 3:47 PM GMT, we pushed an emergency change to production to disable the ‘happy path’ logging that was going directly to the logging service. Error logging was left enabled to maintain visibility into internal errors in the affected API. Note that we were not, and are not, seeing issues in the logging service itself: the latency caused the caller to abandon the request even though the service was completing successfully, just slowly.
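The failure mode above can be illustrated with a minimal sketch. This is not our production code; the function names and latencies are hypothetical stand-ins showing how a caller-side timeout shorter than the worst-case latency abandons a call that would otherwise complete successfully.

```python
import concurrent.futures
import time

def post_log_sync(latency_s):
    # Hypothetical stand-in for the synchronous log-ingestion call.
    # Even a "slow" call completes successfully; it is just slow.
    time.sleep(latency_s)
    return "ok"

def publish_with_timeout(latency_s, timeout_s):
    # The caller waits only timeout_s before abandoning the request,
    # which is what halted publish activity in the incident.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(post_log_sync, latency_s)
        try:
            return future.result(timeout=timeout_s)
        except concurrent.futures.TimeoutError:
            return "timeout"

print(publish_with_timeout(0.05, 0.2))  # typical fast call succeeds
print(publish_with_timeout(0.5, 0.2))   # occasional stall exceeds the timeout
```

Extending the timeout (our 15- and 20-second workaround) only moves the threshold; any stall longer than the new timeout still fails, which is why the workaround eased but did not eliminate the failures.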
The change to disable logging in the normal flow had an immediate effect: 99% of these calls now complete in under 400 ms, with a few outliers taking up to 2,000 ms, and we have since observed no latency issues in the API. On February 22 at 12:45 GMT we pushed a more permanent change to the component to use the asynchronous logging facility, and we have closed out our internal issue.
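The permanent fix, moving the remaining code path to asynchronous logging, can be sketched as a queue with a background worker. This is a minimal illustration under assumed names (`publish`, `slow_post` are hypothetical), not our actual implementation: the publish path only enqueues a record and returns immediately, so log-service latency never blocks callers.

```python
import queue
import threading
import time

log_queue = queue.Queue()
delivered = []

def slow_post(record):
    # Stand-in for the occasionally slow synchronous ingestion call.
    time.sleep(0.1)
    delivered.append(record)

def log_worker():
    # Background worker drains the queue; latency is absorbed here.
    while True:
        record = log_queue.get()
        if record is None:  # shutdown sentinel
            break
        slow_post(record)

worker = threading.Thread(target=log_worker, daemon=True)
worker.start()

def publish(doc_id):
    # Enqueue and return immediately; never wait on the log service.
    log_queue.put(f"published {doc_id}")
    return "ok"

start = time.monotonic()
for i in range(5):
    publish(i)
elapsed = time.monotonic() - start  # far below the 0.1 s per-record post time

log_queue.put(None)
worker.join()
print(f"publish loop took {elapsed:.3f}s; delivered {len(delivered)} records")
```

The design choice is the same one the rest of our infrastructure already used: callers trade immediate confirmation of log delivery for insulation from ingestion-side jitter, with the worker retaining responsibility for eventual delivery.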