Tokyo performance degradation

Incident window: March 19, 2021 01:11 UTC to April 1, 2021 02:00 UTC

Impacted Cloud services:

Impact Level: high

The performance and availability of Cloud services in other regions are not impacted.

[2021-03-19 01:11 UTC] Excessive latency for Identity services in the Tokyo region.

[2021-03-19 01:15 UTC] Latency averages have worsened to 50 seconds, affecting most APIs.

[2021-03-19 01:25 UTC] Latency averages have improved to 20 seconds but are still impacting most APIs.

[2021-03-19 02:30 UTC] We are continuing to work to restore services.

[2021-03-19 04:20 UTC] Latency averages have improved to 15 seconds but are still impacting most APIs.

[2021-03-19 04:36 UTC] Latency averages have returned to normal.

[2021-03-19 06:00 UTC] We are continuing to investigate the root cause, but there is a risk of recurrence.

[2021-03-20 02:00 UTC] Systems have performed normally for the last 24 hours. Issue escalated to Engineering to assist with root cause analysis (RCA).

[2021-03-21 02:00 UTC] Systems have performed normally for the last 24 hours. Investigations continue.

[2021-03-22 02:00 UTC] Systems have performed normally for the last 24 hours. Investigations continue.

[2021-03-23 02:00 UTC] Systems have performed normally for the last 24 hours. Investigations continue.

[2021-03-24 00:56 UTC] The excessive latency issue has recurred for Identity services in the Tokyo region.

[2021-03-24 01:05 UTC] We are working to mitigate the issue.

[2021-03-24 01:25 UTC] Latency averages have degraded to 25 seconds and are impacting most APIs.

[2021-03-24 01:10 UTC] Latency averages have improved to 10 seconds but are still impacting most APIs.

[2021-03-24 02:00 UTC] Identified high latency on the cache cluster; working to mitigate the impact.

[2021-03-24 02:17 UTC] Latency averages have improved to 15 seconds and are impacting most APIs.

[2021-03-24 02:48 UTC] New cache cluster created; reprovisioning systems to use the new cluster.

[2021-03-24 03:11 UTC] New set of Identity nodes started taking requests.

[2021-03-24 03:22 UTC] Removed all old nodes from load balancers.

[2021-03-24 03:24 UTC] [Mitigated] Removed all old nodes from load balancers.
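
The cutover recorded in the entries above (new nodes start taking requests, then old nodes leave the load balancers) can be illustrated with a minimal sketch. It is not the tooling used during this incident: the `Node` type, the "old"/"new" generation labels, and the `min_new_healthy` threshold are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Node:
    name: str
    generation: str  # hypothetical label: "old" or "new"
    healthy: bool


def cut_over(pool: List[Node], min_new_healthy: int) -> List[Node]:
    """Return the load balancer pool after attempting a cutover.

    Old nodes are removed only once at least `min_new_healthy` new nodes
    are healthy; otherwise the pool is returned unchanged so traffic
    keeps flowing to the old generation.
    """
    new_healthy = [n for n in pool if n.generation == "new" and n.healthy]
    if len(new_healthy) < min_new_healthy:
        return list(pool)
    # Keep only healthy new-generation nodes; old and unhealthy nodes are dropped.
    return new_healthy


pool = [
    Node("old-1", "old", True),
    Node("old-2", "old", True),
    Node("new-1", "new", True),
    Node("new-2", "new", True),
]
after = cut_over(pool, min_new_healthy=2)
print([n.name for n in after])
```

Gating the removal on a minimum count of healthy new nodes is one possible design choice; it avoids draining the old generation before its replacement can carry the load.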

[2021-03-24 06:00 UTC] We are continuing to investigate the root cause, but there is a risk of recurrence.

[2021-03-25 02:00 UTC] Systems have performed normally for the last 24 hours. We suspect a client cache library issue.

[2021-03-26 02:00 UTC] Systems have performed normally for the last 24 hours. Root cause identified as stale data accumulating in an unmanaged cache key. A potential fix is being tested internally.
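
As an illustration of the failure class named above, and not of the actual fix (which this log does not describe), the sketch below shows how entries appended under a single cache key grow without bound unless something expires them. The `TtlCache` class, its methods, and the key name are hypothetical and assume a simple in-memory store.

```python
import time
from typing import Dict, List, Optional, Tuple


class TtlCache:
    """Tiny in-memory cache where every entry under a key carries an expiry time."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl_seconds = ttl_seconds
        self._store: Dict[str, List[Tuple[float, str]]] = {}

    def append(self, key: str, value: str, now: Optional[float] = None) -> None:
        """Add a value under `key`, stamped with its expiry time."""
        now = time.monotonic() if now is None else now
        self._store.setdefault(key, []).append((now + self.ttl_seconds, value))

    def prune(self, key: str, now: Optional[float] = None) -> int:
        """Drop expired entries under `key` and return how many were removed."""
        now = time.monotonic() if now is None else now
        entries = self._store.get(key, [])
        live = [(expiry, value) for expiry, value in entries if expiry > now]
        self._store[key] = live
        return len(entries) - len(live)

    def size(self, key: str) -> int:
        return len(self._store.get(key, []))


cache = TtlCache(ttl_seconds=60)
cache.append("sessions", "a", now=0)    # expires at 60
cache.append("sessions", "b", now=10)   # expires at 70
cache.append("sessions", "c", now=100)  # expires at 160
removed = cache.prune("sessions", now=80)
print(removed, cache.size("sessions"))
```

Without the `prune` step, which stands in for the manual clearing, the list under the key would keep every entry ever written, which is the kind of accumulation the root cause entry describes.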

[2021-03-27 02:00 UTC] Systems have performed normally for the last 24 hours. Internal testing validated; functional testing in progress. No risk of recurrence, as a mitigation plan is in place.

[2021-03-28 02:00 UTC] Systems have performed normally for the last 24 hours. The cache is being cleared manually each day. Test candidate released internally. Performance testing of the fix is in progress.

[2021-03-29 02:00 UTC] Systems have performed normally for the last 24 hours. The cache is being cleared manually each day. Performance testing of the fix is in progress.

[2021-03-30 02:00 UTC] Systems have performed normally for the last 24 hours. The cache is being cleared manually each day. QA candidate released for full QA testing.

[2021-04-01 02:00 UTC] Systems have performed normally for the last 24 hours. The cache is being cleared manually each day. GA candidate released. Scheduling for staging (pre-production) deployment.

[2021-04-01 02:00 UTC] GA candidate approved. Requesting a schedule for production deployment.

[2021-04-01 02:00 UTC] [Resolved] See the fix maintenance notice.