

Identity services - Oregon performance degradation

Incident window: July 28, 2019 15:34 UTC to August 3, 2019 21:02 UTC

Impacted Cloud services: Identity services (Oregon)

Impact level: High

[2019-07-28 15:42 UTC] Identity services performance in Oregon has been significantly degraded since 15:34 UTC. The performance of Identity services in other regions is not affected. We are investigating.

[2019-07-28 16:30 UTC] Performance has degraded further since the initial detection, and Identity services may be unavailable. We are continuing to investigate and aim to mitigate the issue as soon as possible.

[2019-07-28 17:20 UTC] Our initial mitigation actions did not yield significant improvement. We are continuing to work toward mitigating the symptoms and restoring service availability and performance to normal operating levels.

[2019-07-28 18:31 UTC] The next set of mitigation actions we have performed also did not yield significant improvements. We are continuing to work toward addressing the symptoms.

[2019-07-28 19:42 UTC] Resolved. We have failed over to a new datacenter, and the service has been available and performing with normal latency since 19:36 UTC. We are continuing to investigate the underlying cause and will continue to monitor the service closely.

[2019-08-02 17:36 UTC] Identity services performance in Oregon has been significantly degraded since 17:23 UTC. The performance of Identity services in other regions is not affected. We are investigating.

[2019-08-02 18:22 UTC] Performance has degraded further since the initial detection, and Identity services may be unavailable. We are continuing to investigate and aim to mitigate the issue as soon as possible.

[2019-08-02 19:20 UTC] As we continue to investigate, we are scaling up our infrastructure in an attempt to help mitigate the symptoms and restore service availability and performance.

[2019-08-02 20:23 UTC] Scaling up the infrastructure has not yielded significant improvements. We are considering other options as we continue to investigate.

[2019-08-02 21:22 UTC] Resolved. We have replaced the running infrastructure, and the service has been available and performing with normal latency since 21:16 UTC. We are continuing to investigate the underlying cause and will continue to monitor the service closely.

[2019-08-03 14:40 UTC] We are seeing degraded performance and are working to restore full services.

[2019-08-03 19:00 UTC] We are continuing to see recurring issues and are working on additional mitigations.

[2019-08-03 21:02 UTC] Resolved. Service is restored and we are monitoring it closely. We are continuing work to identify the root cause.

[2019-08-04 02:00 UTC] The services are currently performing within normal parameters.

[2019-08-06 15:29 UTC] The services have been continuing to perform within normal parameters. We are continuing our investigation to identify the root cause.

[2019-08-06 22:33 UTC] On August 7, between 06:00 UTC and 08:00 UTC, we will be performing maintenance to further stabilize the Identity services in the Oregon region. We have already contacted the specific customers who may be affected, though no downtime or performance disruption is anticipated.

[2019-08-07 07:30 UTC] The stabilization maintenance to improve load balancing was completed during the stated window without any issues. Services are performing within normal parameters.

[2019-08-07 17:21 UTC] The services have been continuing to perform within normal parameters. We are continuing our investigation to identify the root cause.

[2019-08-08 14:10 UTC] The services have been continuing to perform within normal parameters. We are continuing our investigation to identify the root cause.

[2019-08-09 13:14 UTC] The services have been continuing to perform within normal parameters. We are continuing our investigation to identify the root cause.

[2019-08-12 18:09 UTC] The services have been continuing to perform within normal parameters. We are continuing our investigation to identify the root cause.

[2019-08-14 11:33 UTC] The services have been continuing to perform within normal parameters. We are continuing our investigation to identify the root cause.

[2019-08-15 11:49 UTC] The services have been continuing to perform within normal parameters. We are continuing our investigation to identify the root cause.

[2019-08-19 12:19 UTC] The services have been continuing to perform within normal parameters. We are continuing our investigation to identify the root cause.

[2019-08-22 16:54 UTC] Sharing the root cause analysis (below), which has also been shared with affected customers who opened support cases.

Root Cause Analysis

Problem Statement

About 80% of Identity service calls failed with HTTP 500 errors, interfering with login and token verification for customers in the Oregon region. There were three occurrences of the issue: July 28, August 2, and August 3, 2019.

The issue was automatically detected by our monitoring platform within minutes, enabling our operations team to immediately begin investigation and remediation actions.

Reported Behaviors

Client applications connecting to the Oregon region of Kony Cloud experienced issues authenticating application users. From the logs, it was observed that the Identity services in the Oregon region were having problems connecting to the database.

Problem Analysis & Fix

The following actions were taken based on the symptoms observed during the events and the follow-on analysis:

Mitigation Actions
Sunday, July 28, 2019
Friday, August 2, 2019
Saturday, August 3, 2019

Additional Analysis

Additional Mitigation Actions
Wednesday, August 7, 2019
Solution/Action Plan

Based on the above analysis, Kony has identified the following action items:

  1. Optimize the flows in the Fabric client, such as caching profile objects, to reduce the burden on Identity and improve Fabric performance. This will lower the overall demands on the Identity system and database (a caching sketch follows this list).
  2. The detailed review identified conditions where additional caching can be leveraged to decrease usage of the database connection pool while improving the performance of certain APIs.
  3. Capture additional data in the case of longer running transactions to get insights into API usage and to identify areas where we might improve in the future.
  4. Code flow analysis is underway to reduce the overall time a database connection is required for each API. Where connections were previously acquired early, acquisition will be deferred until the connection is actually needed, rather than obtained aggressively or earlier than necessary (see the connection-handling sketch after this list). This increases the effective capacity of the connection pool by minimizing the time connections are in use. This work will span multiple hotfix releases.
  5. Reviews are underway to identify any code path where a connection can be released before functionally unrelated code (cleanup, statistics, etc.) executes. This helps ensure that latency in unrelated functional areas does not reduce the overall capacity of the database connection pool.
  6. Based on performance testing and code reviews related to this issue, optimize database queries to reduce locking and latency.
  7. Upgrade to the latest version of the connection pool manager to pick up maintenance fixes and additional pool-management functionality (an illustrative pool configuration follows this list).
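
The caching in items 1 and 2 is not described in implementation detail; the following is a minimal sketch, assuming a time-bounded in-memory cache keyed by user so that repeated profile lookups are served from memory instead of taking a connection from the database pool. The ProfileCache name, the loader function, and the TTL are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Minimal sketch of a time-bounded, in-memory profile cache. All names here
// are hypothetical; the caching described in the action items may differ.
public class ProfileCache<P> {

    private record Entry<T>(T profile, Instant loadedAt) { }

    private final Map<String, Entry<P>> entries = new ConcurrentHashMap<>();
    private final Function<String, P> databaseLoader; // briefly uses a pooled connection
    private final Duration ttl;

    public ProfileCache(Function<String, P> databaseLoader, Duration ttl) {
        this.databaseLoader = databaseLoader;
        this.ttl = ttl;
    }

    public P getProfile(String userId) {
        Entry<P> cached = entries.get(userId);
        if (cached != null && cached.loadedAt().plus(ttl).isAfter(Instant.now())) {
            // Cache hit: no database connection is taken from the pool.
            return cached.profile();
        }
        // Cache miss or expired entry: load once, then serve later calls from memory.
        // (A production cache would also evict stale entries rather than only overwrite them.)
        P profile = databaseLoader.apply(userId);
        entries.put(userId, new Entry<>(profile, Instant.now()));
        return profile;
    }
}
```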
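
Items 4 and 5 describe narrowing the window during which a pooled database connection is held. The sketch below illustrates the general pattern, assuming a standard JDBC DataSource; the TokenLookup class, the query, and recordStatistics are hypothetical stand-ins for the Identity code paths, not the actual implementation.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

// Simplified illustration of items 4 and 5: hold a pooled connection only for
// the database work itself, and release it before unrelated processing runs.
public class TokenLookup {

    private final DataSource dataSource;

    public TokenLookup(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public boolean isTokenValid(String tokenId) throws SQLException {
        boolean valid;
        // Acquire the connection only when the query is about to run (item 4),
        // and return it to the pool as soon as the query completes (item 5).
        try (Connection connection = dataSource.getConnection();
             PreparedStatement statement =
                     connection.prepareStatement("SELECT 1 FROM tokens WHERE id = ?")) {
            statement.setString(1, tokenId);
            try (ResultSet resultSet = statement.executeQuery()) {
                valid = resultSet.next();
            }
        }
        // Unrelated work (cleanup, statistics) happens after the connection is
        // back in the pool, so its latency no longer reduces pool capacity.
        recordStatistics(tokenId, valid);
        return valid;
    }

    private void recordStatistics(String tokenId, boolean valid) {
        // Placeholder for functionally unrelated bookkeeping.
    }
}
```

Because the connection is released before recordStatistics runs, slowness in that bookkeeping code no longer keeps a pool slot occupied, which is the effect items 4 and 5 aim for.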
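
The RCA does not name the connection pool manager in use. As an illustration only, the snippet below shows how a commonly used JDBC pool manager, HikariCP, exposes the kind of pool-management functionality item 7 refers to (sizing, timeouts, leak detection); the URL and values are hypothetical placeholders, not the settings used in this environment.

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

// Illustration only: typical tuning knobs in HikariCP, a common JDBC connection
// pool. The JDBC URL and numbers below are hypothetical placeholders.
public class PoolSetup {

    public static HikariDataSource createDataSource() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:mysql://identity-db.example.internal:3306/identity");
        config.setMaximumPoolSize(50);            // upper bound on pooled connections
        config.setConnectionTimeout(5_000);       // fail fast instead of queueing indefinitely
        config.setMaxLifetime(1_800_000);         // recycle connections periodically
        config.setLeakDetectionThreshold(10_000); // log connections held longer than expected
        return new HikariDataSource(config);
    }
}
```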
Deliverables
Is the solution provided complete & concrete?

The above action items address the symptoms identified in the logs and dumps and through the analysis performed by the teams. At this point we cannot conclude that these solutions represent a permanent fix of the problem, and there is a risk that the mitigations may not prevent a recurrence.

What could have been done better?

Kony has recommended that customers who grow beyond 50M sessions per year purchase a dedicated Identity Environment. These recommendations have been communicated to customers, but we recognize that it is difficult for customers to plan this type of migration. Kony will work toward converting this recommendation into a requirement at certain traffic levels and will more clearly communicate the risk of using shared Identity for higher-traffic applications. A dedicated Identity Environment provides isolation which 1) allows tuning for load specific to one customer, 2) prevents high traffic or issues from one customer from impacting other customers, 3) provides additional security by being in a segregated environment, and 4) allows for easier troubleshooting and isolation of any issue.