

Kony Cloud status
Current status and incident report

Engagement Services Release

Maintenance window: September 9, 2019 00:01 to 04:00
The maintenance window start and end times are local to the region in which your Clouds are hosted. If you are unsure where your Clouds are hosted, you can hover over a Cloud Name in the Manage Clouds page of the Cloud Management Console and the region will be displayed.
Impacted Cloud services:
  • Engagement

    • Fix an issue with sending Android push notifications for legacy osTypes

    • Fix an issue where uploading an iOS production certificate to an app failed in a corner-case scenario


Impact Level: minor

No downtime is expected for the impacted Cloud services while this maintenance is being performed. The scheduled maintenance is designed to mitigate disruptions to service availability and performance for the impacted Cloud services. However, it is possible for the impacted Cloud services to be unavailable and/or performance degraded for a short period of time during the maintenance window. Note that no changes are being applied for other Cloud services outside of the list of impacted services above and no service availability or performance disruption is expected for other Cloud services.

Cloud Management Console Release

Maintenance window: September 2, 2019 04:00 to 08:00 UTC
Impacted Cloud services:
  • Cloud Management Console

    • Fix an issue in which users were unable to add operations on stored procedures/functions in the RDBMS Integration Service when an environment was selected


Impact Level: minor

No downtime is expected for the impacted Cloud services while this maintenance is being performed. The scheduled maintenance is designed to mitigate disruptions to service availability and performance for the impacted Cloud services. However, it is possible for the impacted Cloud services to be unavailable and/or performance degraded for a short period of time during the maintenance window. Note that no changes are being applied for other Cloud services outside of the list of impacted services above and no service availability or performance disruption is expected for other Cloud services.

Tokyo performance degradation

Incident window: August 23, 2019 03:36 to 13:05 UTC
Impacted Cloud services:
  • Cloud services in Tokyo

    • Performance degradation is limited to a very small percentage of customers using Cloud services in Tokyo


Impact Level: minor

Note that Cloud services in Tokyo are still available as the issues are isolated to a single data center and we are not observing the same performance issues across the other data centers in that region. Also, the performance and availability of Cloud services in other regions are not impacted.

[2019-08-23 13:30 UTC] Resolved. Our Cloud infrastructure provider reported that, beginning at 03:36 UTC, a small percentage of servers in a single data center in Tokyo shut down due to overheating caused by a control system failure that disabled multiple redundant cooling systems. The control and cooling systems were restored at 10:21 UTC, and server recovery completed by 13:05 UTC.

Identity Services Release

Maintenance window: August 26, 2019 00:01 to 04:00
The maintenance window start and end times are local to the region in which your Clouds are hosted. If you are unsure where your Clouds are hosted, you can hover over a Cloud Name in the Manage Clouds page of the Cloud Management Console and the region will be displayed.
Impacted Cloud services:
  • Identity

    • Introduce performance-related improvements to address DB connection issues

    • Add additional logging to capture long-running transactions occupying DB connections

    • ⚠️ For customers who have dedicated Identity services, we will start to apply the Identity services upgrade during this maintenance window, but expect to complete the upgrade for all customers’ dedicated Identity services over the next few weeks. In general, we do not expect any impact to your Identity services during the upgrade process. If we believe there may be any impact during your Identity services upgrade, we will notify owners and admins in your Cloud account(s) via a support case to communicate the impact and coordinate a suitable upgrade window.


Impact Level: minor

No downtime is expected for the impacted Cloud services while this maintenance is being performed. The scheduled maintenance is designed to mitigate disruptions to service availability and performance for the impacted Cloud services. However, it is possible for the impacted Cloud services to be unavailable and/or performance degraded for a short period of time during the maintenance window. Note that no changes are being applied for other Cloud services outside of the list of impacted services above and no service availability or performance disruption is expected for other Cloud services.

Cloud Management Console, Engagement Services, and Identity Services Releases

Maintenance window: August 19, 2019 00:01 to 04:00
The maintenance window start and end times are local to the region in which your Clouds are hosted. If you are unsure where your Clouds are hosted, you can hover over a Cloud Name in the Manage Clouds page of the Cloud Management Console and the region will be displayed.
Impacted Cloud services:
  • Cloud Management Console

    • Add a “Restrict to Fabric Server to Server Authentication” setting in the Advanced section of the Identity provider. This setting allows two Fabric Servers to communicate securely. Enabling it blocks a traditional client app from using this Identity Service and allows this Identity Service to be used only from a Fabric Server to authenticate and invoke services.

  • Engagement

    • Fix push notifications not being delivered when Google Firebase Cloud Messaging (FCM) returns an error code that is not listed in the FCM documentation (see the sketch after this list)

  • Identity

    • Change the Memcache client from spymemcached to the AWS ElastiCache client so that the Identity client can work with distributed cache systems, which spymemcached does not support

    • Identity services currently support enabling or disabling the integrity check at the MF application level. They are now enhanced to also support enabling or disabling the integrity check at the Identity Provider level.

    • ⚠️ For customers who have dedicated Identity services, we will start to apply the Identity services upgrade during this maintenance window, but expect to complete the upgrade for all customers’ dedicated Identity services over the next few weeks. In general, we do not expect any impact to your Identity services during the upgrade process. If we believe there may be any impact during your Identity services upgrade, we will notify owners and admins in your Cloud account(s) via a support case to communicate the impact and coordinate a suitable upgrade window.
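
To make the FCM fix above concrete, the following is a minimal, hypothetical sketch (the class, method, and constant names are ours, not Kony's) of the defensive behavior described: error strings documented for the legacy FCM HTTP API are handled explicitly, and any undocumented error code is treated as retryable so the notification is not silently dropped.

    import java.util.Set;

    // Hypothetical classifier for FCM downstream-message error strings.
    // Documented codes are handled explicitly; anything else falls through
    // to RETRY so an unexpected code does not cause the push to be lost.
    public final class FcmErrorClassifier {

        public enum Action { DELIVERED, RETRY, DROP }

        // A few of the error strings documented for the legacy FCM HTTP API.
        private static final Set<String> PERMANENT = Set.of(
                "InvalidRegistration", "NotRegistered", "MismatchSenderId",
                "MessageTooBig", "InvalidPackageName");

        private static final Set<String> TRANSIENT = Set.of(
                "Unavailable", "InternalServerError",
                "DeviceMessageRateExceeded", "TopicsMessageRateExceeded");

        public static Action classify(String error) {
            if (error == null) {
                return Action.DELIVERED;   // no error field: FCM accepted the message
            }
            if (PERMANENT.contains(error)) {
                return Action.DROP;        // bad token or payload: retrying cannot help
            }
            if (TRANSIENT.contains(error)) {
                return Action.RETRY;       // documented transient failure: retry with backoff
            }
            return Action.RETRY;           // undocumented code: fail safe and retry
        }
    }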


Impact Level: minor

No downtime is expected for the impacted Cloud services while this maintenance is being performed. The scheduled maintenance is designed to mitigate disruptions to service availability and performance for the impacted Cloud services. However, it is possible for the impacted Cloud services to be unavailable and/or performance degraded for a short period of time during the maintenance window. Note that no changes are being applied for other Cloud services outside of the list of impacted services above and no service availability or performance disruption is expected for other Cloud services.

Kony IQ Release

Maintenance window: August 12, 2019 00:01 to 04:00
The maintenance window start and end times are local to the region in which your Clouds are hosted. If you are unsure where your Clouds are hosted, you can hover over a Cloud Name in the Manage Clouds page of the Cloud Management Console and the region will be displayed.
Impacted Cloud services:
  • Kony IQ

    • Predict possible utterances and entities from DialogFlow while a user is typing a query


Impact Level: minor

No downtime is expected for the impacted Cloud services while this maintenance is being performed. The scheduled maintenance is designed to mitigate disruptions to service availability and performance for the impacted Cloud services. However, it is possible for the impacted Cloud services to be unavailable and/or performance degraded for a short period of time during the maintenance window. Note that no changes are being applied for other Cloud services outside of the list of impacted services above and no service availability or performance disruption is expected for other Cloud services.

Cloud Management Console Release

Maintenance window: August 5, 2019 04:00 to 08:00 UTC
Impacted Cloud services:
  • Cloud Management Console

    • Add support for configuring security headers, which will be appended to Identity API responses (a sketch follows)
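
As a rough illustration only (the header names and values below are generic examples, not Kony's defaults, and the filter is our own sketch rather than the actual implementation), a servlet filter that appends configured security headers to every response might look like this:

    import java.io.IOException;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.http.HttpServletResponse;

    // Hypothetical filter that appends administrator-configured security headers.
    public class SecurityHeadersFilter implements Filter {

        @Override
        public void init(FilterConfig filterConfig) { }

        @Override
        public void doFilter(ServletRequest request, ServletResponse response,
                             FilterChain chain) throws IOException, ServletException {
            HttpServletResponse res = (HttpServletResponse) response;
            // Example headers only; the real set is whatever the account admin configures.
            res.setHeader("Strict-Transport-Security", "max-age=31536000; includeSubDomains");
            res.setHeader("X-Content-Type-Options", "nosniff");
            res.setHeader("X-Frame-Options", "DENY");
            res.setHeader("Cache-Control", "no-store");
            chain.doFilter(request, response);
        }

        @Override
        public void destroy() { }
    }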


Impact Level: minor

No downtime is expected for the impacted Cloud services while this maintenance is being performed. The scheduled maintenance is designed to mitigate disruptions to service availability and performance for the impacted Cloud services. However, it is possible for the impacted Cloud services to be unavailable and/or performance degraded for a short period of time during the maintenance window. Note that no changes are being applied for other Cloud services outside of the list of impacted services above and no service availability or performance disruption is expected for other Cloud services.

Identity services - Oregon performance degradation

Incident window: July 28, 2019 15:34 UTC to August 3, 2019 21:02 UTC
Impacted Cloud services:
  • Identity services in Oregon


    Impact Level: high

    [2019-07-28 15:42 UTC] Identity services performance in Oregon is significantly degraded starting at 15:34 UTC. The performance of Identity services in other regions is not affected. We are investigating.

    [2019-07-28 16:30 UTC] Performance has more significantly degraded since the initial detection and Identity services may be unavailable. We are continuing to investigate and look to mitigate the issue as soon as possible.

    [2019-07-28 17:20 UTC] Our initial mitigation actions did not yield significant improvements. We are continuing to work toward mitigating the symptoms and restoring service availability and performance back to normal operating levels.

    [2019-07-28 18:31 UTC] The next set of mitigation actions we have performed also did not yield significant improvements. We are continuing to work toward addressing the symptoms.

    [2019-07-28 19:42 UTC] Resolved. We have failed over to a new datacenter and can see that the service is available and performing with normal latency as of 19:36 UTC. We are continuing to investigate the underlying cause and will be continuing to monitor the service closely.

    [2019-08-02 17:36 UTC] Identity services performance in Oregon is significantly degraded starting at 17:23 UTC. The performance of Identity services in other regions is not affected. We are investigating.

    [2019-08-02 18:22 UTC] Performance has more significantly degraded since the initial detection and Identity services may be unavailable. We are continuing to investigate and look to mitigate the issue as soon as possible.

    [2019-08-02 19:20 UTC] As we continue to investigate, we are scaling up our infrastructure in an attempt to help mitigate the symptoms and restore service availability and performance.

    [2019-08-02 20:23 UTC] Scaling up the infrastructure has not yielded significant improvements. We are considering other options as we continue to investigate.

    [2019-08-02 21:22 UTC] Resolved. We have replaced the running infrastructure and can see that the service is available and performing with normal latency as of 21:16 UTC. We are continuing to investigate the underlying cause and will be continuing to monitor the service closely.

    [2019-08-03 14:40 UTC] We are seeing degraded performance and are working to restore full services.

    [2019-08-03 19:00 UTC] We are continuing to see recurring issues and are working on additional mitigations.

    [2019-08-03 21:02 UTC] Resolved. Service is restored and we are closely monitoring. We continue the work to identify the root cause.

    [2019-08-04 02:00 UTC] The services are currently performing within normal parameters.

    [2019-08-06 15:29 UTC] The services have been continuing to perform within normal parameters. We are continuing our investigation to identify the root cause.

    [2019-08-06 22:33 UTC] On August 7, between 06:00 UTC and 08:00 UTC, we will be performing maintenance for further stabilization of the Identity services in the Oregon region. We have already contacted specific customers who may be affected though no downtime or performance disruption is anticipated.

    [2019-08-07 07:30 UTC] The stabilization maintenance to improve load balancing was completed during the stated window without any issues. Services are performing within normal parameters.

    [2019-08-07 17:21 UTC] The services have been continuing to perform within normal parameters. We are continuing our investigation to identify the root cause.

    [2019-08-08 14:10 UTC] The services have been continuing to perform within normal parameters. We are continuing our investigation to identify the root cause.

    [2019-08-09 13:14 UTC] The services have been continuing to perform within normal parameters. We are continuing our investigation to identify the root cause.

    [2019-08-12 18:09 UTC] The services have been continuing to perform within normal parameters. We are continuing our investigation to identify the root cause.

    [2019-08-14 11:33 UTC] The services have been continuing to perform within normal parameters. We are continuing our investigation to identify the root cause.

    [2019-08-15 11:49 UTC] The services have been continuing to perform within normal parameters. We are continuing our investigation to identify the root cause.

    [2019-08-19 12:19 UTC] The services have been continuing to perform within normal parameters. We are continuing our investigation to identify the root cause.

    [2019-08-22 16:54 UTC] Sharing the root cause analysis (below), which has also been shared with affected customers who opened support cases.

    Root Cause Analysis

    Problem Statement

    About 80% of Identity service calls failed with HTTP 500 errors, interfering with login and token verification for customers in the Oregon region. There were three occurrences of the issue:

    • Sunday, July 28th 16:15 UTC
    • Friday, August 2nd 17:20 UTC
    • Saturday, August 3rd 16:30 UTC

    The issue was automatically detected by our monitoring platform within minutes, enabling our operations team to immediately begin investigation and remediation.

    Reported Behaviors

    Client applications connecting to the Oregon region of Kony Cloud were experiencing issues authenticating application users. From the logs, it was observed that the Identity Services in the Oregon region were having problems connecting to the database.

    Problem Analysis & Fix

    The following actions were taken based on the symptoms observed during the event and on follow-on analysis:

    • The Product Identity team has performed extensive load testing using representative loads for the Oregon region at the time of the issue. Kony has so far been unable to reproduce the issue in a test environment. Under similar loads, Identity Services performed as expected and did not exhibit problems similar to the production issue. However, our work with AWS and our analysis of the logs and data have led to several mitigations, detailed later in this document.
    Mitigation Actions
    Sunday, July 28 2019
    • Identity Servers in Oregon began reporting database connection issues. This could have been internal connection pool ‘crowding’ or true refused connections from the database server.
    • Restarting servers did not provide relief; the connection failures returned within about 5 minutes. This was odd: if it had been a database issue, we would have expected the errors to return immediately.
    • Scaling new servers did not provide relief; new servers showed new connections at the database server. We could see the database server had spare capacity for new connections. New servers soon started reporting errors.
    • Rebooting database server did not provide relief; while all of the connections were reset and the Identity servers reconnected to the database, the Identity servers again started reporting connection issues.
    • If Identity were the cause, we would not expect a restart to immediately show the same symptom. The indication was that database connections were exhausted, but we could see this was not the case.
    • If the database server was the issue, we would expect a reboot of the database server to provide relief. This did not provide relief; before and after the reboot we could see the expected connections.
    • Database server logs showed some random errors on connections to Identity, but nothing close to the connection counts we were seeing on the servers; database logs could not account for the errors seen at the nodes.
    • Failing over to a second datacenter temporarily resolved the issue. This would suggest a hardware or network issue, but we had previously seen that new servers were able to establish new database connections over the network. Later failures on this cluster confirmed that changing network hardware did not provide relief.
    Friday, August 2 2019
    • On Friday, August 2 the issue returned.
    • The database appliance was replicated to a new cluster on the same hardware class, totally replacing the database subsystem. Database server replacement temporarily resolved the issue.
    Saturday, August 3 2019
    • On Saturday, August 3 the issue returned.
    • The network adapters on all identity nodes were customized per AWS recommendation in case there was a network issue that we’ve not been able to detect.
    • The database appliance was replicated to a totally new cluster on a different hardware class, totally replacing the database subsystem and changing the type of network adapters supporting the database.
    • This was the last occurrence of the issue.
    Additional analysis
    • We could see roughly 4,000 successful transactions per minute at all times, which indicated the network was functioning normally.
    • During the peak load we have seen that threads waiting for connections to the database are timing out. Requests that were waiting to get a connection during this period were starved as the connection pool could not service the total number of requests within the max wait period.
    • Database connections are typically sub-second, so a timeout waiting on a connection from the pool indicates connections are being held too long and not being returned to the pool or there is not enough capacity per unit time to service all inbound requests.
    • The key symptom of the outage was that threads were unable to obtain connections to the database from the pool, implying the internal database connection pool was at service capacity (illustrated in the sketch after this list).
    • Load seemed to increase after the initial failures, as applications retried API calls, adding to the overall stress on the cluster.
    • We have not been able to identify the code path that is contributing to holding database locks longer than ‘normal’ and impacting the capacity of the connection pool.
    • The Product Identity team has analyzed the server logs and the thread and heap dumps from this period.
    • The Product Identity team is continuing its attempts to reproduce the issue.
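
    For readers unfamiliar with this failure mode, the sketch below reproduces the symptom with a deliberately tiny pool. It uses HikariCP purely as an example pool manager and a placeholder JDBC URL; it is not the pool manager or configuration that Identity services actually use. Once every pooled connection is held longer than the configured maximum wait, callers receive timeout errors even though the database itself still has spare capacity.

        import java.sql.Connection;
        import java.sql.SQLException;
        import com.zaxxer.hikari.HikariConfig;
        import com.zaxxer.hikari.HikariDataSource;

        // Illustration only: when every connection in a bounded pool is held by
        // long-running work, other callers are starved even though the database
        // server itself could accept more connections.
        public class PoolExhaustionDemo {
            public static void main(String[] args) throws Exception {
                HikariConfig config = new HikariConfig();
                config.setJdbcUrl("jdbc:mysql://db.example.internal:3306/identity"); // placeholder
                config.setUsername("identity");
                config.setPassword("secret");
                config.setMaximumPoolSize(2);        // tiny pool to make the effect obvious
                config.setConnectionTimeout(5_000);  // the "max wait period" described above

                try (HikariDataSource ds = new HikariDataSource(config)) {
                    Connection held1 = ds.getConnection();   // long-running transaction #1
                    Connection held2 = ds.getConnection();   // long-running transaction #2

                    try (Connection c = ds.getConnection()) {
                        System.out.println("got a connection");   // not reached while both are held
                    } catch (SQLException starved) {
                        // After roughly 5 seconds the pool gives up, even though the DB has capacity.
                        System.err.println("Request starved: " + starved.getMessage());
                    }

                    held1.close();
                    held2.close();
                }
            }
        }
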
    Additional Mitigation Actions
    Wednesday, August 7 2019
    • Given that the analysis so far was inconclusive, that the Product Identity team had not been able to reproduce the issue, and that previous mitigations were not all successful, we took additional mitigation actions. We split off high-traffic customers in order to cut overall traffic to each cluster, in case high load was the primary trigger.
    • The problem occurred three times over a seven-day period. After the mitigations and infrastructure changes, the issue has not recurred as of this writing, eighteen days in total, indicating the changes have stabilized the environment.
    Solution/Action Plan

    Based on the above analysis, Kony has identified the following action items:

    1. Optimize the flows in the Fabric client, such as caching profile objects, to reduce the burden on Identity and improve Fabric performance. This will lower the overall demands on the Identity system and database.
    2. The detailed review has identified conditions where additional caching can be leveraged to decrease the usage of the database connection pool while increasing performance of certain APIs.
    3. Capture additional data in the case of longer running transactions to get insights into API usage and to identify areas where we might improve in the future.
    4. Code flow analysis is underway to reduce the overall time a database connection is held for each API. In places where database connections were acquired early, acquisition will be delayed until there is a specific need, rather than aggressively obtaining a connection that may not be required or obtaining one earlier than absolutely necessary (a sketch follows this list). This will increase the overall capacity of the connection pool by minimizing the time connections are in use. This work will span multiple hotfix releases.
    5. Reviews are underway to identify any code path where a connection can be released during the execution of functionally unrelated code (cleanup, statistics, etc.). This will help ensure that latency in unrelated functional areas does not reduce the overall capacity of the database connection pool.
    6. Based on performance testing and code reviews related to this issue, optimize database queries to reduce locking and latency.
    7. Upgrade to the latest version of the connection pool manager to pick up maintenance fixes and additional functionality in pool management.
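
    As a rough sketch of the "acquire late, release early" pattern that items 4 and 5 describe (the class, table, and column names below are illustrative only, not the actual Identity code), compare where the connection is obtained and returned relative to the work that does not need the database:

        import java.sql.Connection;
        import java.sql.PreparedStatement;
        import java.sql.ResultSet;
        import java.sql.SQLException;
        import javax.sql.DataSource;

        // Hypothetical service method showing deferred connection acquisition.
        public class ProfileService {
            private final DataSource dataSource;

            public ProfileService(DataSource dataSource) {
                this.dataSource = dataSource;
            }

            public String loadDisplayName(String userId) throws SQLException {
                // Work that needs no database access (validation, cache lookups,
                // token parsing) runs first. Holding a pooled connection here
                // would only shrink the pool's effective capacity.
                String normalizedId = userId.trim().toLowerCase();

                // Acquire the connection only when it is actually needed, and let
                // try-with-resources return it to the pool as soon as the query finishes.
                String sql = "SELECT display_name FROM user_profile WHERE user_id = ?";
                String displayName;
                try (Connection conn = dataSource.getConnection();
                     PreparedStatement stmt = conn.prepareStatement(sql)) {
                    stmt.setString(1, normalizedId);
                    try (ResultSet rs = stmt.executeQuery()) {
                        displayName = rs.next() ? rs.getString(1) : null;
                    }
                }

                // Functionally unrelated follow-up work (statistics, cleanup) happens
                // after the connection is already back in the pool, per item 5.
                return displayName;
            }
        }
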
    Deliverables
    • Item 1 will be part of a Fabric maintenance release, dependent on testing and QA.
    • Items 2, 3, and 4 (milestone 1) are planned to be part of the next hotfix release, tentatively August 26.
    • Items 4 (milestone 2), 5, 6 and 7 will be part of a future release and are being tracked internally as part of our normal maintenance and fix-pack schedule.
    Is the solution provided complete & concrete?

    The above action items are to fix the symptoms identified in the logs/dumps and by the analysis performed by the teams. At this point we cannot conclude if the above solutions represent a permanent fix of the problem and there is risk that the mitigations may not prevent a recurrence.

    What could have been done better?

    Kony has recommended that customers who grow above 50M sessions per year purchase a dedicated Identity Environment. These recommendations have been communicated to customers, but we recognize that it is difficult for customers to plan this type of migration. Kony will work toward converting this recommendation into a requirement at certain traffic levels and will more clearly communicate the risk of using shared Identity for higher-traffic applications. A dedicated Identity Environment provides isolation, which 1) allows tuning for load specific to one customer, 2) prevents high traffic or issues from one customer from impacting other customers, 3) provides additional security by being in a segregated environment, and 4) allows for easier troubleshooting and isolation of any issue.

Management Services Hotfix

Maintenance window: July 29, 2019 00:01 to 04:00
The maintenance window start and end times are local to the region in which your Clouds are hosted. If you are unsure where your Clouds are hosted, you can hover over a Cloud Name in the Manage Clouds page of the Cloud Management Console and the region will be displayed.
Impacted Cloud services:
  • Management

    • Renew MDM Vendor Signing certificate (which would have expired on July 30, 2019)


Impact Level: minor

No downtime is expected for the impacted Cloud services while this maintenance is being performed. The scheduled maintenance is designed to mitigate disruptions to service availability and performance for the impacted Cloud services. However, it is possible for the impacted Cloud services to be unavailable and/or performance degraded for a short period of time during the maintenance window. Note that no changes are being applied for other Cloud services outside of the list of impacted services above and no service availability or performance disruption is expected for other Cloud services.

⚠️ It is worth noting that Management services MDM, MAM, and MCM will be reaching end of life on September 30, 2019. Moving forward, Kony will be focusing on Enterprise App Store (EAS). If you are currently using MDM, MAM, or MCM features, please review our end of life announcement to understand how this will impact your services and support.