
Cloud Management console and Identity services - severe performance degradation

Incident window: October 28, 2019 17:01 to 17:56 UTC

Impacted Cloud services: Cloud management console (https://manage.kony.com) and Identity services

Impact level: high

We have detected significant performance degradation for the impacted Cloud services, resulting in increased response latency and error rates. We are working to restore service performance and availability to normal operating levels.

[2019-10-28 18:15 UTC] Resolved. We have restored performance and availability for the Cloud management console (https://manage.kony.com) and Identity services as of 17:56 UTC.

[2019-10-28 21:24 UTC] We would like to share additional information about the root cause of today's incident, the measures we are taking to reduce the likelihood of recurrence, and the steps we are taking to respond to and resolve similar events more quickly.

In preparation for database certificate rotations, a new certificate bundle was loaded into our staging area. The automation replicated that bundle as intended, and the systems were not expected to activate it until a coordinated restart at a later time.
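
For illustration, a two-phase rollout of this kind separates staging the bundle from activating it. The sketch below shows that separation; the paths, function names, and reload step are assumptions made for the example, not our actual tooling.

    import shutil
    from pathlib import Path

    # Assumed locations for the example; not our real directory layout.
    STAGING_DIR = Path("/etc/identity/certs/staging")
    ACTIVE_DIR = Path("/etc/identity/certs/active")

    def stage_bundle(new_bundle: Path) -> Path:
        """Copy the new certificate bundle into the staging area only.

        The running Identity service keeps using ACTIVE_DIR; nothing in
        this step should trigger a reload."""
        STAGING_DIR.mkdir(parents=True, exist_ok=True)
        staged = STAGING_DIR / new_bundle.name
        shutil.copy2(new_bundle, staged)
        return staged

    def activate_bundle(staged: Path) -> None:
        """Promote the staged bundle during the planned, coordinated restart.

        In the incident, this activation effectively happened early: the
        services picked up the staged bundle before the coordinated
        restart was scheduled."""
        ACTIVE_DIR.mkdir(parents=True, exist_ok=True)
        shutil.copy2(staged, ACTIVE_DIR / staged.name)
        # A service reload or restart would normally follow here, on schedule.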

Unfortunately, the Identity systems reloaded the certificate bundle prematurely. This caused database connection errors, and because those errors quickly exceeded the connection-failure threshold, all of the existing Identity servers were locked out. That threshold exists to protect the database from attacks such as repeated connection attempts from a single IP guessing passwords.
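
Conceptually, the lockout behaves like a per-client connection-error counter. The following simplified model illustrates that kind of threshold; the limit and data structures are assumptions for the example, not our database's actual configuration.

    from collections import defaultdict

    # Assumed threshold; the real value in the database configuration differs.
    CONNECT_ERROR_THRESHOLD = 100

    _error_counts: dict[str, int] = defaultdict(int)
    _blocked_ips: set[str] = set()

    def record_connect_error(client_ip: str) -> None:
        """Count a failed connection attempt and block the IP once the
        threshold is exceeded. Servers retrying against a bad certificate
        bundle can cross this limit within seconds."""
        _error_counts[client_ip] += 1
        if _error_counts[client_ip] > CONNECT_ERROR_THRESHOLD:
            _blocked_ips.add(client_ip)

    def is_blocked(client_ip: str) -> bool:
        """Blocked IPs are refused outright, even after the client is fixed,
        until an operator resets the counter from a non-blocked host."""
        return client_ip in _blocked_ips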

We corrected the issue with the assets, and the automation propagated the updates to the systems. However, because the existing servers' IPs had already been blacklisted, the updates had no effect: those servers still could not reach the database. Similarly, we were unable to connect and manually reset the database from any of the authorized instances, since every one of those instances also had a blacklisted IP.

Once we determined that the automation had already pulled the new bundle to all of the instances, and that there was no quick way to connect to the database from any existing server, the fastest solution was to replace the entire fleet of Identity servers in a way that guaranteed none of the new servers would receive one of the now-blacklisted IPs.
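
The essential constraint on the replacement fleet was that no new server could come up on a blocked address. The sketch below illustrates that constraint with a hypothetical subnet and helper name; it is not the actual provisioning code, which launched the replacement fleet into fresh address space.

    import ipaddress
    import random

    def pick_clean_address(subnet: str, blacklisted: set[str]) -> str:
        """Choose an address for a replacement Identity server that is
        guaranteed not to be one of the blacklisted IPs."""
        candidates = [str(ip) for ip in ipaddress.ip_network(subnet).hosts()
                      if str(ip) not in blacklisted]
        if not candidates:
            raise RuntimeError(f"no unblocked addresses left in {subnet}")
        return random.choice(candidates)

    # Example: the old fleet's addresses are all blocked by the database.
    old_fleet = {"10.0.1.11", "10.0.1.12", "10.0.1.13"}
    new_ip = pick_clean_address("10.0.2.0/24", old_fleet)
    print(f"launching replacement Identity server at {new_ip}")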

The new servers were built with the correct assets, and services were restored as these systems came online.

We have identified the following improvement actions, which we will pursue both to reduce the likelihood of this type of issue recurring and to improve our incident response procedures so that similar issues can be resolved more rapidly:

  1. Review the ‘attack’ threshold to ensure it is not overly aggressive.
  2. Investigate why the Identity services activated the new asset without the expected coordinated restart.
  3. Investigate creating additional, highly restricted management instances that could provide access in cases like this, reducing the time needed to gain access for manual intervention.