

Reporting Services delayed processing of analytics

Incident window: April 05, 2022 19:00 UTC to May 29, 2022 23:15 UTC
Customers who need an estimate of the dates on which data may have been lost should open a support ticket and ask that the Cloud Operations team provide one. The team can provide only an estimate of the affected date range and the potential number of messages lost.

Impacted Cloud services:

Impact Level: medium

The processing of analytics data was impacted by a customer application error, in combination with traffic from several DDoS attacks on specific customers in April. (There were no breaches from any DDoS activity.) We did not notice the growing analytics backlog until alerts fired. However, those alerts were configured in 2016 as a percentage of maximum assumed capacity, and that capacity is several times normal requirements. Once alerted, we determined that the volume of analytics produced by the DDoS requests, combined with growth of Kony Cloud and increases in customer device subscriptions since 2016, meant we could not process the backlog fast enough to prevent data loss. We have identified a bottleneck in our analytics processing and have tasked engineering with providing an architectural solution.

The data loss will vary by customer; reports may show reduced or missing data between April 5th and May 5th, with the most likely losses occurring before April 24th. Analytics from the backlog are still being processed, and we expect recovery efforts to last until May 23rd. We would note that this is the first loss of analytics data of any kind since the Kony Cloud opened, and internally we are devastated by this occurrence.

[2022-04-17 13:20 UTC] We have stopped the data loss and are working on processing the backlog of analytics data. We estimate 10B data records are queued for processing.

[2022-04-24 18:14 UTC] We have increased capacity, but there is an architectural limit in the analytics design that gates the maximum processing throughput. We are currently processing 15M records per hour. We are working to identify and provision additional processing capacity in strategic areas.
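For a rough sense of scale, the figures above give the following back-of-the-envelope estimate. It ignores newly arriving records and assumes a constant processing rate, so it is a lower bound on the remaining time:

```python
# Rough ETA for the analytics backlog, using the figures from the
# April 17 and April 24 updates. New inbound records are ignored,
# so the real completion date will be later.
backlog_records = 10_000_000_000   # ~10B records queued (Apr 17 estimate)
rate_per_hour = 15_000_000         # ~15M records/hour (Apr 24 throughput)

hours = backlog_records / rate_per_hour
print(f"{hours:.0f} hours, about {hours / 24:.0f} days")   # ~667 hours, about 28 days
```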

[2022-05-03 16:20 UTC] We have added additional capacity and are now processing 18M records per hour. We are continuing to process the backlog of analytics data (the oldest unprocessed data is approximately 10 days old). At the current file count and processing rate, we estimate that the backlog (along with processing of current inbound analytics data) will not complete until sometime after May 22nd. We are continuing to monitor and will provide updates until the backlog has been fully processed.

[2022-05-12 13:40 UTC] We use an at-least-once delivery queue for analytics records, but the order of delivery is not sequential. With multiple readers processing the data we can, and do, get duplicate records. Because of the extreme number of readers in place to deal with the backlog of records, the number of duplicates being processed will also increase. Because scrubbing the data warehouse of duplicates must wait until the backlog is cleared, customers may see increased counts in analytics reports for data after Mar 25th. We will not be able to remove duplicates from the database until after we have cleared the backlog, now estimated at May 25th.
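As an illustration of why at-least-once delivery with many parallel readers produces duplicates, and why a deduplication pass is needed, here is a minimal sketch; the names are illustrative stand-ins, not our actual pipeline code:

```python
# Minimal sketch of at-least-once delivery: the queue may redeliver a record
# that another reader already handled, so ingestion must either tolerate
# duplicates or deduplicate on a stable message ID.
rows = []          # stand-in for the data warehouse table
seen_ids = set()   # in practice this would be a shared, persistent dedup store

def ingest(record):
    """Insert a record unless this message ID was already ingested."""
    if record["id"] in seen_ids:   # redelivered, or already handled by another reader
        return                     # skipping it avoids inflating report counts
    rows.append(record)
    seen_ids.add(record["id"])

# The same message delivered twice (at-least-once semantics):
ingest({"id": "msg-1", "app": "demo", "event": "launch"})
ingest({"id": "msg-1", "app": "demo", "event": "launch"})
print(len(rows))   # 1 -- without the seen_ids check it would be 2
```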

[2022-05-17 13:35 UTC] We have collected all outstanding messages from all accounts into our main processing queue. Collection of data from customer accounts is now running with the normal backlog, which is about a 10-minute latency in the customer accounts. We have ingested approximately 25% of all outstanding collected data. At the current ingestion rate, barring any new issues, we are still projecting to have all data ingested on or about May 25th.

[2022-05-21 14:15 UTC] We have now processed more than half of the analytics backlog. Older messages are not guaranteed to be processed first due to the way the underlying message queuing system works. This is not normally an issue because we can read and process all of the messages in one pass. However, with such a large backlog, we are seeing the age of the oldest collected message grow. In order to prevent the expiration of collected messages, we are pausing message collection as of May 21 00:00 UTC. We will complete processing the currently collected messages. There will be a slight impact to the completion estimate as a tradeoff for processing all of the current message backlog. We believe we can process the collected analytics in 72 hours. We will update the full completion estimate once the backlog is processed.
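The tradeoff above can be expressed as a simple guard: pause collection once the oldest unprocessed message gets close to the queue's retention limit. This is only a rough illustration; the retention window, safety margin, and names are assumptions, not our actual tooling:

```python
# Illustrative guard for the decision described above: if the oldest
# unprocessed message is approaching the queue's retention limit, pause
# collection so already-collected messages are not lost to expiration.
from datetime import timedelta

RETENTION = timedelta(days=14)        # assumed queue retention window
SAFETY_MARGIN = timedelta(days=3)     # pause well before expiration

def should_pause_collection(oldest_message_age: timedelta) -> bool:
    return oldest_message_age >= RETENTION - SAFETY_MARGIN

print(should_pause_collection(timedelta(days=12)))  # True: pause and drain the backlog
print(should_pause_collection(timedelta(days=2)))   # False: keep collecting
```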

[2022-05-23 12:56 UTC] We have ingested all data into the warehouse up to Saturday, May 21 12:00 UTC. We are currently processing the remaining data and expect to have all data collected and ingested today. Once this current activity is complete, we will pause collection and processing in order to clean up duplicates from the database. We expect the duplicate cleanup to take approximately 12 hours. Currently we expect to have all of the data processed and systems back online by EOD May 25th.

[2022-05-24 15:41 UTC] We have enabled automatic collection of all customer analytics data and have a current backlog of 44M messages in our queue. We expect these will clear in a few hours. Once the systems are fully caught up, we will decide on the timing of the maintenance window required to clean up duplicate records. During the maintenance window we will pause collection again, but customers will be able to access the reporting services in the consoles.

[2022-05-26 17:45 UTC] We have ingested all analytics from the backlog, and the system has been running normally for the last 36 hours. Due to the quantity of duplicate records in the dataset, we project the cleanup to take approximately 12 to 18 hours. We are scheduling the cleanup for Saturday, May 28th. Reporting services will be available with momentary outages as we resize the data warehouse systems to handle the scale of this cleanup. Any outage is expected to last less than 5 minutes while the systems fail over.

[2022-05-28 13:00 UTC] We are pausing processing of analytics while the data warehouse contents are processed for duplicate records. Reports against the data will be available during the process, but there will be database restarts during the procedure. Any outage should last only a few moments. If you see an error, please wait about five minutes and rerun your report.
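If a report does fail during one of these restarts, a simple wait-and-retry is sufficient. A minimal sketch follows; run_report is a hypothetical placeholder for however you normally invoke your report, not a Kony API:

```python
# Simple wait-and-retry for reports that fail during a database restart.
# `run_report` is a hypothetical placeholder for your normal report call.
import time

def run_report_with_retry(run_report, attempts=3, wait_seconds=300):
    """Retry the report a few times, waiting ~5 minutes between attempts."""
    for attempt in range(1, attempts + 1):
        try:
            return run_report()
        except Exception:
            if attempt == attempts:
                raise                    # give up after the last attempt
            time.sleep(wait_seconds)     # wait about five minutes, then retry
```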

[2022-05-29 05:00 UTC] We have completed the maintenance to remove duplicate records. We will begin processing data from May 28 13:00 UTC and expect to be caught up by early Sunday morning US EST.

[2022-05-29 23:15 UTC] Resolved. The maintenance has completed and the analytics systems are running normally.