EU DC Partial Outage Resolved: A Detailed RCA

Incident Summary

One of the nodes in the EU DC for Zoho Desk became overloaded and could not keep up with incoming requests, causing slow responses and a partial outage for customers whose data resides on that node.

The incident was identified by our team on May 2 at 09:26 AM CEST. Our engineers determined the root cause and began addressing it by tuning our system configurations. At the same time, we started transferring high-traffic organizations to other nodes to reduce the load on the affected node. Although these adjustments mitigated the issue and allowed portals to load and work by 03:19 PM CEST on May 2, it was not completely resolved.

After completing the scheduled movement of the top-traffic-generating organizations, we deployed a bug fix to prevent the system from holding connections for extended periods. Stability was fully restored on May 3 at 02:40 PM CEST. The incident and its history were also captured on the service availability page (status.zoho.eu).
 
Technical Breakdown

We identified that our system was experiencing slowness, which left a significant number of users unable to connect to Desk. This directly impacted the availability and performance of our service.
 
Further analysis of our monitoring system and log data confirmed that the high traffic was legitimate and not a DDoS attack. The incident was solely a load-surge issue; the partial outage caused no data loss or data impact. Our engineering team optimized the system configuration to handle the traffic, but this did not completely resolve the issue. In addition, we moved the top-traffic-generating organizations off the impacted node to reduce its load, which brought some improvement.
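The relocation step described above can be sketched as a simple greedy selection: move the heaviest organizations first until the node's projected connection count drops below a safe target. All names and numbers here are illustrative assumptions, not Zoho's actual tooling.

```python
TARGET_CONNECTIONS = 8_000  # assumed safe ceiling for one node

def pick_orgs_to_move(org_connections: dict[str, int]) -> list[str]:
    """Return org IDs to relocate, heaviest traffic first, until the
    node's projected total falls below TARGET_CONNECTIONS."""
    total = sum(org_connections.values())
    to_move = []
    for org, count in sorted(org_connections.items(),
                             key=lambda kv: kv[1], reverse=True):
        if total <= TARGET_CONNECTIONS:
            break
        to_move.append(org)
        total -= count
    return to_move

# Example: one very heavy org dominates the node's load.
load = {"org-a": 6_500, "org-b": 2_500, "org-c": 1_200, "org-d": 300}
print(pick_orgs_to_move(load))  # ['org-a'] → remaining load 4_000
```

Moving the heaviest organizations first minimizes the number of migrations needed, which matters when each migration itself consumes capacity on the affected node.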

Furthermore, we identified a code bug that held connections open for extended periods, and we quickly deployed a live build to rectify it.
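For illustration, this is the general class of defect described above: a connection is checked out of a pool but never returned on the error path, so the pool slowly drains under load. The pool API here is a stand-in, not Zoho's code.

```python
class ConnectionPool:
    """Toy pool used only to demonstrate the leak pattern."""
    def __init__(self, size: int):
        self.available = size

    def acquire(self):
        if self.available == 0:
            raise RuntimeError("pool exhausted")
        self.available -= 1
        return object()  # stand-in for a live connection

    def release(self, conn):
        self.available += 1

# Buggy shape: an exception between acquire() and release() leaks the slot.
def handle_request_buggy(pool, work):
    conn = pool.acquire()
    work(conn)          # if this raises, release() never runs
    pool.release(conn)

# Fixed shape: try/finally guarantees the connection is returned.
def handle_request_fixed(pool, work):
    conn = pool.acquire()
    try:
        work(conn)
    finally:
        pool.release(conn)
```

Under sustained load, even a rare error path in the buggy shape exhausts the pool, which matches the "holding connections for extended periods" symptom.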
 
Timeline (in CEST)
 
May 2, 09:26 AM - Incident identified
May 2, 11:40 AM - Root cause identified: high number of connections
May 2, 12:15 PM - System configurations tuned
May 2, 01:51 PM - Started moving top-traffic-generating organizations
May 2, 03:54 PM - Second-level system configuration tuning
May 3, 12:12 PM - Preparation of the bug-fix build began
May 3, 12:29 PM - Movement of top-traffic-generating organizations completed
May 3, 02:40 PM - Bug-fix build went live, stabilizing the system
 
Future Preventive Measures to Avoid Recurrence of the Issue
  • Relocating selected organizations to other nodes to keep the connection count to the affected node at a minimum.
  • Monitoring the system connections proactively and re-balancing them as necessary.
  • Setting a lower connection threshold so we receive early notifications when it is breached and can act promptly to avoid customer impact.
  • Incorporating a code-check rule to prevent code that holds connections for extended periods from being shipped to production.
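The lower-threshold measure above can be sketched as a simple early-warning check; the threshold values and notification hook are assumptions for illustration only.

```python
WARN_THRESHOLD = 0.70     # assumed: alert well before saturation
MAX_CONNECTIONS = 10_000  # assumed per-node connection capacity

def check_connections(current: int, notify=print) -> bool:
    """Return True (and notify) when usage crosses the early-warning line."""
    usage = current / MAX_CONNECTIONS
    if usage >= WARN_THRESHOLD:
        notify(f"early warning: {current}/{MAX_CONNECTIONS} "
               f"connections ({usage:.0%}) on this node")
        return True
    return False

check_connections(7_200)  # fires at 72%, leaving headroom to act
check_connections(4_000)  # quiet
```

Alerting at a fraction of capacity, rather than at exhaustion, leaves time to rebalance organizations before customers see an impact.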

Regards,
Zoho Desk Team