Incident Summary
One of the nodes in the EU DC for Zoho Desk became overloaded and could not handle the heavy traffic, slowing down requests and causing a partial outage for customers whose data resides on that node.
On May 2 at 09:26 AM CEST, our team identified the incident. Our engineers determined the root cause and began addressing it by adjusting our system configurations. In parallel, we started transferring high-traffic organizations to other nodes to minimize the impact on the affected node. Although a second round of adjustments mitigated the issue and enabled the portals to load and work by 03:19 PM CEST on May 2, it did not completely resolve it.
After completing the scheduled movement of top-traffic-generating organizations, we deployed a bug fix to prevent the system from holding connections for extended periods. Stability was fully restored on May 3 at 02:40 PM CEST. The incident and its history were also recorded on the service availability page (status.zoho.eu).
Technical Breakdown
Our system was experiencing slowness, and a significant number of users had difficulty connecting to Desk. This directly impacted the availability and performance of our service.
Further analysis of our monitoring system and log data confirmed that the high traffic was legitimate and not a DDoS attack. The incident was solely a load-surge issue; the partial outage caused no data loss or other data impact. Our engineering team optimized our system configuration to handle the traffic, but this did not completely resolve the issue. In addition, we moved the top traffic-generating organizations off the impacted node to reduce its load, which brought some improvement.
Furthermore, we identified a code bug that held connections open for extended periods of time and quickly deployed a live build to rectify the issue.
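The bug described above is the classic connection-leak pattern: work holds a pooled connection far longer than intended, starving the node. As an illustration only (the class, names, and limits below are assumptions, not Zoho Desk's actual code), a pool can enforce and report a maximum hold time:

```python
import threading
import time

# Hypothetical pool wrapper: names and limits are illustrative,
# not Zoho Desk's actual implementation.
class TimedConnectionPool:
    def __init__(self, max_connections=10, max_hold_seconds=30.0):
        self._slots = threading.Semaphore(max_connections)
        self.max_hold_seconds = max_hold_seconds

    def acquire(self):
        """Block until a connection slot is free; return the acquire time."""
        self._slots.acquire()
        return time.monotonic()

    def release(self, acquired_at):
        """Free the slot and report how long the connection was held."""
        held = time.monotonic() - acquired_at
        self._slots.release()
        # Surface long holds so a leak is caught before it starves the pool.
        if held > self.max_hold_seconds:
            print(f"warning: connection held {held:.1f}s "
                  f"(limit {self.max_hold_seconds}s)")
        return held

pool = TimedConnectionPool(max_connections=2, max_hold_seconds=0.05)
t0 = pool.acquire()
time.sleep(0.1)          # simulate the buggy long-running work
held = pool.release(t0)  # exceeds the 0.05 s limit, so a warning is printed
```

In production systems this same idea usually appears as a pool-level leak-detection or hold-timeout setting rather than hand-rolled code.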
Timeline (in CEST)
May 2, 09:26 AM
| Incident identified
|
May 2, 11:40 AM
| Root cause identified - high number of connections
|
May 2, 12:15 PM
| System configurations tuned
|
May 2, 01:51 PM
| Started moving top-traffic generating orgs
|
May 2, 03:54 PM
| Second-level system configuration tuning
|
May 3, 12:12 PM
| Preparation of build to fix code bug began
|
May 3, 12:29 PM
| Movement of top-traffic generating orgs completed
|
May 3, 02:40 PM
| Bug fix build went live, stabilizing the system
|
Future Preventive Measures to Avoid Recurrence of the Issue
- Relocating selected organizations to other nodes to keep the connection count to the affected node at a minimum.
- Monitoring the system connections proactively and re-balancing them as necessary.
- Setting a lower connection threshold so that we receive early notifications when it is breached and can act promptly to avoid customer impact.
- Incorporating a code-check configuration rule to prevent code that holds connections for extended periods of time from being shipped to production.
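The early-warning idea in the measures above can be sketched as a simple check that alerts well below a node's hard connection limit, leaving time to rebalance. The thresholds, limit, and node names here are assumptions for illustration, not Zoho's production values:

```python
# Illustrative early-warning check: thresholds and node names are
# assumptions, not Zoho Desk's production configuration.
WARN_THRESHOLD = 0.70   # alert well before the hard connection limit
HARD_LIMIT = 1000       # assumed maximum connections a node can serve

def check_node(node, active_connections, hard_limit=HARD_LIMIT):
    """Return an alert string when usage crosses the lower threshold."""
    usage = active_connections / hard_limit
    if usage >= WARN_THRESHOLD:
        return (f"ALERT {node}: {active_connections}/{hard_limit} "
                f"connections ({usage:.0%}) - rebalance now")
    return None

print(check_node("eu-node-3", 820))   # fires at 82% of capacity
print(check_node("eu-node-4", 310))   # within normal range, no alert
```

Setting the warning level well under the hard limit is what turns the threshold into an early notification rather than an outage report.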
Regards,
Zoho Desk Team