Unresponsiveness in Zoho Cliq

Time of post: 6:56 am PDT, 19 July 2023

Dear Cliq users,

We have encountered an unprecedented issue in our messaging servers. The details of the issue are as follows:

Impacted users : One-fourth of users in the US DC (users who access Cliq services through cliq.zoho.com)
Issue detection time : 4:22 AM PDT [4:52 pm IST], 19 July 2023
Issue : Unresponsive UI, inaccessibility of chats
Cause : Slow responsiveness in two of our messaging servers due to an unforeseen surge in requests.

We are currently working on resolving it with high priority. We will provide an update here in 30 mins. We apologize for any inconvenience caused.

Update 1 [8:39 am PDT, 19th July]: We have identified specific functions within the system that were causing issues in sending messages. To ensure proper functionality, we have implemented restrictions on these functions. We are pleased to report that we have observed improvements in message delivery as a result. We will continue monitoring the situation and provide you with an update in the next 60 minutes. Thank you for your patience.

Update 2 [10:31 am PDT, 19th July]: We are pleased to inform you that the issue with our servers has been successfully resolved, and the Zoho Cliq services are now operating smoothly for all users. Our team is actively monitoring the system to ensure the absence of any further issues.

We sincerely apologize for any inconvenience this may have caused and extend our gratitude for your patience while we resolved the matter.

Update 3 [2:53 am PDT, 20th July]: We encountered a recurring issue with one of our servers, which persisted from 11:41 pm PDT, 19th July, to 12:14 am PDT, 20th July. However, thanks to our diligent server monitoring team, the problem was swiftly addressed, and services have now been fully restored to their normal operation.

Update 4 [5:17 am PDT, 20th July]: We regret to inform you that we experienced a similar issue with one of our servers, which occurred between 4:02 am PDT, 20th July, and 5:01 am PDT, 20th July. The good news is that our dedicated team is actively working on resolving this matter, and as a result, Cliq services should now be stable for most users. However, we want to be transparent about the situation, and some users may still encounter partial slowness in message processing. Rest assured, our team is diligently addressing this remaining concern, and we will provide you with another update on the progress within the next hour.

Update 5 [6:59 am PDT, 20th July]: Our team has been diligently working on a patch to address the issue with our servers. As a result, Cliq services are in the process of being restored to normal functionality, and we expect everything to be up and running smoothly very soon. We sincerely apologize for any inconvenience this may have caused and appreciate your understanding and patience during this time.

Update 6 [7:19 am PDT, 20th July]: We are pleased to inform you that our team has successfully resolved the issue with our servers. For the past 20 minutes (starting from 7:00 am PDT, 20th July), our systems have remained stable. We are actively monitoring the servers to ensure that this incident doesn't recur. As of now, Cliq services are expected to work normally for all users. You can proceed with your tasks and collaborations without any concerns.

Analysis report:

On July 19, 2023, at 3:49 AM PDT, a problem in our network caused a drop in live connections between clients and servers. As a result, our clients were creating new connections, whereas the backend servers were trying to repair the existing connections. This caused a surge in requests on two of our servers, which resulted in partial downtime.

During the partial downtime, the following functionalities would have been affected:

Chat related functionalities
Propagation of user status
Notifications

To fix this issue temporarily, some of the internal functionalities were throttled. This allowed us to control the surge of connections and process the load.
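For readers curious how this kind of throttling can work in general, here is a minimal sketch of a token-bucket limiter. It is an illustration only and not the mechanism actually used in Cliq, which this post does not describe; the class name `TokenBucket` and its parameters `rate`, `capacity`, and `clock` are hypothetical.

```python
import time


class TokenBucket:
    """Admit roughly `rate` operations per second, with bursts up to `capacity`.

    Hypothetical illustration; not Cliq's actual throttling code.
    """

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum tokens the bucket can hold
        self.clock = clock        # injectable clock, useful for testing
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        """Return True if one operation may proceed now, else False."""
        now = self.clock()
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A server could call `allow()` before running a throttled function and reject or defer the request when it returns `False`, which keeps the accepted load bounded while a surge drains.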

Countermeasures:

To prevent this from happening again, the logic for repairing failed connections has been modified, so that no surges should be observed in any of our servers. We hope to push it to production as soon as possible.
We are also working on segregating the processing of different functionalities to prevent availability issues under heavy load.
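The post does not say how the repair logic was changed. As a general illustration, one widely used way to keep many reconnect attempts from arriving at the same moment is exponential backoff with random jitter. The sketch below assumes that technique; the function `reconnect_delay` and its parameters `base`, `cap`, and `rng` are hypothetical and are not taken from Cliq.

```python
import random


def reconnect_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Return a randomized wait in seconds before retry number `attempt` (0-based).

    Hypothetical illustration of exponential backoff with full jitter;
    not Cliq's actual connection-repair logic.
    """
    # The upper bound doubles with each attempt but never exceeds `cap`.
    ceiling = min(cap, base * (2 ** attempt))
    # Picking uniformly in [0, ceiling] spreads retries out over time,
    # so peers that lost their connections together do not retry together.
    return rng.uniform(0.0, ceiling)
```

A caller would sleep for `reconnect_delay(attempt)` seconds before each retry, increasing `attempt` after every failure and resetting it to 0 once a connection succeeds.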

Thank you once again for your understanding.

Best regards,

Poorvik