We have encountered an unprecedented issue in our messaging servers. The details of the issue are as follows:
Impacted users : About one quarter of users in the US DC (users who access Cliq services through cliq.zoho.com)
Issue detection time : 4:22 AM PDT [4:52 pm IST], 19 July 2023
Issue : Unresponsive UI, inaccessibility of chats
Cause : Slow responsiveness in two of our messaging servers due to an unforeseen surge in requests.
We are currently working on resolving it with high priority. We will provide an update here in 30 mins. We apologize for any inconvenience caused.
Update 1 [8:39 am PDT, 19th July]: We have identified specific functions within the system that were causing issues in sending messages. To ensure proper functionality, we have implemented restrictions on these functions, and we have observed improvements in message delivery as a result. We will continue monitoring the situation and provide you with an update in the next 60 minutes. Thank you for your patience.
Update 2 [10:31 am PDT, 19th July]: We are pleased to inform you that the issue with our servers has been successfully resolved, and the Zoho Cliq services are now operating smoothly for all users. Our team is actively monitoring the system to ensure the absence of any further issues.
We sincerely apologize for any inconvenience this may have caused and extend our gratitude for your patience while we resolved the matter.
Update 3 [2:53 am PDT, 20th July]: The issue recurred on one of our servers and persisted from 11:41 pm PDT, 19th July, to 12:14 am PDT, 20th July. Our server monitoring team addressed the problem swiftly, and services have now been fully restored to normal operation.
Update 4 [5:17 am PDT, 20th July]: We regret to inform you that we experienced a similar issue with one of our servers, which occurred between 4:02 am PDT, 20th July, and 5:01 am PDT, 20th July. The good news is that our dedicated team is actively working on resolving this matter, and as a result, Cliq services should now be stable for most users. However, we want to be transparent about the situation, and some users may still encounter partial slowness in message processing. Rest assured, our team is diligently addressing this remaining concern, and we will provide you with another update on the progress within the next hour.
Update 5 [6:59 am PDT, 20th July]: Our team has been diligently working on a patch to address the issue with our servers. As a result, Cliq services are in the process of being restored to normal functionality, and we expect everything to be up and running smoothly very soon. We sincerely apologize for any inconvenience this may have caused and appreciate your understanding and patience during this time.
Update 6 [7:19 am PDT, 20th July]: We are pleased to inform you that our team has successfully resolved the issue with our servers. For the past 20 minutes (starting from 7:00 am PDT, 20th July), our systems have remained stable. We are actively monitoring the servers to ensure that this incident doesn't recur. As of now, Cliq services are expected to work normally for all users. You can proceed with your tasks and collaborations without any concerns.
Analysis report:
On July 19, 2023, at 3:49 AM PDT, there was a problem in our network which caused a drop in live connections between clients and servers. As a result, our clients were creating new connections, whereas the backend servers were trying to repair the existing connections. This caused a surge in requests on two of our servers, which resulted in partial downtime.
During the partial downtime, the following functionalities would have been affected:
Chat related functionalities
Propagation of user status
Notifications
To fix this issue temporarily, some of the internal functionalities were throttled. This allowed us to control the surge of connections and process the load.
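For readers who want a concrete picture of what throttling a functionality can look like, here is a minimal sketch of one common technique, a token bucket. It is illustrative only and should not be read as the mechanism actually used on the Cliq servers; the class name `TokenBucket` and the `rate`, `capacity`, and `clock` parameters are hypothetical names chosen for this example.

```python
import time


class TokenBucket:
    """Allow roughly `rate` operations per second, with bursts up to `capacity`.

    Hypothetical illustration of throttling; not taken from any Cliq code.
    """

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum tokens that can accumulate
        self.clock = clock          # injectable clock, handy for testing
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        """Return True if one operation may proceed now, else False."""
        now = self.clock()
        # Refill in proportion to elapsed time, never exceeding capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A server using this kind of limiter would call `allow()` before running a throttled function and defer or reject the request when it returns `False`, which keeps the accepted load bounded while a backlog is processed.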
Countermeasures:
To prevent this from happening again, the logic for repairing failed connections has been modified, so that no surges should be observed in any of our servers. We hope to push it to production as soon as possible.
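The modified repair logic itself is not described here. As a general illustration of how reconnection surges are often avoided, the sketch below spreads retries out using exponential backoff with random jitter, so that many connections dropped at the same moment do not all retry together. This is an assumed, generic technique, not a description of the change made in Cliq; `reconnect_delay` and its parameters are hypothetical.

```python
import random


def reconnect_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Return a randomized wait in seconds before retry number `attempt` (0-based).

    Hypothetical helper: the upper bound doubles with each attempt up to `cap`,
    and the actual delay is drawn uniformly between 0 and that bound.
    """
    # Clamp the exponent so very large attempt counts cannot overflow.
    ceiling = min(cap, base * (2 ** min(attempt, 30)))
    return rng() * ceiling
```

With the defaults, the upper bound on the wait grows as 1, 2, 4, 8 seconds and so on until it reaches 60 seconds, and the random factor staggers retries that would otherwise arrive at the same instant.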
We are also working on segregated processing of functionalities, so that heavy load on one functionality does not affect the availability of the others.
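One general way to segregate processing is to give each functionality its own fixed pool of capacity, so that a flood in one area cannot use up the capacity reserved for another (sometimes called the bulkhead pattern). The sketch below is a simplified, assumed illustration of that idea and not the design being built for Cliq; the `Bulkhead` class and the functionality names passed to it are hypothetical.

```python
import threading


class Bulkhead:
    """Give each named functionality its own limit on concurrent work.

    Hypothetical illustration; names and limits are made up for the example.
    """

    def __init__(self, limits):
        # One independent semaphore per functionality, e.g. chat, status.
        self._slots = {
            name: threading.BoundedSemaphore(n) for name, n in limits.items()
        }

    def try_run(self, name, fn):
        """Run fn() if `name` has a free slot; return (ran, result)."""
        slot = self._slots[name]
        if not slot.acquire(blocking=False):
            # This functionality is saturated; others are unaffected.
            return False, None
        try:
            return True, fn()
        finally:
            slot.release()
```

For example, with `Bulkhead({"chat": 100, "status": 20, "notifications": 20})`, a burst of chat work that fills all 100 chat slots is rejected or deferred at the chat boundary, while status and notification work still find their own slots free.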
There was another issue with our servers, which our team resolved promptly. We deeply regret any inconvenience you may have experienced. As this is a separate issue, we will be posting an updated analysis about it here: https://help.zoho.com/portal/en/community/topic/issue-in-zoho-cliq.
Our monitoring systems detected an instance of server slowness that occurred from 22:11 to 22:17 PDT on July 31st. This situation had a noticeable impact on Cliq's performance, particularly in terms of opening conversations and accessing message history.
We responded swiftly by implementing an alternate solution, which has helped stabilize our service. Rest assured, we are actively monitoring the situation to ensure that the issue does not recur. We apologize for the inconvenience.
Thank you for your understanding, and please don't hesitate to reach out if you have any further questions or concerns.