We sincerely apologize for any inconvenience this may have caused and extend our gratitude for your patience while we resolved the matter.
Update 3 [2:53 am PDT, 20th July]: We encountered a recurring issue with one of our servers, which persisted from 11:41 pm PDT, 19th July, to 12:14 am PDT, 20th July. Our server monitoring team addressed the problem promptly, and services have now been fully restored to normal operation.
Update 4 [5:17 am PDT, 20th July]: A similar issue occurred on one of our servers between 4:02 am PDT and 5:01 am PDT, 20th July. Our team is actively working on it, and Cliq services should now be stable for most users. However, some users may still experience partial slowness in message processing. We are addressing this remaining issue and will provide another update within the next hour.
Update 5 [6:59 am PDT, 20th July]: Our team has been diligently working on a patch to address the issue with our servers. As a result, Cliq services are in the process of being restored to normal functionality, and we expect everything to be up and running smoothly very soon. We sincerely apologize for any inconvenience this may have caused and appreciate your understanding and patience during this time.
Update 6 [7:19 am PDT, 20th July]: We are pleased to inform you that our team has resolved the issue with our servers. Our systems have remained stable for the past 20 minutes (since 7:00 am PDT, 20th July). We are actively monitoring the servers to ensure that this incident does not recur. Cliq services are now expected to work normally for all users, and you can proceed with your tasks and collaborations.
Analysis report:
On July 19, 2023, at 3:49 AM PDT, a problem in our network caused a drop in live connections between clients and servers. As a result, clients were creating new connections while the backend servers were still trying to repair the existing ones. This caused a surge in load on two of our servers, which resulted in partial downtime.
During the partial downtime, the following functionalities would have been affected:
- Chat related functionalities
- Propagation of user status
- Notifications
To fix this issue temporarily, some of the internal functionalities were throttled. This allowed us to control the surge of connections and process the load.
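As an illustration only, throttling of this kind can be sketched with a token bucket, which caps how many operations are admitted per second while still allowing short bursts. This is a generic sketch and not a description of Cliq's actual implementation; the class name TokenBucket and the parameters rate, capacity and clock are hypothetical.

```python
import time


class TokenBucket:
    """Hypothetical throttle: admits about `rate` operations per second,
    with bursts of up to `capacity` operations."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum tokens that can accumulate
        self.clock = clock        # injectable time source (eases testing)
        self.tokens = capacity    # start with a full bucket
        self.last = clock()

    def allow(self):
        """Return True if one operation may proceed now, else False."""
        now = self.clock()
        # Refill for the time elapsed since the last call, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A server could call allow() before accepting a new connection or running a non-critical task, and defer or reject the work when it returns False, so that a sudden surge is spread out over time rather than processed all at once.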
Countermeasures:
- To prevent this from happening again, the logic for repairing failed connections has been modified, so that no surges should be observed in any of our servers. We hope to push it to production as soon as possible.
- We are also working on segregating the processing of different functionalities, to prevent availability issues under heavy load.