On May 31, 2012, between 8:43 and 9:37 PDT, Zoho Invoice was unreachable. Because this was a month end, many of our customers were affected, and we deeply apologize for this interruption in service.
Here I am sharing our findings on the root cause of this service interruption and the lessons we have learned.
The key flaw that caused this outage was a database operation that brought down the Zoho Invoice MySQL master database. When the MySQL master server goes down, a MySQL slave takes over so that the service continues to run without interruption. Unfortunately, in this particular episode, a bug surfaced in a piece of code that tried to access the database during this master-slave transition, which prevented the application from resuming smoothly. This meant that even though the original problem of the database hanging resolved itself, the application couldn't start. We restarted the application and the database, and things were back to normal.
While the entire episode could have been over in less than 20 minutes, it took almost an hour before our customers could access Zoho Invoice again. The operations team was unaware that a software bug had been introduced and that it was triggered during the database failover. We have taken several steps to improve our service: aggressively pushing all our services to provide read-only access from the disaster recovery site (Zoho CRM, Mail and Creator already do this), fixing the bug that caused this interruption in the first place, and always erring on the side of caution where service interruptions are concerned.
We apologize again for this outage. We would like to assure all our customers that we take the reliability of Zoho Invoice and all our other services very seriously, and that we have taken serious steps to prevent such outages in the future.
Thank you for your continued trust and confidence in Zoho.
Regards
Prashant