The root cause of yesterday's problem turned out to be a cache failure combined with a code bug, which together delayed the processing of emails. It took us roughly an hour to fix the errors. During that time, a substantial backlog built up in the delivery queue, and it took 2 hours to clear.
We had a detailed internal discussion about our recent failures. Although there were a multitude of reasons, the failures fell into two major categories:
1. Update Related
2. Load Related
1. Update Related: Our application is constantly being changed and upgraded. We test extensively before releasing an upgrade to our production environment. Still, unforeseen issues that do not surface during testing creep into the system. We keep learning from them and improving both our testing and our automation environments.
Action Plan:
To keep an issue from impacting all users at once, we plan to do a 'Staged Upgrade', in which a new version is first released only to our own organization. Once we believe that the new version is stable, we will release it to all users.
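The idea behind a staged upgrade can be sketched in a few lines. This is only an illustration under assumed names (`Rollout`, `INTERNAL_ORG`, and the version strings are hypothetical), not the actual release mechanism: users from the internal organization are served the candidate version, and everyone else stays on the stable version until the candidate is marked stable.

```python
from dataclasses import dataclass

# Hypothetical identifier for the provider's own organization (an assumption).
INTERNAL_ORG = "our-organization"


@dataclass
class Rollout:
    """Tracks which version each group of users should be served."""
    stable_version: str
    candidate_version: str
    # Flipped by hand once the candidate is judged stable internally.
    candidate_is_stable: bool = False

    def version_for(self, user_org: str) -> str:
        """Return the version to serve to a user from `user_org`."""
        if self.candidate_is_stable or user_org == INTERNAL_ORG:
            return self.candidate_version
        return self.stable_version


rollout = Rollout(stable_version="v1", candidate_version="v2")
```

In this sketch, a problem in the candidate version would be seen only by internal users, and releasing to everyone is a single flag change rather than a second deployment.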
2. Load Related: As an email service, we face a variety of load-related problems caused by abuse of our system, such as spam and attacks, and legitimate users are impacted as a result. We have an automated system to prevent such abuse, but new patterns of attack still emerge. Here too, we keep learning and improving the system to protect our service from such attacks.
Action Plan:
We will be categorizing users as Personal or Business users. This will ensure that any impact is limited to a subset of users rather than all of them. Once this categorization is in place, we also plan to provide customized POP/IMAP/SMTP configurations based on usage patterns.
We are confident that these measures will help us handle unforeseen issues much better and will reduce both the occurrence and the impact of such incidents.
If you have any suggestions that we can use, please feel free to share them with us here.
We extend our gratitude to the users who have stayed with us through these incidents.