As we reported earlier on our blog, we had an incident that caused some emails from a period of over 20 hours to disappear.
Immediately afterwards, we began data recovery, and within a day we had recovered the data and started restoring emails to users' accounts. On Saturday, we finished restoring emails to the last affected accounts.
Our goal is to maintain 100% data availability, and we apologize to those users who weren’t able to access some emails for a couple of days while we worked on the recovery. Needless to say, we have taken a number of steps to avoid a repeat of this problem, and we have strengthened our standard operating procedures (SOP) to include even more safeguards.
The root cause was found to be monit, a Linux service that automatically restarts other services when it detects that they have crashed or are otherwise not running.
In our SOP, the first step for most procedures is to shut down monit. However, when one of our new engineers went to make some changes on Monday, this step was skipped. The database changes required the database server to be shut down for a period of time, and the commands to do this were indeed issued. However, because monit was still running, it automatically restarted the database server without the engineer's knowledge. As a result, changes were made on a running database, leading to data corruption.
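To illustrate the failure mode described above, here is a minimal Python sketch of a pre-flight check that refuses to proceed while a supervisor or the database is still running. It is an illustration only, not the tooling or the SOP described in this post. The function names, the `dbserver` process name, and the check counts and intervals are all hypothetical; the only external command used is the standard `pgrep -x`, which exits with status 0 when a process with that exact name exists.

```python
import subprocess
import time


def is_running(process_name, run=subprocess.run):
    """Return True if a process with exactly this name is currently running."""
    # `pgrep -x NAME` exits with status 0 when at least one process matches.
    result = run(["pgrep", "-x", process_name], capture_output=True)
    return result.returncode == 0


def assert_stopped(process_names, checks=3, interval=5.0,
                   probe=is_running, sleep=time.sleep):
    """Raise RuntimeError if any named process is seen running.

    The check is repeated several times because a supervisor may bring a
    service back up some time after it was shut down.
    """
    for attempt in range(checks):
        running = [name for name in process_names if probe(name)]
        if running:
            raise RuntimeError("still running: " + ", ".join(running))
        if attempt < checks - 1:
            sleep(interval)


if __name__ == "__main__":
    # "dbserver" is a placeholder; substitute the real database process name.
    assert_stopped(["monit", "dbserver"])
    print("Safe to proceed: supervisor and database are both stopped.")
```

Repeating the check matters here: a single check taken right after shutdown could pass, and a supervisor that is still running could then restart the database before the changes begin. For the check to be meaningful, the total window (`checks` times `interval`) would need to be longer than the supervisor's polling interval.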
While it is easy to lay blame on an individual engineer for not following the SOP, there are also organizational deficiencies that allowed this lapse to occur. The team as a whole is under immense time pressure to work quickly and support more users, so shortcuts were tolerated. This was generally OK because the core developers understood the system very well and knew with certainty which steps could be skipped without risk. However, we also inadvertently created an environment for new employees in which the SOP was treated as a guideline rather than as rules that had to be followed to the letter.
To remedy this situation, we have now enacted new rules under which changes to the production systems can only be made with the approval of ALL core developers. Furthermore, SOP shortcuts will no longer be tolerated, regardless of who is making the change.
These changes will inevitably slightly slow down our development and scaling process, but as a group, our core priorities are security and reliability and these must come before all other considerations. We would like to thank everybody (especially those still on the waiting list) for their understanding.