ProtonBlog(new window)

Update about the service interruption on Feb. 1, 2021

Share this page

Proton’s servers experienced an outage on Monday that interrupted service for users for about one hour. In line with our policy of full transparency and disclosure, we are publishing the results of our internal investigation into the incident.

The outage began around 7:30 AM CET and impacted all Proton services for approximately one hour. No user data was lost, and this was not the result of any attack or breach. We have identified and corrected the problems that led to the downtime. Moreover, our work over these last few days will inform a number of new tests and safeguards that will make Proton more resistant to problems, even under abnormal conditions.

We understand that even a few minutes of downtime is disruptive to our community. We will be taking steps to ensure our systems are even more resilient in the future. This article details some of those steps, as well as a timeline of the issues.

Timeline

At 7:22 AM CET on Monday, we switched our API services into offline mode to perform a scheduled maintenance to upgrade some of our database hardware, which was successful. The intervention should have taken less than five minutes.

However, four minutes later, at 7:26 AM, we noticed spikes in system CPU and memory utilization across our API server instances, which almost instantly brought all applications to a halt.

As most of the server fleet became unresponsive due to too-high CPU and memory usage, we were not able to recover quickly. First we attempted to stop traffic from reaching the impacted servers so they could “cool down.” However, that was ultimately ineffective; even with no web traffic reaching the servers, we could not connect to them. The servers could only be recovered via a hard reboot triggered from out-of-band-management. For security reasons, this can only be done manually by a small number of engineers, which means it is not a fast process. 

When the servers came back up, some of them also had improper configurations because some of them had gone down before the config changes we were pushing out had all been applied. This required an additional server-by-server check before we could bring the server fleet back online. In total, we needed approximately one hour to restore all services. 

Causes

Each server has a local cache of certain critical configuration data. This data expires in a time period that varies between three and four minutes, and each server then pulls an updated copy. While cache expiration times are normally staggered, setting all services to offline mode had the unfortunate side effect of approximately synchronizing them, as part of this operation resets this cache.

This configuration cache normally should not be needed at all during offline mode. However, there was a bug in our application code where authenticated requests (those of logged-in users, the vast majority of requests) erroneously required this data, even when offline.

What this meant was that offline mode turned into a ticking time bomb. After about three minutes, on every server, the configuration cache expired, triggering requests to a database which at this point was still unreachable. This meant that requests that usually took milliseconds to handle, now required up to 10 seconds as they waited for a timeout. The spike in response times caused API requests to pile up, triggering a massive memory and CPU usage spike which rendered all affected servers unresponsive in a matter of seconds.


We do have some protections in place against “thundering herd” problems such as this, but for reasons which are still under investigation, they were not sufficiently effective to prevent the problem.

To recap, here is the sequence of events and errors:

  1. Each server manages its own cache of configuration data which expires at semi-regular intervals. This data should not be required in offline mode.
  2. Because of a bug, authenticated requests to the servers (the vast majority of requests) triggered dependence on this cached data.
  3. The cached data expired three or four minutes after offline mode was entered, which triggered many requests to an offline database.
  4. This flood of requests resulted in memory and CPU usage spikes on the API servers, which rendered them unresponsive.

Fixes

Our first priority was to fix the two code issues that caused the immediate problem. There is no longer database dependence in offline mode for authenticated requests, and the configuration cache is being reworked to continue to use the “stale” data in the event of a refresh failure rather than clearing it entirely. This will fully prevent this particular issue from recurring in this manner.

We also continue our efforts to fully simulate the failure so that we can build in additional safeguards at different levels of our technology stack. This in-depth defense is essential for ensuring that one bug cannot create a fatal cascade of errors.

This outage has also revealed the need to be better when it comes to thoroughly stress testing our infrastructure and code in abnormal conditions, such as when various dependent services are offline. Offline mode tests, for instance, must include various types of requests, including authenticated ones, and should have gone on for longer to trigger the cache expiration issue. It also highlights the need for more regular testing. While offline mode has been tested in the past, between the last test and the incident on Feb. 1, Proton’s traffic has grown so dramatically that more tests should have been conducted even though none of the software involved has been changed. 

We are also changing our standard operating procedures for maintenance operations to ensure that more engineers are on-call in case something unexpected happens. One of the reasons for the longer recovery time during this incident was the time required to call additional engineers to accelerate the recovery.

This was a highly unusual downtime and stressful both for our team and our users. We apologize for this inconvenience, and we’re grateful for your trust as we work to provide the highest levels of privacy, security, and reliability to the Proton community.

Best regards,
The Proton Team

***

Feel free to share your feedback and questions with us via our official social media channels on Twitter(new window) and Reddit(new window).

Protect your privacy with Proton
Create a free account

Share this page

Proton Team(new window)

We are scientists, engineers, and specialists from around the world drawn together by a shared vision of protecting freedom and privacy online. Proton was born out of a desire to build an internet that puts people before profits, and we're working to create a world where everyone is in control of their digital lives.

Related articles

What was your first pet’s name? In what city were you born?  We’ve all had to answer these questions to reset a long-forgotten password, but consider how that works. Much of this information is easy to find for others (or easily forgotten by you), m
In the early days when Proton started, we often received a question along the lines of “I love the product and what Proton stands for, but how do I know you will still be around to protect my data 10 years from now?”  Ten years and 100 million accou
Credential stuffing is a popular type of cyberattack where attackers take login credentials and use them on thousands of websites, hoping to fraudulently gain access to people’s accounts. It’s an effective attack, but fortunately, one that’s easy to
With Skiff abruptly shutting down operations, many people are on the lookout for alternatives that don’t compromise on privacy — and won’t suddenly disappear. People were attracted to Skiff because it promised privacy, no ads, end-to-end encryption,
Skiff is dead. On Feb. 9, the email company Skiff announced it was being bought by Notion. Many Skiff customers have been shocked by this news, as their inboxes have been sold out from under them. Skiff gave people six months to export their data be
Looking into the Dropbox privacy policy
Dropbox was the first mainstream cloud storage provider, and still the biggest player on the market, with 700 million users in 2022. We took a dive into Dropbox’s privacy policy to see how well the company protects the personal data of those millions