Update about the service interruption on Feb. 1, 2021

Proton Team

Share this page

Proton’s servers experienced an outage on Monday that interrupted service for users for about one hour. In line with our policy of full transparency and disclosure, we are publishing the results of our internal investigation into the incident.

The outage began around 7:30 AM CET and impacted all Proton services for approximately one hour. No user data was lost, and this was not the result of any attack or breach. We have identified and corrected the problems that led to the downtime. Moreover, our work over these last few days will inform a number of new tests and safeguards that will make Proton more resistant to problems, even under abnormal conditions.

We understand that even a few minutes of downtime is disruptive to our community. We will be taking steps to ensure our systems are even more resilient in the future. This article details some of those steps, as well as a timeline of the issues.

Timeline

At 7:22 AM CET on Monday, we switched our API services into offline mode to perform a scheduled maintenance to upgrade some of our database hardware, which was successful. The intervention should have taken less than five minutes.

However, four minutes later, at 7:26 AM, we noticed spikes in system CPU and memory utilization across our API server instances, which almost instantly brought all applications to a halt.

As most of the server fleet became unresponsive due to too-high CPU and memory usage, we were not able to recover quickly. First we attempted to stop traffic from reaching the impacted servers so they could “cool down.” However, that was ultimately ineffective; even with no web traffic reaching the servers, we could not connect to them. The servers could only be recovered via a hard reboot triggered from out-of-band-management. For security reasons, this can only be done manually by a small number of engineers, which means it is not a fast process. 

When the servers came back up, some of them also had improper configurations because some of them had gone down before the config changes we were pushing out had all been applied. This required an additional server-by-server check before we could bring the server fleet back online. In total, we needed approximately one hour to restore all services. 

Causes

Each server has a local cache of certain critical configuration data. This data expires in a time period that varies between three and four minutes, and each server then pulls an updated copy. While cache expiration times are normally staggered, setting all services to offline mode had the unfortunate side effect of approximately synchronizing them, as part of this operation resets this cache.

This configuration cache normally should not be needed at all during offline mode. However, there was a bug in our application code where authenticated requests (those of logged-in users, the vast majority of requests) erroneously required this data, even when offline.

What this meant was that offline mode turned into a ticking time bomb. After about three minutes, on every server, the configuration cache expired, triggering requests to a database which at this point was still unreachable. This meant that requests that usually took milliseconds to handle, now required up to 10 seconds as they waited for a timeout. The spike in response times caused API requests to pile up, triggering a massive memory and CPU usage spike which rendered all affected servers unresponsive in a matter of seconds.


We do have some protections in place against “thundering herd” problems such as this, but for reasons which are still under investigation, they were not sufficiently effective to prevent the problem.

To recap, here is the sequence of events and errors:

  1. Each server manages its own cache of configuration data which expires at semi-regular intervals. This data should not be required in offline mode.
  2. Because of a bug, authenticated requests to the servers (the vast majority of requests) triggered dependence on this cached data.
  3. The cached data expired three or four minutes after offline mode was entered, which triggered many requests to an offline database.
  4. This flood of requests resulted in memory and CPU usage spikes on the API servers, which rendered them unresponsive.

Fixes

Our first priority was to fix the two code issues that caused the immediate problem. There is no longer database dependence in offline mode for authenticated requests, and the configuration cache is being reworked to continue to use the “stale” data in the event of a refresh failure rather than clearing it entirely. This will fully prevent this particular issue from recurring in this manner.

We also continue our efforts to fully simulate the failure so that we can build in additional safeguards at different levels of our technology stack. This in-depth defense is essential for ensuring that one bug cannot create a fatal cascade of errors.

This outage has also revealed the need to be better when it comes to thoroughly stress testing our infrastructure and code in abnormal conditions, such as when various dependent services are offline. Offline mode tests, for instance, must include various types of requests, including authenticated ones, and should have gone on for longer to trigger the cache expiration issue. It also highlights the need for more regular testing. While offline mode has been tested in the past, between the last test and the incident on Feb. 1, Proton’s traffic has grown so dramatically that more tests should have been conducted even though none of the software involved has been changed. 

We are also changing our standard operating procedures for maintenance operations to ensure that more engineers are on-call in case something unexpected happens. One of the reasons for the longer recovery time during this incident was the time required to call additional engineers to accelerate the recovery.

This was a highly unusual downtime and stressful both for our team and our users. We apologize for this inconvenience, and we’re grateful for your trust as we work to provide the highest levels of privacy, security, and reliability to the Proton community.

Best regards,
The Proton Team

***

Feel free to share your feedback and questions with us via our official social media channels on Twitter(new window) and Reddit(new window).

Protect your privacy with Proton
Get a free account

Share this page

Proton Team

We are scientists, engineers, and specialists from around the world drawn together by a shared vision of protecting freedom and privacy online. Proton was born out of a desire to build an internet that puts people before profits, and we're working to create a world where everyone is in control of their digital lives.

Related articles

From having too many inboxes to manage to switching to a new email provider, there are many reasons why you’d want to delete your Gmail account. You might even be concerned that Google is abusing your privacy. Thankfully, you can delete your Gmail a
The first month of 2023 has brought brutal layoffs from Big Tech, a potential ban of TikTok in the US, and another Twitter breach. But the biggest development of this new year has to be the ascent of ChatGPT.  The chatbot can produce remarkably huma
Hackers were able to steal account details from over 200 million Twitter users and posted the database on a hacking forum in early January 2023. These details include users’ email addresses and Twitter handles, allowing people to potentially identify
From your online shopping receipts to financial statements, your emails contain a great deal of sensitive information about your life, interests, and daily schedule. If you’re concerned about your online privacy, it’s therefore vital to keep your inb
At Proton, we’re committed to building privacy-focused products that are convenient to use and improve your productivity. Last year, we released the new mobile apps for Proton Calendar and Proton Drive, letting you manage your schedule and upload imp
Most email services aren’t secure and limit attachment file sizes, but there are ways to send large files securely. If you’ve ever tried attaching multiple images or video files to an email, you’ll know that it doesn’t always work. We explain ways t