Analysis of Proton’s July service disruption

Bart Butler

Starting Monday, July 11, and ending Wednesday, July 13, Proton Mail, Proton VPN, and Proton Drive experienced intermittent service disruptions, some of which affected some users for an hour or more. These resulted from an unexpected error, not an attack or other malicious activity.

This doesn’t meet the standards we hold ourselves to, nor is it what the Proton community expects from us. We apologize to you, and we’ve taken steps to make these types of interruptions much less likely in the future. Below we explain what happened, how we stabilized the situation, and what we’ve done to prevent future disruptions. 

Background

Over the last several months, our database team has been upgrading our relational databases to be more reliable, faster, and more scalable. We’ve extensively tested these upgrades and, up to this point, performed dozens of them without incident. 

We finished the last upgrade on the morning of Sunday, July 10. We saved this particular database for last because it’s the ultimate source of truth for community member account and email address information. It’s also very, very busy. We had identified this database’s high usage rate as a risk. We already had several initiatives in progress to reduce its workload and improve performance to make the overall system more resilient and scalable.

We decided to upgrade the database before these initiatives were complete because the extensive testing and our experience from the previous database upgrades indicated the new database would be faster. As part of this upgrade, we also moved the database to a newer, faster server. We believed this combination of newer software and hardware would improve performance and buy us additional margin to safely implement our more invasive database optimizations.

The incident

All services and metrics were normal until Monday, July 11, at 2:35 PM UTC. As traffic increased, new connections to the new database began to fail, activating automatic protective measures that prevented new connections. We raced to figure out what was wrong and reduce the database’s load by turning off optional or low-priority services, like message notifications. 

Usually, when an issue like this arises, we simply undo the update and revert to the previous software version. Unfortunately, this particular upgrade was irreversible because it changed the database’s data formats, and we’d already recorded more than 24 hours of changes using the new version. That meant we were on the clock to mitigate the symptoms we observed, find the root cause, and find a permanent way forward.

We now know that the database software was faster after the upgrade, but the new connections to that database were not. Part of this additional connection latency was inherent to the new database codebase, but each new connection also had an extra round-trip network communication, increasing the strain on an already busy networking stack.

This extra round-trip communication was caused by a new authentication default introduced in a recent patch of the database software. This may not sound like much, but this database processes so many connections that the two extra packets the new authentication process added and the additional inherent connection latency were enough to overwhelm the server on both the MySQL and kernel network levels.  
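To illustrate why a small per-connection cost matters at scale, here is a back-of-envelope sketch. Only the two extra packets per connection come from the account above. The connection rate and timings are made-up figures chosen for illustration, not measurements from the incident, and the function names are hypothetical.

```python
def extra_packets_per_second(new_connections_per_second, extra_packets_per_connection=2):
    """Packets per second added by the extra authentication traffic."""
    return new_connections_per_second * extra_packets_per_connection


def connections_mid_handshake(new_connections_per_second, handshake_seconds):
    """Average number of connections being set up at any moment.

    This is Little's law: items in a system = arrival rate * time in system.
    """
    return new_connections_per_second * handshake_seconds


if __name__ == "__main__":
    rate = 10_000              # hypothetical new connections per second
    base_handshake = 0.001     # hypothetical handshake time in seconds
    extra_round_trip = 0.0005  # hypothetical cost of one more round trip

    print("extra packets/s:", extra_packets_per_second(rate))
    print("mid-handshake before:", connections_mid_handshake(rate, base_handshake))
    print("mid-handshake after:",
          connections_mid_handshake(rate, base_handshake + extra_round_trip))
```

Under these assumed numbers, the server handles tens of thousands of additional packets every second and holds noticeably more half-open connections at once, which is the kind of pressure a busy networking stack may not absorb.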

Our response

By the end of Monday, we hadn’t discovered these extra packets, so while we continued to investigate, we also worked to reduce the database’s connection rate. The steps we took included:

  • Shifting more read-only workloads away from the writable database server
  • Additional caching of objects and common queries where possible
  • Deferring low-priority mail to smooth out delivery spikes
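The first of these steps, moving read-only work off the writable server, can be sketched as a small query router. This is a minimal illustration under our own assumptions, not Proton’s implementation: the `QueryRouter` class, the server names, and the rule of classifying a statement by its first keyword are all hypothetical.

```python
import itertools


class QueryRouter:
    """Send plain SELECT statements to replicas and everything else to the primary."""

    def __init__(self, primary, replicas):
        if not replicas:
            raise ValueError("at least one replica is required")
        self.primary = primary
        # Rotate through the replicas so reads are spread evenly.
        self._replicas = itertools.cycle(replicas)

    def route(self, sql):
        words = sql.split(None, 1)
        first_word = words[0].upper() if words else ""
        if first_word == "SELECT":
            return next(self._replicas)
        # Writes, transactions, and anything unrecognized go to the primary.
        return self.primary


if __name__ == "__main__":
    router = QueryRouter("primary", ["replica-1", "replica-2"])
    for statement in ("SELECT 1", "UPDATE t SET x = 1", "SELECT 2"):
        print(statement, "->", router.route(statement))
```

A production router would need more care than a keyword check. For example, locking reads such as `SELECT ... FOR UPDATE` and reads that must see a just-committed write still belong on the primary.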

We confirmed the authentication issue on Wednesday, July 13, at 1 AM UTC. To mitigate it, our team worked to bring new servers online, which we used to spread out the load over multiple servers to prevent any single one from being overwhelmed. 

At 8:42 AM UTC, we changed the authentication parameter back to the default used in the previous version. This helped reduce the activity load on the database server and, along with the optimizations already made, essentially eliminated the errors and alerts we had received over the previous two days.

However, we discovered a secondary issue at 2:14 PM UTC that began when we spread the workload over multiple servers. These new replica servers were dedicating more than 50% of their processing power to verifying their synchronization with the writable primary database. This meant that at peak activity times, the number of connections would overwhelm the replica servers, causing the traffic to be rerouted back to the main database, which in turn created instability and occasionally interrupted service until activity levels dropped. 

We eliminated this synchronization load (by caching) shortly before 4:00 PM UTC and stabilized the replica databases, which permanently resolved the intermittent instability.
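The account above says only that caching removed the synchronization load, so the following is a generic sketch of the idea rather than a description of the actual fix. It assumes a hypothetical, expensive `check` function that reports a replica’s sync status, and it reuses the last result for a short time instead of running the check on every request. The class name and the TTL value are illustrative.

```python
import time


class CachedSyncCheck:
    """Reuse the result of an expensive check until it is older than ttl_seconds."""

    def __init__(self, check, ttl_seconds, clock=time.monotonic):
        self._check = check        # hypothetical function that queries the replica
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so the cache can be tested
        self._value = None
        self._expires_at = None

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # Cache is empty or stale: run the real check once and remember it.
            self._value = self._check()
            self._expires_at = now + self._ttl
        return self._value


if __name__ == "__main__":
    calls = 0

    def slow_check():
        global calls
        calls += 1
        return "in sync"

    cached = CachedSyncCheck(slow_check, ttl_seconds=5.0)
    for _ in range(1000):
        cached.get()
    print("real checks performed:", calls)
```

The trade-off is staleness: for up to the TTL, callers may act on a sync status that is no longer accurate, so the TTL has to be short enough for the application to tolerate.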

Going forward

In the days following the incident, we developed, validated, and executed the first of several planned splits of this database to permanently reduce its workload. Our team implemented these splits successfully without disrupting our service. We also have initiatives in progress to improve our connection pooling so that this specific problem cannot recur. 
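Connection pooling helps here because a pooled connection pays the connection and authentication cost once and is then reused many times. The sketch below shows the general technique with Python’s standard library. It is not Proton’s pooling layer: the `ConnectionPool` class and the `connect` factory are hypothetical, and a real pool would also need to detect and replace broken connections.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Open a fixed number of connections up front and lend them out."""

    def __init__(self, connect, size):
        self._idle = queue.LifoQueue(maxsize=size)
        for _ in range(size):
            # The expensive handshake happens here, once per pooled connection.
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout=None):
        # Blocks until a connection is free; raises queue.Empty on timeout.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Always return the connection so other callers can reuse it.
            self._idle.put(conn)


if __name__ == "__main__":
    handshakes = 0

    def connect():
        global handshakes
        handshakes += 1
        return object()  # stand-in for a real database connection

    pool = ConnectionPool(connect, size=2)
    for _ in range(100):
        with pool.connection() as conn:
            pass  # run a query with conn here
    print("handshakes performed:", handshakes)
```

With a pool, the number of handshakes depends on the pool size rather than on the request rate, which is why pooling directly targets the connection overhead described earlier.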

These measures, while necessary, are insufficient. They make us better prepared to fight the last war, but they do not anticipate future problems or address the decision-making process that led to this incident.

To achieve that goal, our infrastructure and application teams are performing a thorough multi-stage review of all services and systems to better understand possible failure modes and how we can mitigate them. The reviewers consist of service owners and other team members to ensure we have subject-matter expertise and fresh sets of eyes. The emphasis of this review is not only to prevent failures but also to localize any that do occur, preventing cascades and large-scale service interruptions to the extent possible. Some fixes will be quick, and others are architectural and will take time, but we’re committed to making Proton services as reliable as the Proton community expects and deserves.

On the decision-making side, we’ve dissected the process and inputs that led to the decision to do the upgrade before the split to ensure that we make the correct decision next time. Very, very few changes we make, whether to infrastructure or the application code, are irreversible, and for good reason. In fact, this is the only such change in the last several years. In this case, attempting to make the change reversible would not have been feasible. But the fact that it was irreversible should have triggered a more cautious change approval process, and the upgrade’s previously successful track record made us overconfident that this database would behave the same, despite its vastly heavier workload.

This is an opportunity for us to re-evaluate our infrastructure approach, and ultimately it will lead to us being more resilient and better prepared in the future. Thank you to everyone in the Proton community for your patience during the service disruption. We have learned many lessons that will serve us well as we work to build an internet where privacy is the default, and we thank you again for your support.

Bart Butler

Bart is the CTO of Proton, the company behind Proton Mail and Proton VPN. An expert in email encryption, Bart was previously a physicist at CERN working on the ATLAS experiment. He was also a postdoctoral researcher at Harvard and received his PhD in Physics from Stanford University.