Analysis of Proton’s July service disruption

Starting Monday, July 11, and ending Wednesday, July 13, Proton Mail, Proton VPN, and Proton Drive experienced intermittent service disruptions, some of which affected some users for an hour or more. These resulted from an unexpected error, not an attack or other malicious activity.

This doesn’t meet the standards we hold ourselves to, nor is it what the Proton community expects from us. We apologize to you, and we’ve taken steps to make these types of interruptions much less likely in the future. Below we explain what happened, how we stabilized the situation, and what we’ve done to prevent future disruptions.

Background

Over the last several months, our database team has been upgrading our relational databases to be more reliable, faster, and more scalable. We’ve extensively tested these upgrades and, up to this point, performed dozens of them without incident.

We finished the last upgrade on the morning of Sunday, July 10. We saved this particular database for last because it’s the ultimate source of truth for community member account and email address information. It’s also very, very busy. We had identified this database’s high usage rate as a risk. We already had several initiatives in progress to reduce its workload and improve performance to make the overall system more resilient and scalable.

We decided to upgrade the database before these initiatives were complete because the extensive testing and our experience from the previous database upgrades indicated the new database would be faster. As part of this upgrade, we also moved the database to a newer, faster server. We believed this combination of newer software and hardware would improve performance and buy us additional margin to safely implement our more invasive database optimizations.

The incident

All services and metrics were normal until Monday, July 11, at 2:35 PM UTC. As traffic increased, new connections to the new database began to fail, activating automatic protective measures that prevented new connections. We raced to figure out what was wrong and reduce the database’s load by turning off optional or low-priority services, like message notifications.

Usually, if an issue like this arises, we would simply undo the update and revert to the previous software version. Unfortunately, this particular upgrade was irreversible as it involved changing the database’s data formats, and we’d already recorded more than 24 hours of changes using the new version. That meant we were on the clock to mitigate the symptoms we observed, find the root cause, and find a permanent way forward.

We now know that the database software was faster after the upgrade, but the new connections to that database were not. Part of this additional connection latency was inherent to the new database codebase, but each new connection also had an extra round-trip network communication, increasing the strain on an already busy networking stack.

This extra round-trip communication was caused by a new authentication default introduced in a recent patch of the database software. This may not sound like much, but this database processes so many connections that the two extra packets the new authentication process added and the additional inherent connection latency were enough to overwhelm the server on both the MySQL and kernel network levels.
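To make the scale argument concrete, here is a back-of-the-envelope sketch. Only the "two extra packets per connection" figure comes from the description above; the connection rate and the baseline packet count are made-up illustrative numbers, not Proton's measurements.

```python
def extra_packets_per_second(connections_per_second: int,
                             extra_packets_per_connection: int = 2) -> int:
    """Packets per second added purely by the longer authentication exchange."""
    return connections_per_second * extra_packets_per_connection


def handshake_overhead_ratio(baseline_packets: int,
                             extra_packets: int = 2) -> float:
    """Fractional increase in packets needed to open one connection."""
    return extra_packets / baseline_packets


# Hypothetical load: 50,000 new connections per second.
hypothetical_rate = 50_000
print(extra_packets_per_second(hypothetical_rate))   # 100000

# Hypothetical baseline: 8 packets to open a connection before the change.
print(handshake_overhead_ratio(baseline_packets=8))  # 0.25
```

A cost that is negligible for one connection becomes a steady stream of extra work for the network stack once it is multiplied by the connection rate.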

Our response

By the end of Monday, we hadn’t discovered these extra packets, so while we continued to investigate, we also worked to reduce the database’s connection rate. The steps we took included:

Shifting more read-only workloads away from the writable database server
Adding caching of objects and common queries where possible
Deferring low-priority mail to smooth out delivery spikes
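Two of the steps above, moving reads off the writable server and deferring low-priority work, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of Proton's code: `choose_server` and `DeferredQueue` are hypothetical names, and the routing rule is deliberately naive.

```python
from collections import deque
from typing import Deque, List


def choose_server(sql: str) -> str:
    """Naive routing: plain SELECTs go to a replica, everything else to the primary."""
    statement = sql.lstrip().lower()
    # Locking reads must still run on the writable server.
    is_plain_read = statement.startswith("select") and "for update" not in statement
    return "replica" if is_plain_read else "primary"


class DeferredQueue:
    """Hold low-priority jobs and release at most `batch_size` of them per tick."""

    def __init__(self, batch_size: int) -> None:
        self._batch_size = batch_size
        self._jobs: Deque[str] = deque()

    def defer(self, job: str) -> None:
        self._jobs.append(job)

    def drain_one_tick(self) -> List[str]:
        # Releasing a bounded batch turns a delivery spike into a steady trickle.
        count = min(self._batch_size, len(self._jobs))
        return [self._jobs.popleft() for _ in range(count)]
```

A scheduler would call `drain_one_tick()` at a fixed interval, so a burst of deferred jobs reaches the database at a predictable rate instead of all at once.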

We confirmed the authentication issue on Wednesday, July 13, at 1 AM UTC. To mitigate it, our team worked to bring new servers online, which we used to spread out the load over multiple servers to prevent any single one from being overwhelmed.

At 8:42 AM UTC, we changed the authentication parameter back to the default used in the previous version. This helped reduce the activity load on the database server and, along with the optimizations already made, essentially eliminated the errors and alerts we had received over the previous two days.

However, we discovered a secondary issue at 2:14 PM UTC that began when we spread the workload over multiple servers. These new replica servers were dedicating more than 50% of their processing power to verifying their synchronization with the writable primary database. This meant that at peak activity times, the number of connections would overwhelm the replica servers, causing the traffic to be rerouted back to the main database, which in turn created instability and occasionally interrupted service until activity levels dropped.

We eliminated this synchronization load (by caching) shortly before 4:00 PM UTC and stabilized the replica databases, which permanently resolved the intermittent instability.
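The post does not say exactly what was cached. One plausible reading is that the result of the synchronization check was reused for a short interval instead of being recomputed on every request. The sketch below illustrates that general idea only; `CachedSyncCheck`, its parameters, and the `check` callable are hypothetical.

```python
import time
from typing import Callable, Optional


class CachedSyncCheck:
    """Reuse the result of an expensive 'is this replica in sync?' check."""

    def __init__(self, check: Callable[[], bool], max_age_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._check = check              # the expensive verification (assumed)
        self._max_age = max_age_seconds  # how long a result may be reused
        self._clock = clock
        self._checked_at: Optional[float] = None
        self._result = False

    def is_in_sync(self) -> bool:
        now = self._clock()
        stale = self._checked_at is None or now - self._checked_at >= self._max_age
        if stale:
            # Only this branch pays for the real check.
            self._result = self._check()
            self._checked_at = now
        return self._result
```

The trade-off is staleness: a replica could fall behind for up to `max_age_seconds` before the cached answer is refreshed, so the interval has to be short enough for the application's consistency needs.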

Going forward

In the days following the incident, we developed, validated, and executed the first of several planned splits of this database to permanently reduce its workload. Our team implemented these splits successfully without disrupting our service. We also have initiatives in progress to improve our connection pooling so that this specific problem cannot recur.
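Connection pooling addresses this kind of problem because the cost of opening and authenticating a connection is paid once and then shared across many requests. The following is a generic, minimal sketch of the technique, not Proton's design; `ConnectionPool` and the `connect` callable are hypothetical, and a production pool would also validate and replace broken connections.

```python
import queue
import threading
from contextlib import contextmanager
from typing import Any, Callable, Iterator


class ConnectionPool:
    """Open at most `size` connections and reuse them across requests."""

    def __init__(self, connect: Callable[[], Any], size: int) -> None:
        self._connect = connect
        self._size = size
        self._created = 0
        self._lock = threading.Lock()
        self._idle: "queue.LifoQueue[Any]" = queue.LifoQueue()

    def _acquire(self) -> Any:
        try:
            return self._idle.get_nowait()  # reuse: no new handshake
        except queue.Empty:
            pass
        with self._lock:
            may_create = self._created < self._size
            if may_create:
                self._created += 1
        if may_create:
            try:
                return self._connect()      # the only place a handshake happens
            except Exception:
                with self._lock:
                    self._created -= 1
                raise
        return self._idle.get()             # pool exhausted: wait for a return

    @contextmanager
    def connection(self) -> Iterator[Any]:
        conn = self._acquire()
        try:
            yield conn
        finally:
            self._idle.put(conn)
```

With a pool like this, a hundred sequential requests written as `with pool.connection() as conn: ...` trigger a single connection setup rather than a hundred, so any per-connection overhead stops scaling with request volume.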

These measures, while necessary, are insufficient. They make us better prepared to fight the last war, but they do not anticipate future problems or address the decision-making process that led to this incident.

To achieve that goal, our infrastructure and application teams are performing a thorough multi-stage review of all services and systems to better understand possible failure modes and how we can mitigate them. The reviewers consist of service owners and other team members to ensure we have subject-matter expertise and fresh sets of eyes. The emphasis of this review is not only to prevent failures but also to localize potential failures and prevent cascades and large-scale service interruptions to the extent possible. Some fixes will be quick, and others are architectural and will take time, but we’re committed to making Proton services as reliable as the Proton community expects and deserves.

On the decision-making side, we’ve dissected the process and inputs that led to the decision to do the upgrade before the split to ensure that we make the correct decision next time. Very, very few changes we make, whether to infrastructure or the application code, are irreversible, and for good reason. In fact, this is the only such change in the last several years. In this case, attempting to make the change reversible would not have been feasible. But the fact that it was irreversible should have triggered a more cautious change approval process, and the upgrade’s previously successful track record made us overconfident that this database would behave the same, despite its vastly heavier workload.

This is an opportunity for us to re-evaluate our infrastructure approach, and ultimately it will lead to us being more resilient and better prepared in the future. Thank you to everyone in the Proton community for your patience during the service disruption. We have learned many lessons that will serve us well as we work to build an internet where privacy is the default, and we thank you again for your support.
