
Starting Monday, July 11, and ending Wednesday, July 13, Proton Mail, Proton VPN, and Proton Drive experienced intermittent service disruptions, some of which affected some users for an hour or more. These resulted from an unexpected error, not an attack or other malicious activity.

This doesn’t meet the standards we hold ourselves to, nor is it what the Proton community expects from us. We apologize to you, and we’ve taken steps to make these types of interruptions much less likely in the future. Below we explain what happened, how we stabilized the situation, and what we’ve done to prevent future disruptions. 

Background

Over the last several months, our database team has been upgrading our relational databases to be more reliable, faster, and more scalable. We’ve extensively tested these upgrades and, up to this point, performed dozens of them without incident. 

We finished the last upgrade on the morning of Sunday, July 10. We saved this particular database for last because it’s the ultimate source of truth for community member account and email address information. It’s also very, very busy. We had identified this database’s high usage rate as a risk. We already had several initiatives in progress to reduce its workload and improve performance to make the overall system more resilient and scalable.

We decided to upgrade the database before these initiatives were complete because the extensive testing and our experience from the previous database upgrades indicated the new database would be faster. As part of this upgrade, we also moved the database to a newer, faster server. We believed this combination of newer software and hardware would improve performance and buy us additional margin to safely implement our more invasive database optimizations.

The incident

All services and metrics were normal until Monday, July 11, at 2:35 PM UTC. As traffic increased, new connections to the new database began to fail, activating automatic protective measures that prevented new connections. We raced to figure out what was wrong and reduce the database’s load by turning off optional or low-priority services, like message notifications. 

Usually, if an issue like this arises, we would simply undo the update and revert to the previous software version. Unfortunately, this particular upgrade was irreversible as it involved changing the database’s data formats, and we’d already recorded more than 24 hours of changes using the new version. That meant we were on the clock to mitigate the symptoms we observed, find the root cause, and find a permanent way forward.

We now know that the database software was faster after the upgrade, but the new connections to that database were not. Part of this additional connection latency was inherent to the new database codebase, but each new connection also had an extra round-trip network communication, increasing the strain on an already busy networking stack.

This extra round-trip communication was caused by a new authentication default introduced in a recent patch of the database software. This may not sound like much, but this database processes so many connections that the two extra packets the new authentication process added and the additional inherent connection latency were enough to overwhelm the server on both the MySQL and kernel network levels.  
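To make the scale argument concrete, here is a rough back-of-the-envelope sketch in Python. Only the figure of two extra packets per connection comes from the account above. The connection rate and handshake durations are hypothetical values chosen for illustration, not measurements from the incident.

```python
def extra_packets_per_second(new_connections_per_second: float,
                             extra_packets_per_connection: int = 2) -> float:
    """Additional packets per second caused by a longer handshake."""
    return new_connections_per_second * extra_packets_per_connection


def handshakes_in_flight(new_connections_per_second: float,
                         handshake_seconds: float) -> float:
    """Little's law: average handshakes in progress = arrival rate * duration."""
    return new_connections_per_second * handshake_seconds


# Hypothetical figures, chosen only for illustration.
rate = 20_000  # new connections per second (assumed)

# Extra packets per second added by the new authentication default.
print(extra_packets_per_second(rate))

# Connections stuck mid-handshake before and after a 1 ms slowdown (assumed).
print(handshakes_in_flight(rate, 0.002))
print(handshakes_in_flight(rate, 0.003))
```

The point of the sketch is that both quantities grow linearly with the connection rate, so an overhead that is negligible on a quiet database can become significant on a very busy one.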

Our response

By the end of Monday, we hadn’t discovered these extra packets, so while we continued to investigate, we also worked to reduce the database’s connection rate. The steps we took included:

  • Shifting more read-only workloads away from the writable database server
  • Additional caching of objects and common queries where possible
  • Deferring low-priority mail to smooth out delivery spikes
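The first of these steps, moving read-only work off the writable server, can be sketched as a small query router. This is a minimal illustration and not our production routing code: the `QueryRouter` class, the server names, and the rule that only plain `SELECT` statements count as reads are all assumptions made for the example.

```python
import itertools


class QueryRouter:
    """Send writes to the primary and spread reads across replicas."""

    def __init__(self, primary: str, replicas: list[str]):
        self.primary = primary
        # Fall back to the primary if no replicas are configured.
        self._read_targets = itertools.cycle(replicas or [primary])

    def target_for(self, sql: str) -> str:
        # Naive classification (assumed): only plain SELECT statements are reads.
        is_read = sql.lstrip().upper().startswith("SELECT")
        return next(self._read_targets) if is_read else self.primary


# Hypothetical server names.
router = QueryRouter("primary", ["replica-1", "replica-2"])
print(router.target_for("SELECT name FROM accounts WHERE id = 1"))
print(router.target_for("UPDATE accounts SET name = 'x' WHERE id = 1"))
```

A real router would also need to account for replication lag and for reads that must see the latest write, which this sketch ignores.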

We confirmed the authentication issue on Wednesday, July 13, at 1 AM UTC. To mitigate it, our team worked to bring new servers online, which we used to spread out the load over multiple servers to prevent any single one from being overwhelmed. 

At 8:42 AM UTC, we changed the authentication parameter back to the default used in the previous version. This helped reduce the activity load on the database server and, along with the optimizations already made, essentially eliminated the errors and alerts we had received the last two days.

However, at 2:14 PM UTC we discovered a secondary issue that began when we spread the workload over multiple servers. The new replica servers were dedicating more than 50% of their processing power to verifying their synchronization with the writable primary database. At peak activity times, the number of connections would therefore overwhelm the replica servers, causing traffic to be rerouted back to the main database, which in turn created instability and occasionally interrupted service until activity levels dropped.
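The fallback effect described here can be illustrated with a simple capacity model. The sketch below is an assumption-laden simplification: the function name, the replica count, and the per-replica capacity are hypothetical, and only the roughly 50% synchronization overhead reflects what we observed.

```python
def overflow_to_primary(read_load: float,
                        replica_count: int,
                        replica_capacity: float,
                        sync_overhead: float) -> float:
    """Read load the replicas cannot absorb, which falls back to the primary.

    sync_overhead is the fraction of each replica's capacity spent on
    synchronization checks (0.0 to 1.0).
    """
    usable = replica_count * replica_capacity * (1.0 - sync_overhead)
    return max(0.0, read_load - usable)


# Hypothetical numbers: 4 replicas, each able to serve 1,000 reads per second,
# facing a peak of 3,000 reads per second.
print(overflow_to_primary(3000, 4, 1000, sync_overhead=0.5))  # overhead present
print(overflow_to_primary(3000, 4, 1000, sync_overhead=0.0))  # overhead removed
```

In this model, the same peak load that fits comfortably on the replicas without the overhead spills back onto the primary once half of each replica's capacity is consumed by synchronization checks.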

Shortly before 4:00 PM UTC, we eliminated this synchronization load through caching. This stabilized the replica databases and permanently resolved the intermittent instability.

Going forward

In the days following the incident, we developed, validated, and executed the first of several planned splits of this database to permanently reduce its workload. Our team implemented these splits successfully without disrupting our service. We also have initiatives in progress to improve our connection pooling so that this specific problem cannot recur.
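For readers unfamiliar with connection pooling, the idea is to pay the connection and authentication cost once per pooled connection rather than once per request. The following is a minimal sketch of the general technique, not a description of our implementation: `ConnectionPool` and `fake_connect` are hypothetical names, and `fake_connect` stands in for a real database driver's connect call.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Reuse a fixed set of connections instead of opening one per request."""

    def __init__(self, connect, size: int):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())  # handshake cost is paid once, up front

    @contextmanager
    def connection(self, timeout: float = 1.0):
        # Waits while all connections are busy; raises queue.Empty on timeout.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)  # hand the connection back for the next caller


# Stand-in for a real database driver's connect call (hypothetical).
opened = []


def fake_connect():
    opened.append(object())
    return opened[-1]


pool = ConnectionPool(fake_connect, size=2)
for _ in range(100):
    with pool.connection() as conn:
        pass  # a query would run on conn here

print(len(opened))  # number of connections actually opened
```

With a pool in front of the database, the rate of new connections depends on the pool size rather than on request volume, which is why pooling addresses the kind of per-connection overhead described above.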

These measures, while necessary, are insufficient. They make us better prepared to fight the last war, but they do not anticipate future problems or address the decision-making process that led to this incident.

To address those gaps, our infrastructure and application teams are performing a thorough multi-stage review of all services and systems to better understand possible failure modes and how we can mitigate them. The reviewers include service owners and other team members to ensure we have both subject-matter expertise and fresh sets of eyes. The review aims not only to prevent failures but also to localize those that do occur, preventing cascades and large-scale service interruptions to the extent possible. Some fixes will be quick, and others are architectural and will take time, but we’re committed to making Proton services as reliable as the Proton community expects and deserves.

On the decision-making side, we’ve dissected the process and inputs that led to the decision to do the upgrade before the split to ensure that we make the correct decision next time. Very, very few changes we make, whether to infrastructure or the application code, are irreversible, and for good reason. In fact, this is the only such change in the last several years. In this case, attempting to make the change reversible would not have been feasible. But the fact that it was irreversible should have triggered a more cautious change approval process, and the upgrade’s previously successful track record made us overconfident that this database would behave the same, despite its vastly heavier workload.

This is an opportunity for us to re-evaluate our infrastructure approach, and ultimately it will lead to us being more resilient and better prepared in the future. Thank you to everyone in the Proton community for your patience during the service disruption. We have learned many lessons that will serve us well as we work to build an internet where privacy is the default, and we thank you again for your support.
