
The truth about anonymized data

Many companies that handle personal information tell their users that all the data is “anonymized.” If you don’t know any better, that sounds reassuring.

However, the method most companies use to anonymize data and the size of modern databases make it easy for attackers to re-identify individuals. From medical records to cell phone data sets, it only takes about a dozen pieces of information to find the person behind each “anonymous” record.

Part of our mission at Proton is to make sure people understand the privacy risks of sharing data. Maintaining your data security means sharing your data only with trustworthy organizations that are clear about what data they collect and what they do with it. 

Everyone leaves a trace

By definition, truly anonymized data is stripped of every element that could identify an individual. (In this article, we’ll refer to this individual as the “data subject,” borrowing the GDPR term.) The most popular method of anonymization is to remove personally identifiable information from a database, such as your name, birth date, phone number, and home address.

On the surface, this might seem like enough to protect your privacy. However, as you begin combining different types of data, you can start to identify people. Indeed, one data anonymization company, Aircloak, acknowledges that true anonymization is extremely difficult: “as is the case with IT security, no 100% guarantee can be given, and often there is the need for a risk assessment.”

Here’s an example of re-identification from the Journal of Technology Science that shows how this might work. In it, an “anonymous” medical record is cross-referenced with another source of information (in this case, a newspaper brief about a motorcycle crash) to identify the patient’s name.
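The cross-referencing described above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the cited study: every record, every field name, and the `link_records` helper are invented for the example, and it assumes only that both sources share a few attributes (age, sex, ZIP code, and date).

```python
# Hypothetical data: none of these records come from a real data set.
hospital_records = [
    {"age": 60, "sex": "M", "zip": "98052", "date": "2011-10-05", "diagnosis": "fractured pelvis"},
    {"age": 60, "sex": "F", "zip": "98052", "date": "2011-10-05", "diagnosis": "appendicitis"},
    {"age": 34, "sex": "M", "zip": "98101", "date": "2011-10-07", "diagnosis": "concussion"},
]

# A public source that names a person and shares some attributes with the records.
news_briefs = [
    {"name": "Example Rider", "age": 60, "sex": "M", "zip": "98052", "date": "2011-10-05"},
]

SHARED_FIELDS = ("age", "sex", "zip", "date")


def link_records(records, briefs, fields=SHARED_FIELDS):
    """Return (name, diagnosis) pairs where a brief matches exactly one record."""
    matches = []
    for brief in briefs:
        candidates = [r for r in records if all(r[f] == brief[f] for f in fields)]
        if len(candidates) == 1:  # an unambiguous match re-identifies the patient
            matches.append((brief["name"], candidates[0]["diagnosis"]))
    return matches


print(link_records(hospital_records, news_briefs))
```

If `sex` is dropped from the shared fields, two records fit the news brief and no match is returned. Each extra attribute narrows the pool of candidates.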

It takes only 15 data points to make 99.98% of people identifiable in a database of 7 million people, according to one paper published in Nature.

Fifteen data attributes may seem like a lot, but it’s not. The paper references the Experian data breach, which leaked an “anonymized” database containing 248 data points on 120 million Americans. Major political campaigns also keep massive databases (and distribute them to their allies) that include hundreds of data points on their data subjects.

If a database has fewer people in it, it becomes substantially easier to re-identify individuals. One investigation needed only four data points. There are dozens of other examples.
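To see why each additional data point matters, here is a rough sketch that measures how many records in a data set are unique as more attributes are taken into account. The population is randomly generated, and the attribute names and the `unique_fraction` helper are assumptions made for this illustration. It does not reproduce the statistical model from the paper cited above.

```python
import random
from collections import Counter


def unique_fraction(rows, keys):
    """Share of rows whose combination of values for `keys` appears exactly once."""
    combos = Counter(tuple(row[k] for k in keys) for row in rows)
    unique = sum(1 for row in rows if combos[tuple(row[k] for k in keys)] == 1)
    return unique / len(rows)


# Synthetic population with invented attributes; a fixed seed keeps runs repeatable.
rng = random.Random(0)
ATTRIBUTES = {
    "birth_year": list(range(1940, 2005)),
    "sex": ["F", "M"],
    "zip3": list(range(100, 150)),
    "household_size": list(range(1, 7)),
}
people = [
    {name: rng.choice(values) for name, values in ATTRIBUTES.items()}
    for _ in range(2000)
]

names = list(ATTRIBUTES)
fractions = [unique_fraction(people, names[:n]) for n in range(1, len(names) + 1)]
for n, fraction in enumerate(fractions, start=1):
    print(f"{n} attribute(s): {fraction:.1%} of records are unique")
```

Adding an attribute can only split groups of matching records further, so the unique share never goes down as attributes are added.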

Why this matters

Re-identifying data contained in a supposedly anonymized database is not just a neat statistical trick for academics. It has real-world consequences. Anonymized data is treated differently because it is supposed to protect the privacy of its data subjects.

In the US, anonymized medical records can be sold to pharmaceutical companies. A similar practice is allowed in the UK.

Some countries do a better job of requiring effective anonymization. The European Union’s GDPR covers this in Recital 26, which says that data must truly be anonymous to be exempt from the regulation’s data protection rules. And there are methods of anonymization, such as data generalization or perturbation, that are more effective.
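As a rough sketch of what generalization and perturbation mean in practice, the example below coarsens two attributes and measures the size of the smallest group of records that look alike. The records, the ten-year age bands, the three-digit ZIP prefix, and the helper names are all assumptions chosen for illustration, not a recommended or standard configuration.

```python
import random
from collections import Counter

# Hypothetical records; the ages and ZIP codes are invented.
records = [
    {"age": 34, "zip": "02139"},
    {"age": 36, "zip": "02141"},
    {"age": 38, "zip": "02142"},
    {"age": 51, "zip": "02446"},
    {"age": 57, "zip": "02445"},
]


def generalize(record):
    """Coarsen age to a ten-year band and ZIP code to its first three digits."""
    low = record["age"] // 10 * 10
    return {"age": f"{low}-{low + 9}", "zip": record["zip"][:3] + "**"}


def perturb_age(age, rng, spread=2):
    """Shift an age by a random whole number of years, at most `spread` either way."""
    return age + rng.randint(-spread, spread)


def smallest_group(rows, keys=("age", "zip")):
    """Size of the smallest set of rows that share the same values for `keys`."""
    return min(Counter(tuple(row[k] for k in keys) for row in rows).values())


generalized = [generalize(r) for r in records]
print(smallest_group(records))      # every raw record is distinct
print(smallest_group(generalized))  # generalized records blend into groups
print(perturb_age(34, random.Random(0)))
```

With these five records, the smallest group grows from one record to two after generalization, so no one stands alone on age and ZIP code. The trade-off is that coarser or noisier data is less precise for analysis.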

However, this issue touches on more than just the technical difficulties of anonymization. It also raises the question of the misleading promises companies make when they talk about how they treat your data.

Data analysis can provide numerous benefits to citizens, organizations, and governments, and it is legitimate to collect and analyze data for specific purposes. The distributed privacy-preserving contact tracing project is one example of how data collection could be used to trace COVID-19 infections while protecting individuals’ privacy.

However, data collection must always be made clear to the data subject, and people should always have a choice. Many companies present vague or hard-to-decipher privacy policies that make it almost impossible for data subjects to know what data is being collected and who it is being shared with. These companies treat anonymization as a way to sell data while still meeting the minimum requirement for data security.

If malicious actors can re-identify you from anonymized data, that raises ethical questions about such a business model. As a user, it means you should evaluate the companies you share data with even more closely. And companies, at the very least, should notify their users of the risk of re-identification before they share their data. Otherwise, users cannot give informed consent.

The Proton solution

When it comes to data protection, the best approach is to collect as little as necessary to securely deliver service to users. At Proton Mail, we work hard to limit the amount of information we collect, as we make clear in our privacy policy.

We use technical safeguards, such as end-to-end encryption and zero-access encryption, to ensure that you have control over who has access to your messages. We protect our users’ privacy by limiting the amount of data we require to set up an account and by offering anonymous payment options.

We’ve also removed financial incentives to access users’ data. Proton is funded through a subscription-based, ad-free business model. This allows us to focus on our main mission: to increase freedom and privacy online. If we fail to protect your data, we will lose users, which means our interests are aligned with the community’s.

As a tech company made up of physicists and engineers, we recognize the value of data. However, where other companies see your data as a resource to be exploited, we see something personal that belongs to you and deserves safekeeping.

You can get a free secure email account from Proton Mail.

We also provide a free VPN service to protect your privacy.

Proton Mail and Proton VPN are funded by community contributions. If you would like to support our development efforts, you can upgrade to a paid plan. Thank you for your support.


Feel free to share your feedback and questions with us via our official social media channels on Twitter and Reddit.
