When companies say that your personal data is anonymized, it sounds like your online identity is scrubbed away for good. Your information becomes noise in a dataset, so you can let your guard down. Well, not quite.

Anonymized data is data with the most obvious personal identifiers removed, like name or home address. But in a world full of interconnected databases, it only takes a handful of seemingly unrelated details to track someone down.

Research has shown that just 15 demographic attributes are enough to identify 99.98% of people in a dataset of millions. And with AI connecting the dots across your online activity, the gap between “anonymous” and “identified” is shrinking.

Let’s take a look at what data anonymization actually means and what you can do to better protect your privacy.

What is data anonymization?

Data anonymization is the irreversible process of stripping personally identifiable details from data, such as your name, email address, phone number, or birthday. The goal is to sever the link between a record and a person as much as possible.

However, after anonymization, data still includes indirect clues, such as your general location, browsing habits, and age range. Individually, these details are pretty harmless, but taken together, they form a pattern that points to you.
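To make that concrete, here is a toy sketch (the field names and values are invented) of what removing direct identifiers leaves behind:

```python
# Toy record before and after anonymization (field names and values invented).
record = {
    "name": "Jane Doe",             # direct identifier
    "email": "jane@example.com",    # direct identifier
    "zip_code": "94110",            # indirect clue (quasi-identifier)
    "age_range": "30-39",           # indirect clue
    "browser": "Firefox on Linux",  # indirect clue
}

DIRECT_IDENTIFIERS = {"name", "email"}

def anonymize(rec: dict) -> dict:
    """Drop direct identifiers; the indirect clues survive."""
    return {k: v for k, v in rec.items() if k not in DIRECT_IDENTIFIERS}

print(anonymize(record))
# {'zip_code': '94110', 'age_range': '30-39', 'browser': 'Firefox on Linux'}
# No name or email, yet these fields together can still single someone out.
```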

[Diagram: how anonymization works]

Some types of data, such as biometrics, are especially difficult (or even impossible) to truly anonymize. You can create a new username, but you can’t change a person’s face, fingerprint, or iris pattern.

When data is truly anonymized, it is no longer considered personal under privacy laws such as the GDPR. That means companies may use it without the consent and protection requirements that apply to personal data.

But GDPR’s Recital 26 sets a high bar: data must no longer identify a person, even when considering other information and methods that could reasonably be used to reidentify them. So, removing names or email addresses is not enough if the remaining data still points back to someone.

Anonymization vs pseudonymization

While anonymization permanently removes identifiable information to ensure it cannot be traced back to an individual, pseudonymization replaces that data with a label, token, or code. The original identity is stored separately in a secure key or lookup table, but with the right access, that label can be linked back to a real person.

An example of pseudonymization is medical research, where patient names are replaced with codes. Researchers can still track the data, but only authorized personnel with the key can reconnect it to the individual.

This difference is simple but important. Pseudonymized data is still considered personal data under regulations like the GDPR because it can be linked back to someone. Anonymized data, by contrast, falls outside those obligations only when reidentification is no longer reasonably possible.
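Here is a minimal sketch of the pseudonymization pattern. The token format is invented, and an in-memory dictionary stands in for the secure, separately stored lookup table a real system would use:

```python
import secrets

# The mapping that makes pseudonymization reversible. In practice it lives
# in a separate, access-controlled system, not next to the research data.
lookup_table: dict[str, str] = {}

def pseudonymize(patient_name: str) -> str:
    """Replace a name with a random token and record the mapping."""
    token = "P-" + secrets.token_hex(4)  # invented token format
    lookup_table[token] = patient_name
    return token

token = pseudonymize("Jane Doe")
print(token)                # e.g. 'P-9f2c1ab0': what researchers work with
print(lookup_table[token])  # 'Jane Doe': only key holders can reverse it
```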

Common data anonymization techniques

Companies use different anonymization methods depending on how they plan to use the data. Here are some common ones, with a quick sketch after the list:

Data masking replaces information with fake data, such as swapping a phone number for a fictional one.

Generalization makes data less specific, like using age ranges rather than an exact age.

Data swapping shuffles information across records so they no longer match the original person.

Data perturbation obscures individual details while preserving overall trends, such as by adding small random noise or rounding numbers.

Synthetic data relies on artificial data that imitates the patterns of the original dataset without directly using real records.
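Here is the promised sketch: simplified, illustrative versions of three of these techniques applied to a single invented record.

```python
import random

# One invented record, with simplified versions of three techniques.
row = {"phone": "+1-415-555-0132", "age": 34, "monthly_spend": 1287.53}

# Data masking: replace the real phone number with a fictional one.
masked_phone = f"+1-555-010-{random.randint(0, 9999):04d}"

# Generalization: report a decade-wide age range instead of the exact age.
decade = row["age"] // 10 * 10
age_range = f"{decade}-{decade + 9}"  # '30-39'

# Perturbation: add small random noise and round to the nearest 10,
# hiding the exact figure while keeping dataset-level trends intact.
perturbed_spend = round(row["monthly_spend"] + random.gauss(0, 50), -1)

print(masked_phone, age_range, perturbed_spend)
```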

These techniques can reduce privacy risks, but their effectiveness depends entirely on how well they’re applied. Even then, they may not remove every clue that could identify someone.

How companies use anonymized data

Anonymized data is valuable because companies can legally use it however they want, without your consent. Common uses include:

Analytics and development: Companies study user behavior to improve products, measure trends, and guide business decisions.

Advertising: Browsing and purchase patterns can be used to build audience segments for targeted ads, even without your name attached.

Data brokers: Some data is aggregated, packaged, and resold by data brokers. These companies combine information from apps, websites, public records, credit data, and more to build detailed profiles that are sold to whoever wants them, with little legal oversight.

Training AI models: Large datasets are often used to train AI systems, including data drawn from user activity, purchased datasets, and public or scraped sources.

Medical research: In some countries, anonymized medical data can be sold to pharmaceutical companies or shared with researchers.

Anonymized data can be used for good, such as improving services or supporting research. The problem is that it creates a strong commercial incentive for data brokers and advertisers to collect, combine, share, repackage, and sell information about people, often in ways they do not fully understand or meaningfully consent to. For people who later decide they want out, removing their data is not simple.

California’s privacy regulator created the DROP system because deleting data from hundreds of data brokers has historically been difficult for individuals to manage. This is much harder with AI training data: once data has influenced a trained model, removing it may require machine unlearning techniques that AI companies have shown little appetite for.

Data reidentification, or why anonymized data isn’t truly anonymous

If someone tells you that they’re looking for a man in his 30s who drives a white car and lives in your neighborhood, you might already have a good idea of who they mean. None of those details identifies the person on its own, but together they narrow the possibilities by excluding everyone else. Anonymized data works the same way: even if names and contact details are removed, the remaining information can become revealing when enough details are combined.

When these patterns are cross-referenced with other sources, such as social media or public records, it becomes possible to connect supposedly anonymous data to a person. This is known as reidentification, and it’s often easier than you expect.

[Diagram: how reidentification works]

Researcher Latanya Sweeney purchased a hospital dataset for $50 that contained indirect identifiers, such as demographics, diagnoses, and billing details. Revealing details such as names were not included. By cross-referencing this data with local news stories on hospitalizations, she was able to match 43% of patients to their records, including the full medical history of a patient involved in a reported motorcycle crash.
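The core of a linkage attack like Sweeney’s fits in a few lines. The records below are invented and the fields simplified, but the logic of joining two datasets on shared quasi-identifiers is the same:

```python
# Toy linkage attack: join an "anonymized" dataset with public information
# on shared quasi-identifiers. All records below are invented.
hospital_records = [  # no names, but quasi-identifiers remain
    {"zip": "12601", "birth_year": 1989, "sex": "M", "diagnosis": "fracture"},
    {"zip": "12601", "birth_year": 1954, "sex": "F", "diagnosis": "flu"},
]
news_stories = [      # public stories naming people and incidental details
    {"name": "John Smith", "zip": "12601", "birth_year": 1989, "sex": "M"},
]

for story in news_stories:
    matches = [r for r in hospital_records
               if (r["zip"], r["birth_year"], r["sex"])
               == (story["zip"], story["birth_year"], story["sex"])]
    if len(matches) == 1:  # a unique match reidentifies the record
        print(f"{story['name']} -> {matches[0]['diagnosis']}")
```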

AI is making deanonymization faster and cheaper

For years, the main barrier to reidentification was practical: it took time, patience, and manual cross-referencing. AI is eroding that incidental protection.

Research shows that large language models (LLMs) can analyze someone’s posts across platforms, cross-reference public information, and identify anonymous users with incredible precision. In one study on at-scale deanonymization, LLM-based methods identified up to 68% of people, and when they made a match, they were correct 90% of the time.

Sweeney had to pay only $50 for a dataset of hundreds of thousands of records. Today, LLMs can deanonymize profiles for $1-4 each and do the work automatically. They also don’t need clean, structured datasets and can spot patterns in ordinary posts and comments.

As one of the researchers puts it:

“Ask yourself: Could a team of smart investigators figure out who you are from your posts? If yes, LLM agents can likely do the same, and the cost of doing so is only going down.”

Protect your privacy by minimizing and encrypting data

Anonymizing data is not enough on its own, because reidentification only takes someone connecting the dots. The best way to protect yourself is to minimize your digital footprint, so there are fewer dots to connect in the first place.

You don’t have to go off the grid, but you should be more deliberate about what and how you share. Here are some practical tips:

Compartmentalize your identity to protect against cross-referencing

When you use the same email and username on all platforms, your details are easy to put together. It’s simple to generate different usernames for different accounts, but using unique email addresses for everything can be a nightmare unless you use email aliases.

Aliases create separate addresses that forward messages to your main inbox without exposing your real email address and identity. If you use a unique email alias for every service, you can see where a leak or sale came from.

For example, if you create one alias only for Company A and later receive emails to that alias from Company B, you know Company A either shared, sold, leaked, or lost control of your address. You can then disable that alias without affecting your main inbox or your other aliases.
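The bookkeeping behind this is simple enough to sketch. The addresses and service names below are invented, and a real alias service would generate and forward the addresses for you:

```python
# Toy bookkeeping for per-service aliases. Addresses and names are invented;
# a real alias service generates and forwards these addresses for you.
aliases = {
    "company-a": "mk3x7f@aliasprovider.example",
    "company-b": "q81vzn@aliasprovider.example",
}

def who_leaked(recipient_alias: str, sender_domain: str) -> str | None:
    """Mail arriving on an alias from a sender that alias was never given to
    means the service that owns the alias shared, sold, or lost your address."""
    for service, alias in aliases.items():
        if alias == recipient_alias and service not in sender_domain:
            return service
    return None

# The alias created only for Company A starts receiving Company B's mail:
print(who_leaked("mk3x7f@aliasprovider.example", "company-b.example"))
# 'company-a' is the culprit; disable that alias and the rest are unaffected.
```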

Be inconsistent to protect against identifiable patterns

The more consistent your details are across platforms, the easier it is to build a unique profile around you. Where possible, avoid giving more information than necessary.

For instance, use a general location instead of your exact city, round your age, and skip optional fields. Also consider varying the writing habits that could identify you, such as repeated phrases, distinctive punctuation, or recurring typos, to limit automated identification.

Limit your digital footprint to protect against AI analysis

LLMs can identify people by finding patterns in posts and writing. The less public content tied to your identity, the less material there is to work with. Consider how much personal detail you reveal when posting — not just facts, but habits, opinions, and recurring topics that make you stand out. Be sure to opt out of AI training on as many platforms as possible.

Use end-to-end encrypted services to protect against data collection

Encryption doesn’t just protect data from hackers; it also limits what can be read in the first place. An email provider that can’t read your messages can’t scan them for advertising, use them for AI training, or share insights with brokers.

Use end-to-end encrypted email for private communications, secure cloud storage to safely store and share files, and a no-logs VPN to encrypt your browsing activity — all of which reduce the amount of data you expose unwillingly.
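To illustrate the principle that data is encrypted before it ever leaves your device, here is a minimal sketch using the third-party Python cryptography package. Real end-to-end encrypted services use public-key cryptography and careful key management; this symmetric example only shows why a server that holds ciphertext, but not the key, has nothing to scan or sell:

```python
# Minimal sketch of client-side encryption with the third-party 'cryptography'
# package (pip install cryptography). The key never leaves your device, so
# the server only ever stores ciphertext it cannot read, scan, or sell.
from cryptography.fernet import Fernet

key = Fernet.generate_key()  # stays local; never sent to the provider
f = Fernet(key)

ciphertext = f.encrypt(b"Meet at 10am. Draft contract attached.")
# Upload `ciphertext`: the provider holds opaque bytes, nothing more.

plaintext = f.decrypt(ciphertext)  # only possible with the locally held key
assert plaintext == b"Meet at 10am. Draft contract attached."
```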

Opt out of data collection to protect against brokers

It is possible to remove personal information from the internet, even from data brokers, but it takes persistence. It won’t stop future data collection, but it can give you a fresh start. Going forward, minimizing your digital footprint and encrypting your data where possible will help limit what gets collected.

[Diagram: how to make yourself more anonymous]

Anonymization is not a privacy guarantee

The main takeaway is that “anonymized” does not always mean safe, permanent, or impossible to trace. The less personal information you share, the less consistent you are across platforms, and the more control you keep over your accounts and aliases, the fewer signals there are to link back to you.

Your data may be anonymized on paper, but your strongest protection starts before that point: with what and where you choose to share, and how easily it can be connected to the rest of your digital life. That also means being intentional about the services you use every day, and the companies that own them.

Proton apps are open source, ad-free, and designed to avoid tracking and AI training on any of your data. With end-to-end encryption, zero-access encryption, and a business model funded exclusively by our community of paying subscribers, we don’t need to exploit your data, we can’t read most of it, and we don’t want to.