
Database anonymization techniques

May 23, 2020

Some time ago, I was tasked with anonymizing (i.e. removing identifying particulars or details, typically for statistical or other purposes) a (small) production database before giving it to a partner for functional testing. It is not that I do not trust the partner, but my employer values customer privacy a lot, so anonymization is a must. In this post I am exploring the techniques that are available to achieve anonymity while preserving many characteristics of the database.

Database anonymization is not a new topic. Depending on your purpose and your time and effort budget, you might want to choose one technique over another.

Having said that, I am not totally sure these techniques are useful with regard to GDPR (the official GDPR text is too lengthy; I would very much appreciate it if someone provided me with a simple and concise summary; as far as I am concerned, you would be better off asking users for their consent before storing their data, before even thinking about anonymization). My intention is to focus on replicating the production environment as closely as possible in the context of functional testing and analytics.

Suppression or masking

The idea is simple: replace the piece of data you want to anonymize with blank or placeholder values. This is probably the simplest technique of the bunch, and the first one that came across my mind. Suppression can be done using just SQL, for example:

UPDATE users SET name = '<CONFIDENTIAL>';
UPDATE users SET email = NULL;

One variation of this technique is to suppress/mask only part of the data and expose the other parts for analytics purposes. For example, one might want to retain the country code portion of the phone number, or the country/city part of the address of their users.

UPDATE addresses SET street = '<CONFIDENTIAL>';
UPDATE users SET phone = substring(phone FROM 1 FOR 3) || 'XX XX XX';

This technique is quite useful for functional testing, and it can be used (with limitations) for analytics. However, there are quite a few downsides to it too:

  • It is completely ad hoc: you will need to be aware of all the fields/attributes of your data and apply the suppression/masking to every relevant field to achieve complete anonymity.
  • It is possible to break constraints such as uniqueness or foreign key existence with this technique, so it is not always feasible (see the sketch after this list).
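
To make the constraint issue concrete, here is a minimal sketch, assuming a hypothetical unique index on users.email and an id primary key (the column names are illustrative):

-- With a UNIQUE constraint on email, masking every row with the same
-- placeholder fails as soon as the second row gets the duplicate value:
UPDATE users SET email = 'confidential@example.invalid';

-- Keeping the masked values distinct, e.g. by mixing in the primary key,
-- hides the real address while still satisfying the constraint:
UPDATE users SET email = 'user-' || id || '@example.invalid';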

Substitution, with random and fake values

I see this technique more as a complement to suppression/masking. Remember those faker libraries (the original is the Perl Data::Faker library; there are ports for most mainstream languages) we usually use to generate life-like random data during testing? It turns out there are several faker-like libraries (for example https://github.com/garysweaver/pg_faker, a Ruby gem, though it looks unmaintained to me) that can generate plausible data directly in your database.

Besides all the downsides of suppression/masking, this technique has a few downsides specific to it too:

  • It requires generating fake values, which may or may not be easy depending on your database engine. Even for PostgreSQL, one of the most popular database engines, support is still quite limited. In the worst-case scenario, where extensions or pre-made solutions are not available, you will need to write the solution yourself.
  • Data processing takes time; even with a database engine extension, it could still be infeasible if you need to generate millions of rows of fake data.

Having said that, this technique does fix the constraint issue of suppression/masking to some degree, and it is probably the one that provides the highest anonymity level if done correctly.
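
As a minimal sketch of what plain-SQL substitution could look like without any faker extension (assuming the users table has an id primary key; the generated values are deliberately simple), you can derive fake but unique values from the primary key and add a bit of noise with md5():

-- Substitute names and emails with synthetic values.
-- Deriving them from the primary key keeps uniqueness intact, and
-- md5(random()::text) adds noise so the values do not look sequential.
UPDATE users
SET name  = 'User ' || id,
    email = 'user-' || id || '-' || substr(md5(random()::text), 1, 6) || '@example.invalid';

A faker-style extension would of course produce more plausible-looking names and addresses, but the idea is the same.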

Sampling

Depending on your requirements, you might be able to get away with taking a small portion of your total dataset, or in other words, sampling. Sampling does provide anonymity to some degree, and it is very useful if your production dataset is large. A portion of real data often proves more useful than a full set of garbage, and the reduced data volume is very convenient if the data is intended for use in smaller environments such as local/development or staging/QA.

If you are using PostgreSQL, there is pg_sample, a command-line utility that helps you do sampling easily while still retaining referential integrity.
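
Without a tool like pg_sample, a naive sample can still be taken in plain SQL, for example with TABLESAMPLE (available since PostgreSQL 9.5). This is only a sketch; the users_sample name and the 1% rate are illustrative:

-- Copy roughly 1% of the users table into a separate table,
-- sampling independently row by row.
CREATE TABLE users_sample AS
SELECT * FROM users TABLESAMPLE BERNOULLI (1);

Note that this does not follow foreign keys: rows in child tables that reference users missing from the sample are left dangling, which is exactly the integrity problem pg_sample is meant to solve.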

Like the previously mentioned techniques, sampling comes with its downsides:

  • Without a special tool, it is often hard (or impossible) to maintain data integrity.
  • The anonymity level is low, but this can be overcome by combining sampling with other techniques.
  • Depending on your sampling method, analytics over the data may no longer be useful, since the sample might not represent the larger dataset well. Even if you try your best, the analytics still need to take into account the variance and deviations introduced by sampling.

Shuffling

The final technique I am presenting, and also the one I ended up with, is shuffling. The idea is to shuffle the data around so it is no longer possible to identify or link a piece of data with a particular user. This is especially useful for analytics purposes, since the entries as a whole are not altered at all, retaining many of their characteristics such as the sum (total), average, min, max, etc.

It requires a somewhat complicated SQL query, but it can still be done entirely in SQL, giving it an edge if you are working with less popular database engines. Following is the actual SQL I used, to give you some idea:

-- Build two independently shuffled numberings of the same table, then
-- copy the content of one random row onto another random row.
WITH
m1 AS (
  -- Content values in one random order
  SELECT row_number() OVER (ORDER BY random()) AS n,
         content AS content1
  FROM messages
),
m2 AS (
  -- Row ids in another random order
  SELECT row_number() OVER (ORDER BY random()) AS n,
         id AS id2
  FROM messages
)
UPDATE messages
SET content = m1.content1
FROM m1 JOIN m2 ON m1.n = m2.n
WHERE id = m2.id2;

The main considerations for shuffling are the anonymity level and the volume of the data. For my use-case, which is functional/performance testing, I considered it the best overall, providing a great balance between simplicity, usefulness and anonymity level.
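
As a quick sanity check on the shuffled table from the example above (just a sketch over the same messages table), a few aggregates can be compared before and after the shuffle; since shuffling only reassigns existing values to different rows, the results should be identical:

-- Run before and after shuffling; the numbers should not change,
-- because shuffling only moves existing values between rows.
SELECT count(*)                AS total_rows,
       count(DISTINCT content) AS distinct_contents,
       min(length(content))    AS min_length,
       max(length(content))    AS max_length
FROM messages;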

Conclusion

In the end, I went with shuffling, considering our database is small enough that the shuffle queries can run in a reasonable amount of time. Following is a quick decision table you can use as a reference:

Technique              Applicable data types                  Use-case
Suppression / masking  Phone number, Address, Email address   Analytics
Substitution           Any data type                          Functional testing
Sampling               Any data type                          Functional testing, Analytics (limited)
Shuffling              Any data type                          Functional testing, Analytics