The intent behind data anonymization is privacy protection. Typically this is done by either encrypting the data or removing Personally Identifiable Information (PII) from the data set. In theory, data that has been anonymized is supposed to be irreversibly altered so that the subject cannot be identified, directly or indirectly. De-anonymization is the process of cross-referencing anonymous data with other sources of data to re-identify the person behind it. This is not to be confused with pseudonymization, which is the process of obscuring data with the intent of being able to re-identify it later on. Storing data this way is still considered to be HIPAA compliant.
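To make the difference between anonymization and pseudonymization concrete, here is a minimal Python sketch. The record, the field names, and the demo key are all made up for illustration, and real systems are more involved than this; the point is only that one approach throws the identity away while the other keeps a way back to it.

```python
import hashlib
import hmac

# Hypothetical record: every field name and value here is invented.
record = {"name": "Jane Doe", "email": "jane@example.com", "state": "OH", "age": 41}

# Assumed list of direct identifiers for this toy example.
PII_FIELDS = {"name", "email"}


def anonymize(rec):
    """Drop the direct identifiers, keeping no way to get them back."""
    return {key: value for key, value in rec.items() if key not in PII_FIELDS}


def pseudonymize(rec, secret_key):
    """Replace the direct identifiers with a keyed token.

    Anyone holding secret_key can recompute the token for a known person
    and re-link the record, which is the purpose of pseudonymization.
    """
    identity = "|".join(str(rec[key]) for key in sorted(PII_FIELDS))
    token = hmac.new(secret_key, identity.encode(), hashlib.sha256).hexdigest()
    result = anonymize(rec)
    result["subject_token"] = token
    return result


anon = anonymize(record)
pseudo = pseudonymize(record, b"demo-key-not-for-real-use")
print(anon)
print(pseudo)
```

Notice that both outputs still carry the age and the state. Fields like these are not PII on their own, but they are exactly the kind of leftover detail that de-anonymization takes advantage of.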
I am mainly going to focus on genetic data for this blog. The direct-to-consumer genetic testing industry just keeps expanding. Not only do the options for where you can test keep increasing, but more and more people are testing, which leads to larger databases. All of the big companies assure you in their privacy policies that your genetic data is safe because it is de-identified. The only problem is that this is not necessarily true. In most cases the data is in fact de-identified (although that isn’t always the case either); however, they fail to mention that your genetic data can be re-identified. There are many cases of genetic data being re-identified, far more than any of these companies would like you to know about.
In January 2013, a researcher at the Whitehead Institute (affiliated with MIT) was able to track down 5 individuals who were randomly selected from a DNA database using only their DNA, ages, and the states they lived in. The worst part is that it only took him a few hours. Not only that, but he was also able to find 50 relatives of those individuals. That is something important to remember: when your DNA data is exposed, it isn’t just YOUR DNA but also the DNA of your relatives and descendants, who share a percentage of your genes. I provide more examples of re-identification in this blog.
We’ve been able to identify an individual with as few as 30 to 80 SNPs (Single Nucleotide Polymorphisms). Keep in mind that you have approximately 5 million of those in total. It’s even easier to identify you if your heritage is primarily Northern European, because this group accounts for a HUGE percentage of the genomic data available in public databases. When it comes to public databases, it’s easy to identify an individual if they are a third-cousin match or closer. (The further removed the match, the harder it becomes.)
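If 30 SNPs sounds too small to single out one person, a bit of back-of-the-envelope arithmetic helps. The sketch below assumes each SNP has three possible genotypes (for example AA, AG, GG), that SNPs are independent of each other, and that the world population is roughly 8 billion. Those are my simplifying assumptions, not measured values, and real SNPs are correlated and unevenly distributed, so treat the result as an upper bound on how many distinct profiles are possible.

```python
WORLD_POPULATION = 8_000_000_000  # rough figure, assumed for illustration
GENOTYPES_PER_SNP = 3  # e.g. AA, AG, GG for a SNP with two alleles


def distinct_profiles(num_snps):
    """Upper bound on distinct genotype combinations across num_snps SNPs."""
    return GENOTYPES_PER_SNP ** num_snps


for n in (10, 30, 80):
    profiles = distinct_profiles(n)
    print(f"{n} SNPs: {profiles:.2e} combinations, "
          f"more than world population: {profiles > WORLD_POPULATION}")
```

Under these assumptions, 10 SNPs give only 59,049 combinations, but by 30 SNPs the number of possible combinations is already far larger than the number of people alive.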
I am going to show you some samples from public databases after we upload raw DNA, as well as free software we can use to analyze it. I will not be including any real names or personal information; however, I will be using real raw DNA that I have permission to use.
The main reason people upload their raw DNA is that they want to know more about their family history or to locate family. So, more often than not, users include their full names AND email addresses, making it even easier to identify an anonymous user. Many of these individuals will also link their publicly accessible pedigrees. There are also sites like familysearch.org that let you search through records, family trees, and other sources for free (after you create an account). Using two or more sources or pieces of information to identify people is called jigsaw re-identification. To help prove my point about the power of raw DNA combined with Open Source Intelligence (OSINT), click here to see a list of criminals who were identified with the help of GEDmatch and genetic genealogists. Note: Starting in May 2019 (due to privacy complaints), GEDmatch gives you the option to ‘opt out’ of allowing your DNA to be involved in searches and comparisons for law enforcement.
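Here is a rough Python sketch of what jigsaw re-identification looks like in practice. Every record below is invented, and the `surname_hint` field is a hypothetical stand-in for whatever a DNA match might suggest about a family name. The idea is simply that no single field gives a person away, but the combination of several fields from two different sources can.

```python
# All records below are invented for illustration.
deidentified_dna = [
    {"sample_id": "S1", "age": 41, "state": "OH", "surname_hint": "Doe"},
    {"sample_id": "S2", "age": 67, "state": "UT", "surname_hint": "Roe"},
]

public_tree_profiles = [
    {"full_name": "Jane Doe", "age": 41, "state": "OH"},
    {"full_name": "John Doe", "age": 70, "state": "OH"},
    {"full_name": "Richard Roe", "age": 67, "state": "UT"},
]


def jigsaw_link(samples, profiles):
    """Match each de-identified sample to public profiles that agree on
    age, state, and surname. Returns {sample_id: [candidate names]}."""
    links = {}
    for sample in samples:
        candidates = [
            profile["full_name"]
            for profile in profiles
            if profile["age"] == sample["age"]
            and profile["state"] == sample["state"]
            and profile["full_name"].split()[-1] == sample["surname_hint"]
        ]
        links[sample["sample_id"]] = candidates
    return links


links = jigsaw_link(deidentified_dna, public_tree_profiles)
for sample_id, names in links.items():
    print(sample_id, names)
```

In this toy data each sample narrows down to exactly one name. With real data you would usually end up with a short list of candidates instead, which is where public pedigrees and record searches come in.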
Next I’m going to show samples of comparing matches on the 23rd chromosome. This is the chromosome that determines your sex. Women inherit an X-chromosome from each parent. A man inherits an X-chromosome from his mother and a Y-chromosome from his father. Most public genetic databases focus primarily on autosomal DNA. (This is DNA that is not involved in determining sex, and it is not passed down intact from generation to generation the same way that X and Y-DNA are.) All of the main direct-to-consumer genetic testing companies primarily use autosomal SNPs to determine biological closeness and ethnicity. This testing examines over 700,000 SNPs.
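To give a feel for how autosomal SNPs can be compared, here is a simplified Python sketch. It assumes a 23andMe-style raw data layout (tab-separated rsid, chromosome, position, genotype, with `#` comment lines), and the rsids and genotypes are made up. It is not how any testing company or public database actually scores matches; real tools look at long shared segments rather than a simple percentage. It only shows the basic idea of lining up two people's SNPs and counting where they share at least one allele.

```python
# Invented sample data in an assumed 23andMe-style layout.
RAW_A = """# rsid\tchromosome\tposition\tgenotype
rs0001\t1\t1000\tAG
rs0002\t1\t2000\tCC
rs0003\t2\t1500\tTT
rs0004\tX\t500\tAA
rs0005\t3\t700\t--
"""

RAW_B = """# rsid\tchromosome\tposition\tgenotype
rs0001\t1\t1000\tAA
rs0002\t1\t2000\tTT
rs0003\t2\t1500\tCT
rs0004\tX\t500\tAA
rs0005\t3\t700\tGG
"""

AUTOSOMES = {str(n) for n in range(1, 23)}  # chromosomes 1 through 22


def parse_raw(text):
    """Return {rsid: genotype} for autosomal SNPs, skipping no-calls ('--')."""
    genotypes = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        rsid, chrom, _position, genotype = line.split("\t")
        if chrom in AUTOSOMES and "-" not in genotype:
            genotypes[rsid] = genotype
    return genotypes


def half_identical_fraction(a, b):
    """Fraction of SNPs present in both files that share at least one allele."""
    shared = a.keys() & b.keys()
    if not shared:
        return 0.0
    hits = sum(1 for rsid in shared if set(a[rsid]) & set(b[rsid]))
    return hits / len(shared)


person_a = parse_raw(RAW_A)
person_b = parse_raw(RAW_B)
print(f"Compared {len(person_a.keys() & person_b.keys())} autosomal SNPs")
print(f"Sharing at least one allele: {half_identical_fraction(person_a, person_b):.2f}")
```

The X-chromosome row and the no-call row are left out of the comparison on purpose, since this sketch only looks at autosomal SNPs with a usable genotype.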