Promises of Privacy and Anonymity
The consumer genetic testing industry is expected to be worth $45 billion by 2024. It has grown rapidly, especially in the last decade. All it takes is mailing in a saliva sample, and in less than six weeks you can discover relatives you never knew existed, get an ethnicity estimate, find out whether you carry a genetic mutation (one that could predispose you to certain diseases or health complications) and much more. There is one huge drawback though: genomic data is the ultimate individual identifier.
All of the large consumer genetic testing companies claim in their privacy policies that customers' genetic data is de-identified. In most cases the company lets you choose whether to opt in to or out of genetic research with a third party. (Approximately 80% of 23andMe customers opt in to allow their de-identified genetic data to be used for research.) Drugmaker GlaxoSmithKline PLC recently invested $300 million in 23andMe in a deal that includes access to its de-identified genetic database. In most cases your PII (Personally Identifiable Information) is stored completely separately from any genetic information. Your personal information is assigned a random customer identification number, and your genetic data is identified only by a barcode. Keeping PII and genetic data in physically separate computing environments is considered the industry standard for security.
Consumers' Top Concerns
Most people that I've talked to are most concerned about insurance companies and employers getting hold of this information, and about what could happen afterward (e.g. denial of coverage, loss of employment, discrimination). Another concern is that your genomic data doesn't just reveal a lot about you; it also reveals a lot about those closely related to you.
To add to it: law enforcement can now obtain anyone's DNA data from one of the testing companies, and all it takes is a court order. Most of the time, though, forensic genealogists assist law enforcement by searching public genealogy databases, usually GEDmatch (gedmatch.com). GEDmatch is used by a lot of genealogists because it is free and you can upload your raw DNA data from almost any of the popular genetic testing companies. A perfect example of this method is the case of the Golden State Killer. The California police had the Golden State Killer's well-preserved DNA from one of the crime scenes. When they uploaded the DNA data they didn't find the killer himself in the database, but they found relatives close enough that they could use family trees and shared chromosome segments to determine the killer's identity and ultimately solve murders that had gone unsolved for decades. Solving this case was a positive thing; however, it made a lot of people start to question the ethics and laws surrounding our genetic privacy.
**Note: As of May 18th, 2019, GEDmatch revised its Terms of Service and Privacy Policy. Users can now choose to "opt in" or "opt out", defined as follows: 'Public + opt-in' DNA data is available for comparison to any Raw Data in the GEDmatch database using the various tools provided for that purpose. 'Public + opt-out' DNA data is available for comparison to any Raw Data in the GEDmatch database, except DNA kits identified as being uploaded for Law Enforcement purposes.**
The genetic testing companies' promise that our data is de-identified should make us feel better, right? Here's the problem though: with the use of machine learning, de-identified data can now be re-identified. Genomic data is highly distinguishable. There are approximately 5 million SNPs (Single Nucleotide Polymorphisms) in a person's genome, and it has been reported that a sequence of only 30 to 80 SNPs is enough to uniquely identify an individual. Keep in mind that genetic variation from individual to individual is only about 0.5%. Most genetic testing technologies already rely on some form of machine learning or AI (Artificial Intelligence) to function. DNA sequences are often unordered and unstructured; bioinformatics is what allows us to put the DNA in an order that makes it usable. Population geneticists use one-hot encoding to transform the DNA alphabet into a binary code, which puts the data into a format that can be used with deep learning. The data size is also reduced by throwing away all the non-mutated locations in the genome, because they don't carry information that is useful to us. (The only useful data is the part of your DNA that is different from other people's.)
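To make the encoding step concrete, here is a minimal Python sketch. It is not how any particular testing company or research pipeline does it: the sample sequences are made up, the function names are hypothetical, and the uniqueness estimate assumes every SNP is independent with three possible genotypes, which real genomes do not strictly satisfy.

```python
BASES = "ACGT"  # column order used for the one-hot vectors


def one_hot(seq):
    """Turn a DNA string into a list of 4-element 0/1 vectors, one per base.

    A character outside A/C/G/T (such as N) becomes an all-zero vector.
    """
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


def variable_sites(seqs):
    """Return the positions where equal-length, aligned sequences differ."""
    return [i for i, column in enumerate(zip(*seqs)) if len(set(column)) > 1]


def genotype_combinations(n_snps):
    """Count possible genotype patterns across n_snps sites.

    Assumption for illustration only: each SNP is independent and has
    three genotypes (two copies of one allele, one of each, two of the other).
    """
    return 3 ** n_snps


# Made-up aligned sequences for three hypothetical people.
samples = ["ACGTAC", "ACGTAC", "ACTTAG"]

# Keep only the positions that vary between people, then encode them.
sites = variable_sites(samples)
encoded = [one_hot("".join(s[i] for i in sites)) for s in samples]

print(sites)                      # positions that differ
print(encoded[2])                 # one-hot vectors for the third sample
print(genotype_combinations(30))  # patterns available from 30 SNPs
```

Under those simplifying assumptions, 30 SNPs already allow 3^30 (roughly 2 x 10^14) distinct patterns, far more than there are people on Earth, which gives some intuition for why so few SNPs can single out one individual.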
Examples of Re-identification
In 2008 James Watson (who, with Francis Crick, discovered the double-helix model of DNA) decided to release his sequenced genome to a public database; however, he chose to leave out his APOE gene. APOE (Apolipoprotein E) is a gene on chromosome 19 that is involved in making a protein that helps carry cholesterol and other types of fat in the bloodstream. The APOE E4 allele is the major known risk-factor gene for late-onset Alzheimer's disease. Later on, a statistical model was developed that was able to infer Watson's missing gene with a very high degree of confidence.
Also, in a recent study the authors were able to infer the identities of 50 anonymous male participants whose Y-DNA had been sequenced for the 1000 Genomes Project. The researchers not only discovered the identities of the anonymized men but also worked out who their family members were using publicly available pedigrees.
In Part 2 I discuss genetic privacy laws and what can be done: https://www.bentleybiosec.com/blog/2019/genetictestingconcerns2