The world of online shopping is growing extremely quickly. As a consequence, targeting has become increasingly important, and companies are making great use of the enormous amounts of data available to them. Cookies, implemented on almost every website nowadays, make it possible for companies to follow our online behaviour click by click. This way, they can serve personalised advertisements, showing us relevant products while their targeting effectiveness increases massively. However, this growing world of data collection and tailored targeting also brings serious dangers for customers. Price discrimination based on individual characteristics, or the reconstruction of someone’s health records, are real threats if the data ends up in the hands of those who put profit over ethics. Our privacy is shrinking rapidly while companies keep collecting more and more data. How can our privacy be guaranteed while keeping data collection useful for companies?
One of the methods used to preserve privacy is called k-anonymity [1]. K-anonymity builds on two simple techniques that make it harder to re-identify individuals. The first is suppression, which simply means that certain variables, such as your name, are removed from the dataset. The second is generalization, which means that exact, specific values are turned into more general ones, for example by using age intervals instead of someone’s exact age. Applying one or both of these techniques yields a k-anonymized dataset, in which each person cannot be distinguished from at least k-1 other individuals in the dataset.
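To make this a bit more concrete, here is a minimal sketch of how a small table could be k-anonymized with pandas. The column names, values, and the choice of k are all invented for illustration; direct identifiers are suppressed and the quasi-identifiers (age, postcode) are generalized until every combination occurs at least k times.

```python
import pandas as pd

# A toy dataset; names and values are made up for illustration.
df = pd.DataFrame({
    "name":     ["Alice", "Bob", "Carol", "Dave", "Eve", "Frank"],
    "age":      [23, 27, 34, 36, 31, 25],
    "zipcode":  ["1011AB", "1011CD", "1012EF", "1012GH", "1012IJ", "1011KL"],
    "purchase": ["book", "phone", "book", "laptop", "book", "phone"],
})

# Suppression: drop direct identifiers entirely.
anonymized = df.drop(columns=["name"])

# Generalization: replace exact ages with intervals and zip codes with a prefix.
anonymized["age"] = pd.cut(anonymized["age"], bins=[20, 30, 40], labels=["20-30", "30-40"])
anonymized["zipcode"] = anonymized["zipcode"].str[:4]

# Check k-anonymity: every combination of quasi-identifiers must occur at least k times.
k = 3
group_sizes = anonymized.groupby(["age", "zipcode"], observed=True).size()
print(anonymized)
print(f"Dataset is {k}-anonymous: {(group_sizes >= k).all()}")
```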
K-anonymity seems to provide a safe way to preserve the privacy of individual customers, but Narayanan and Shmatikov (2008) showed that it is not as easy as simply removing data to make sure no personal information can be recovered [2]. Netflix used k-anonymity to protect the privacy of its customers, omitting certain personal data so that individuals could not be identified. Narayanan and Shmatikov, however, combined Netflix’s dataset with a dataset from IMDb and were able to re-identify individual Netflix customers. Although customers could not be distinguished from the Netflix dataset alone, linking it to a different dataset containing similar information made identification possible again, since a significant number of people likely appear in both Netflix and IMDb. This shows that the availability of multiple datasets exposes the weakness of merely omitting certain variables. K-anonymity therefore does not seem to be a suitable solution, so what other methods could provide a privacy guarantee?
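The core of such a linkage attack can be sketched in a few lines. The tables and the columns used for matching below are invented, but the idea is the same: records that look anonymous in one release become identifiable once an auxiliary dataset shares enough quasi-identifiers.

```python
import pandas as pd

# "Anonymized" ratings released by a streaming service (no names); toy data.
released = pd.DataFrame({
    "user_id": [101, 102, 103],
    "movie":   ["Heat", "Alien", "Heat"],
    "date":    ["2008-01-03", "2008-01-05", "2008-02-11"],
    "rating":  [5, 4, 2],
})

# Public auxiliary data, e.g. ratings posted under a real name on another site.
public = pd.DataFrame({
    "name":  ["J. Smith", "A. Jones"],
    "movie": ["Heat", "Alien"],
    "date":  ["2008-01-03", "2008-01-05"],
})

# Joining on the shared quasi-identifiers (movie, date) re-identifies users.
linked = released.merge(public, on=["movie", "date"])
print(linked[["user_id", "name", "movie", "rating"]])
```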
This brings us to differential privacy [3]. This method focuses on adding noise to the data, meaning that some of the information in the dataset is perturbed, for example by drawing the noise from a normal distribution. This way, the overall output of data analyses remains useful, since over a large enough dataset the added noise averages out, while the privacy of individuals is still protected. In essence, a method is differentially private if comparing its outputs on two datasets that are identical, except that in one of them a single observation is left out, yields (almost) the same result. From an individual’s perspective, this means you should be indifferent about sharing your information, because the company gains essentially no new information from your participation in the dataset.
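A minimal sketch of this idea is shown below. While the noise can be Gaussian, the sketch uses the classic Laplace mechanism on a simple counting query; the data, the query, and the parameter value are all illustrative. Formally, a mechanism is ε-differentially private if, for any two datasets differing in one record, the probability of any output changes by at most a factor e^ε.

```python
import numpy as np

rng = np.random.default_rng(0)

def private_count(data, predicate, epsilon):
    """Differentially private counting query using the Laplace mechanism.

    A count has sensitivity 1 (adding or removing one person changes it
    by at most 1), so Laplace noise with scale 1/epsilon gives
    epsilon-differential privacy.
    """
    true_count = sum(predicate(x) for x in data)
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Toy data: ages of customers (made up for illustration).
ages = [23, 27, 34, 36, 31, 25, 41, 29, 38, 33]

# How many customers are over 30? The noisy answer is close to the truth,
# but no single person's presence can be inferred from it.
print(private_count(ages, lambda a: a > 30, epsilon=0.5))
```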
The concept of differential privacy might sound a bit vague and abstract, but it actually removes a lot of the privacy risks. With differential privacy, you can protect any type of information, no matter what a potential attacker might already know about the dataset. Furthermore, this method makes it possible to quantify the loss of privacy: differential privacy works with a privacy budget, where a higher privacy budget corresponds to a higher privacy loss. This way, we can say something concrete about how much privacy is lost when the privacy budget is slightly increased.
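To see how the privacy budget quantifies this trade-off, the small sketch below (reusing the Laplace-noise idea from above, with purely illustrative numbers) compares noisy answers for different values of the budget ε: a larger budget means less noise and more accurate answers, but also a larger bound e^ε on how much one person’s data can shift the output, i.e. more privacy loss.

```python
import numpy as np

rng = np.random.default_rng(1)
true_answer = 42  # the exact result of some counting query

for epsilon in [0.1, 1.0, 10.0]:
    # Laplace noise with scale 1/epsilon: a larger budget shrinks the noise.
    noisy = true_answer + rng.laplace(scale=1.0 / epsilon, size=5)
    print(f"epsilon={epsilon:>4}: {np.round(noisy, 1)}")
```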
The explanation above probably still sounds a bit vague, so if you want to learn more about differential privacy, I recommend checking out Ted is writing things [4]. He provides a clear and easy-to-read explanation of differential privacy and elaborates on the mathematical definition.
Differential privacy seems to provide a strong foundation in the search for methods that strike the “right” balance between privacy and targeting utility. However, much is still unknown about its effectiveness, and with data collection growing rapidly while also becoming more detailed, it will be exciting to see what the future brings.
References
[1] Samarati, P. and Sweeney, L. (1998). Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression.
[2] Narayanan, A. and Shmatikov, V. (2008). Robust de-anonymization of large sparse datasets. In 2008 IEEE Symposium on Security and Privacy (sp 2008), pages 111–125. IEEE.
[3] Dwork, C. (2006). Differential privacy. In Automata, Languages and Programming (ICALP 2006), Lecture Notes in Computer Science, vol. 4052, pages 1–12. Springer.
[4] Ted is writing things: https://desfontain.es/privacy/friendly-intro-to-differential-privacy.html