Differential Privacy While Sharing Sensitive Data (I)

Som · Nerd For Tech · Jul 9, 2022

When data is shared publicly across companies, it can contain a great deal of private user information. Differential privacy lets us share such data while preserving the privacy of its attributes; it is a mathematical definition of privacy.

Differential privacy (DP) is a system for publicly sharing information about a dataset by describing the patterns of groups within the dataset while withholding information about individuals in the dataset. (Source: Wikipedia)
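In formal terms (this standard definition is not from the original post): a randomized mechanism M is epsilon-differentially private if, for any two datasets D and D′ that differ in a single individual's record, and for every set of possible outputs S,

Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S]

Intuitively, the output of M barely changes whether or not any one individual's data is included.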

There are two types of differential privacy:

  • Local differential privacy, where noise is added on each user's device before the data ever reaches the collector.
  • Global differential privacy, where a trusted curator holds the raw data and adds noise to the results of queries over it.

Source: DataCamp, Data Privacy and Anonymization in Python

Epsilon differential privacy

Epsilon is the measure of privacy.

Source: DataCamp

We can manage differential privacy by using epsilon differential privacy, where epsilon is the privacy measure: the lower the epsilon, the higher the privacy. As we increase the epsilon value, the privacy loss grows exponentially. Epsilon values between 0 and 1 are considered highly private, values between 2 and 10 are considered better than nothing, and values above 10 come close to sharing the exact data, exposing user privacy.

Because privacy loss scales as e^epsilon, a system with epsilon = 1 is about e^9 ≈ 8,000 times more private than one with epsilon = 10.
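To make the effect of epsilon concrete, here is a minimal sketch of the Laplace mechanism, the classic way to achieve epsilon-differential privacy for a numeric query. The function name and the example numbers are my own illustration, not from the DataCamp course:

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release a noisy version of true_value under epsilon-DP.

    The Laplace mechanism adds noise drawn from Laplace(0, sensitivity/epsilon):
    smaller epsilon -> larger noise scale -> stronger privacy.
    """
    rng = rng or np.random.default_rng()
    scale = sensitivity / epsilon
    return true_value + rng.laplace(loc=0.0, scale=scale)

true_count = 1000  # e.g., how many users clicked a button (toy value)
for eps in [0.1, 1.0, 10.0]:
    noisy = laplace_mechanism(true_count, sensitivity=1.0, epsilon=eps)
    print(f"epsilon={eps:>4}: released count = {noisy:.1f}")
```

Running this a few times shows the trade-off directly: at epsilon = 0.1 the released count swings wildly, while at epsilon = 10 it is almost the exact value.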

K-anonymity

Source: DataCamp

Let's see what difference differential privacy actually makes.

Suppose we have to share data from one organization with another, but at the same time we want to keep the user data private. Let's assume we are sharing some floating-point variable across the organizations. The resulting non-private and private histograms look like the following.

Source: DataCamp

Here you'll notice that the two histograms look almost the same along the x-axis, but the y-axis counts have been privatized.
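Here is a minimal sketch of how such a private histogram can be built with NumPy; the synthetic data, the bin count, and the epsilon are my own assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
values = rng.normal(loc=50.0, scale=10.0, size=10_000)  # toy floating-point data

# Non-private histogram: exact bin counts.
counts, bin_edges = np.histogram(values, bins=20)

# Private histogram: add Laplace noise to each bin count.
# Each record falls in exactly one bin, so the sensitivity of the
# whole histogram is 1, and the noise scale is 1/epsilon per bin.
epsilon = 1.0
noisy_counts = counts + rng.laplace(loc=0.0, scale=1.0 / epsilon, size=counts.shape)
noisy_counts = np.clip(np.round(noisy_counts), 0, None)  # counts can't be negative

print("exact:  ", counts[:5])
print("private:", noisy_counts[:5])
```

The overall shape of the distribution survives, while the exact count in any single bin is obscured.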

Why do organizations such as Apple use differential privacy?

It enables Apple to learn about the user community without learning about individuals in the community. Differential privacy transforms the information shared with Apple before it ever leaves the user’s device such that Apple can never reproduce the true data.
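Apple's system is an instance of local differential privacy. As a conceptual sketch only, here is the simplest local mechanism, randomized response; Apple's production algorithms are more elaborate sketch-based mechanisms, so this is not Apple's actual method:

```python
import numpy as np

def randomized_response(truth, epsilon, rng):
    """Locally privatize one user's yes/no answer before it leaves the device.

    With probability e^eps / (e^eps + 1) report the truth, otherwise flip it.
    This satisfies epsilon-local-DP: the collector never sees the raw answer.
    """
    p_truth = np.exp(epsilon) / (np.exp(epsilon) + 1.0)
    return truth if rng.random() < p_truth else not truth

rng = np.random.default_rng(0)
true_answers = rng.random(100_000) < 0.3          # 30% of users truly say "yes"
reported = [randomized_response(a, epsilon=1.0, rng=rng) for a in true_answers]

# The collector can de-bias the aggregate without knowing any individual's truth.
p = np.exp(1.0) / (np.exp(1.0) + 1.0)
observed = np.mean(reported)
estimate = (observed - (1 - p)) / (2 * p - 1)
print(f"observed: {observed:.3f}, de-biased estimate of true rate: {estimate:.3f}")
```

This is exactly the property described above: the collector can recover an accurate population-level rate, but no individual's reported answer can be trusted as their true one.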

Privacy budget of an organization

Different organizations set different privacy budgets. You can think of the privacy budget as a measure of the extent to which the organization collects data from its users: every collection spends part of this budget.

Source: DataCamp

Epsilons add up under sequential composition: if a third party queries the data curator four times with epsilon = 1 per query, it effectively collects data at epsilon = 4. The results of the individual queries can be aggregated, steadily reducing the user's privacy.

Source: DataCamp
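A toy sketch of how a curator might enforce such a budget, assuming basic sequential composition (the epsilons of successive queries add up); the class and its API are my own illustration:

```python
class PrivacyBudget:
    """Toy tracker for a curator's total privacy budget."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon):
        """Spend epsilon from the budget, refusing queries once it runs out."""
        if self.spent + epsilon > self.total:
            raise RuntimeError("privacy budget exhausted; query refused")
        self.spent += epsilon

budget = PrivacyBudget(total_epsilon=3.0)
for i in range(4):
    try:
        budget.charge(1.0)
        print(f"query {i + 1} answered, spent so far: {budget.spent}")
    except RuntimeError as err:
        print(f"query {i + 1} rejected: {err}")
```

Here the fourth epsilon = 1 query is refused, because answering it would push the total privacy loss past the budget of 3.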

Then what is private enough?

Top emojis used by English-speaking users, as collected by Apple:

Sources:

  1. Data Privacy and Anonymization in Python, DataCamp
  2. Wikipedia article on differential privacy
  3. Apple Differential Privacy Overview: https://www.apple.com/privacy/docs/Differential_Privacy_Overview.pdf

Stay tuned for the next blog on calculating privacy budgets and making machine learning models differentially private. If you liked this blog, please don't forget to upvote and leave a comment; it encourages me to write more 😃
