Traditional methods of data de-identification obscure data values. For
example, you might truncate a date to just the year.
Differential privacy obscures query values by injecting enough noise
to keep from revealing information on an individual.
Let’s compare two approaches for de-identifying a person’s age:
truncation and differential privacy.
First consider truncating birth date to year. For example, anyone born
between January 1, 1955 and December 31, 1955 would be recorded as being
born in 1955. This effectively produces a 100% confidence interval that
is one year wide.
Next we’ll compare this to a 95% confidence interval using
ε-differential privacy.
Differential privacy adds noise in proportion to the sensitivity Δ of a
query. Here sensitivity means the maximum impact that one record could
have on the result. For example, a query that counts records has
sensitivity 1.
Suppose we query for average age, and that people live to a maximum of
120 years. Then in a database with n records, one person’s presence in
or absence from the database would make a difference of no more than
120/n years in the average, the worst case corresponding to the
extremely unlikely event of a database of n-1 newborns and one person
120 years old.
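
To see where the 120/n bound comes from, here is a minimal sketch of
that worst case. The function name worst_case_shift and the 120-year cap
are illustrative assumptions, not part of any library.

```python
def mean(values):
    return sum(values) / len(values)

def worst_case_shift(n, max_age=120.0):
    """Change in mean age when one person aged max_age joins n - 1 newborns."""
    if n < 2:
        raise ValueError("need at least two records to compare means")
    without_person = [0.0] * (n - 1)          # n - 1 newborns
    with_person = without_person + [max_age]  # add one person at the assumed cap
    return abs(mean(with_person) - mean(without_person))

n = 1000
shift = worst_case_shift(n)
bound = 120.0 / n
print(shift, bound)  # the shift should match the 120/n bound
```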
The Laplace mechanism implements ε-differential privacy by adding noise
with a Laplace(Δ/ε) distribution, which in our example means
Laplace(120/nε).
A 95% confidence interval for a Laplace distribution with scale b
centered at 0 is
[b log 0.05, –b log 0.05]
which is very nearly
[-3b, 3b].
In our case b = 120/nε, and so a 95% confidence interval for the
noise we add would be [-360/nε, 360/nε].
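
The interval above is easy to check numerically. The following is a
sketch under the same assumptions as the text (average-age query, ages
capped at 120); the helper names laplace_interval and noise_scale are
made up for illustration.

```python
import math

def laplace_interval(scale, coverage=0.95):
    """Central interval containing `coverage` of a Laplace(0, scale) distribution."""
    # P(|X| > t) = exp(-t / scale), so solve exp(-t / scale) = 1 - coverage.
    half_width = -scale * math.log(1.0 - coverage)
    return (-half_width, half_width)

def noise_scale(n, epsilon, max_age=120.0):
    """Laplace scale b = sensitivity / epsilon for the mean-age query."""
    return (max_age / n) / epsilon

factor = -math.log(0.05)  # the multiplier that the text rounds to 3
lo, hi = laplace_interval(noise_scale(n=1000, epsilon=1.0))
print(factor, lo, hi)
```

The exact multiplier is slightly less than 3, so the intervals quoted
here are very slightly conservative.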
When n = 1000 and ε = 1, this means we’re adding noise that’s usually
between -0.36 and 0.36, i.e. we know the average age to within about 4
months. But if n = 1, our confidence interval is the true age ± 360.
Since this is wider than the a priori bounds of [0, 120], we’d
truncate our answer to be between 0 and 120. So we could query for the
age of an individual, but we’d learn nothing.
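
Here is a rough simulation of both cases, using only the Python standard
library. The ages are made-up synthetic data, and private_mean_age is a
hypothetical helper written for this sketch rather than the API of any
differential privacy library. The Laplace noise is generated as the
difference of two independent exponential samples.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two independent exponentials with mean `scale`
    # has a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_mean_age(ages, epsilon, rng, max_age=120.0):
    """Mean age plus Laplace noise, clamped to the a priori range [0, max_age]."""
    n = len(ages)
    scale = (max_age / n) / epsilon  # b = 120/(n epsilon), as in the text
    noisy = sum(ages) / n + laplace_noise(scale, rng)
    return min(max(noisy, 0.0), max_age)

rng = random.Random(0)
ages = [rng.uniform(0, 90) for _ in range(1000)]  # synthetic ages, n = 1000
true_mean = sum(ages) / len(ages)

# With n = 1000 and epsilon = 1, about 95% of answers land within 0.36 years.
errors = [private_mean_age(ages, 1.0, rng) - true_mean for _ in range(10000)]
share_within = sum(abs(e) <= 0.36 for e in errors) / len(errors)
print(share_within)

# With n = 1 the noise swamps the answer; results are clamped to [0, 120].
single = [private_mean_age([40.0], 1.0, rng) for _ in range(5)]
print(single)
```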
When n = 1, the width of our confidence interval is 720/ε, and so to get a
confidence interval one year wide, as we get with truncation, we would
set ε = 720. Ordinarily ε is much smaller than 720 in application, say
between 1 and 10, which means differential privacy reveals far less
information than truncation does.
Even if you truncate age to decade rather than year, this still
reveals more information than differential privacy provided ε < 72.
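
The break-even values of ε can be computed directly. This sketch assumes
a single record (n = 1) by default and uses the exact factor -log 0.05
rather than the rounded value 3, so the results come out just under 720
and 72. The function name epsilon_for_width is an illustrative
assumption.

```python
import math

def epsilon_for_width(width_years, n=1, max_age=120.0, coverage=0.95):
    """Epsilon at which the Laplace noise interval has the given total width."""
    # Width = 2 * (-b log(1 - coverage)) with b = max_age / (n * epsilon).
    return 2.0 * max_age * -math.log(1.0 - coverage) / (n * width_years)

eps_year = epsilon_for_width(1.0)     # match truncation to year
eps_decade = epsilon_for_width(10.0)  # match truncation to decade
print(eps_year, eps_decade)
```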
Note: ordinarily even the number of records in a database would be kept
private, but we’ll assume here that for some reason we know the number
of rows a priori.