Never Walk Alone: Uncertainty for Anonymity in Moving Objects Databases
Osman Abul, Francesco Bonchi and Mirco Nanni
In The 24th International Conference on Data Engineering (ICDE 2008)
Preserving individual privacy when publishing data is
a problem that is receiving increasing attention. According
to the k-anonymity principle, each release of data must be
such that each individual is indistinguishable from at least
k - 1 other individuals. In this paper we study the problem
of anonymity preserving data publishing in moving objects
databases. We propose a novel concept of k-anonymity
based on co-localization that exploits the inherent uncertainty
of the moving object's whereabouts. Due to sampling
and positioning systems (e.g., GPS) imprecision, the trajectory
of a moving object is no longer a polyline in a threedimensional
space, instead it is a cylindrical volume, where
its radius δ represents the possible location imprecision: we
know that the trajectory of the moving object is within this
cylinder, but we do not know exactly where. If another object
moves within the same cylinder they are indistinguishable
from each other. This leads to the definition of (k,δ)-
anonymity for moving objects databases.
We first characterize the (k,δ)-anonymity problem and
discuss techniques to solve it. Then we focus on the most
promising technique by the point of view of information
preservation, namely space translation. We develop a suitable
measure of the information distortion introduced by
space translation, and we prove that the problem of achieving
(k,δ)-anonymity by space translation with minimum
distortion is NP-hard. Faced with the hardness of our problem
we propose a greedy algorithm based on clustering and
enhanced with ad hoc pre-processing and outlier removal
techniques. The resulting method, named NWA (Never
Walk Alone), is empirically evaluated in terms of data
quality and efficiency.
Data quality is assessed both by means of objective measures
of information distortion, and by comparing the results
of the same spatio-temporal range queries executed
on the original database and on the (k,δ)-anonymized one.
Experimental results show that for a wide range of values of
δ and k, the relative error introduced is kept low, confirming
that NWA produces high quality (k,δ)-anonymized data.
Input and output datasets have the same format: an ASCII file containing 4 TAB-separated columns.
The columns correspond to:
- Trajectory ID (integer),
- Timestamp (integer),
- X coordinate (float),
- Y coordinate (float).
The
input file has to meet the following constraints:
- entries with the same ID should be adjacent (not necessarily sorted by time);
- the (integer) timestamps that describe a trajectory have to span over a continuous interval, i.e., they should form a unique time interval without gaps (e.g., 2,3,4,5 is OK, while 1,2,4,5 is not).
Here is an example of input file, containing two trajectories (having ID=501 and ID=504) resp. with 6 and 4 points:
501 5 11401.719 24961.338
501 2 11581.854 24691.925
501 0 11593.253 24736.460
501 1 11611.838 24722.498
501 3 11470.533 24822.748
501 4 11444.096 24922.036
504 4 11450.540 24785.989
504 2 11570.155 24705.881
504 3 11493.856 24732.754
504 5 11398.027 24812.087
(Notice that trajectory 501 has timestamps 0,1,2,3,4,5 and trajectory 504 has 2,3,4,5, both of them forming intervals without gaps).
To see the usage syntax and a short description of all options, invoke the program without parameters.
When it is launched with the proper parameters, the software will print on the standard output the status of the computation and some statistics of the dataset.
Moreover, if the option
-stats is specified, at the end of the computation a single line of text is printed on the standard error, containing statistics collected through the computation. In particular, that line of text contains, in the order, the following information:
- input_file, K, delta, pi, delta_max, trash_max: input parameters (see the paper for further details)
- N. trajectories that "survived" after the preprocessing
- N. trajectories deleted in the preprocessing
- N. points that "survived" after the preprocessing
- N. points deleted in the preprocessing
- Diameter of the spatial bounding-box of the dataset
- Largest equivalence class
- Longest trajectory as n. of points
- N. of equivalence classes
- N. Equivalvence classes removed becose too small
- Trash size
- Trash size expressed as deleted points
- N. trajectories that underwent spatial translation
- N. translated points
- TDC metrics
- Largest cluster radius outputted by NWA procedure
- Max point translation
- DM metrics
- ID metrics
- CM metrics
- Total user execution time
- Total system execution time
- Total execution time