Algorithms for synthetic data release under differential privacy

Bibliographic Details
Main Author: Zhang, Jun
Other Authors: Xiao Xiaokui
Format: Theses and Dissertations
Language: English
Published: 2016
Subjects: DRNTU::Engineering::Computer science and engineering::Information systems::Database management
Online Access:https://hdl.handle.net/10356/69204
Institution: Nanyang Technological University
Language: English
id sg-ntu-dr.10356-69204
record_format dspace
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic DRNTU::Engineering::Computer science and engineering::Information systems::Database management
description Releasing sensitive data while preserving privacy is an important problem that has attracted considerable attention in recent years. The state-of-the-art paradigm for addressing this problem is differential privacy, which offers a strong degree of privacy protection without making restrictive assumptions about the adversary. However, most existing approaches to differentially private data release are mired in complexity, overwhelm the signal with noise, and are not effective in practice. In this thesis, we introduce three novel solutions for complex data publication under differential privacy: PrivBayes, PrivTree and the ladder framework. Compared with previous work, our methods (i) enable the private release of a wide range of data types, namely multi-dimensional tabular data, spatial data, sequence data and graph data, (ii) improve the utility of the released data by introducing significantly less perturbation during data modelling, and (iii) are query-independent, so that many different queries (linear or non-linear) can be evaluated accurately on the same released data.

First, we present PrivBayes, a differentially private method for releasing multi-dimensional tabular data. Given a dataset D, PrivBayes first constructs a Bayesian network N, which (i) provides a succinct model of the correlations among the attributes in D and (ii) allows us to approximate the distribution of data in D using a set P of low-dimensional marginals of D. PrivBayes then injects noise into each marginal in P to ensure differential privacy, and uses the noisy marginals and the Bayesian network to construct an approximation of the data distribution in D. Finally, PrivBayes samples tuples from the approximate distribution to build a synthetic dataset, which it releases. Intuitively, PrivBayes circumvents the curse of dimensionality, since it injects noise into the low-dimensional marginals in P rather than into the full-dimensional dataset D. Constructing the Bayesian network privately turns out to be highly challenging, and we introduce a novel approach that uses a surrogate function for mutual information to build the model more accurately. We evaluate PrivBayes experimentally on real data and demonstrate that it significantly outperforms existing solutions in terms of accuracy.
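As a concrete illustration of the release step, the following minimal Python sketch perturbs a set of low-dimensional marginals with Laplace noise and then samples synthetic tuples along a Bayesian network. It assumes the network structure is already given (PrivBayes learns it privately, which the thesis identifies as the hard part), splits the privacy budget evenly across the marginals, and uses illustrative names (noisy_marginal, synthesize) that are not taken from the thesis.

import numpy as np

rng = np.random.default_rng(0)

def noisy_marginal(data, cols, eps, levels):
    # Contingency table over the attributes in `cols`, perturbed with Laplace
    # noise (scale 1/eps; sensitivity 1 under add/remove-one-record
    # neighbouring databases), clipped and normalised to a distribution.
    shape = tuple(levels[c] for c in cols)
    counts = np.zeros(shape)
    for row in data:
        counts[tuple(int(row[c]) for c in cols)] += 1
    counts += rng.laplace(scale=1.0 / eps, size=shape)
    counts = np.clip(counts, 0.0, None)
    total = counts.sum()
    return counts / total if total > 0 else np.full(shape, 1.0 / counts.size)

def synthesize(data, network, levels, eps, n_out):
    # `network` is a list of (attribute, parents) pairs in topological order;
    # the structure is assumed given here.  The budget eps is split evenly
    # across the k marginals (sequential composition).
    k = len(network)
    marginals = {a: noisy_marginal(data, (*parents, a), eps / k, levels)
                 for a, parents in network}
    out = np.zeros((n_out, len(levels)), dtype=int)
    for i in range(n_out):
        for a, parents in network:
            cond = marginals[a][tuple(out[i, list(parents)])]  # slice on parent values
            s = cond.sum()
            probs = cond / s if s > 0 else np.full(levels[a], 1.0 / levels[a])
            out[i, a] = rng.choice(levels[a], p=probs)
    return out

# Toy usage: three binary attributes with the chain network 0 -> 1 -> 2.
levels = [2, 2, 2]
data = rng.integers(0, 2, size=(1000, 3))
network = [(0, ()), (1, (0,)), (2, (1,))]
synthetic = synthesize(data, network, levels, eps=1.0, n_out=1000)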
Second, we introduce PrivTree, a differentially private algorithm for releasing spatial and sequential datasets. Given a set D of tuples defined on a domain Omega, PrivTree constructs a histogram over Omega to approximate the tuple distribution in D. It adopts a hierarchical decomposition approach, which recursively splits Omega into sub-domains and computes a noisy tuple count for each sub-domain, until all noisy counts fall below a certain threshold. Previous approaches based on hierarchical decomposition require that we (i) impose a limit h on the recursion depth when splitting Omega and (ii) set the noise in each count to be proportional to h. The choice of h poses a serious dilemma: a small h makes the resulting histogram too coarse-grained, while a large h leads to excessive noise in the tuple counts used to decide whether sub-domains should be split. PrivTree eliminates the dependency on a pre-defined h entirely. The key is a novel mechanism that (i) exploits a new analysis of the Laplace distribution and (ii) allows us to use only a constant amount of noise when deciding whether a sub-domain should be split, regardless of the recursion depth. We demonstrate the application of PrivTree to modelling spatial data, and show that it can be extended to handle sequence data (where the splitting decision is based not on tuple counts but on a more sophisticated measure). Our experiments on a variety of real datasets show that PrivTree considerably outperforms the state of the art in terms of data utility.

Last, we investigate the problem of protecting the privacy of individuals in graph-structured data, and propose a differentially private solution, the ladder framework. It specifies a probability distribution over possible outputs that is carefully defined to maximize utility for the given input while still providing the required privacy level. The distribution is designed to form a 'ladder', so that each output occupies a higher 'rung' (larger probability) than every less preferable output. We show how the ladder framework can be applied to counting the occurrences of subgraphs, a vital task in graph analysis, and give algorithms whose cost is comparable to that of computing the count exactly. Our experimental study confirms that our method outperforms existing methods for counting triangles and stars in terms of accuracy, and provides solutions for some problems for which no effective method was previously known. The results of our algorithms can be used to estimate the parameters of suitable graph models, allowing synthetic graphs to be sampled and released.
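To make the shape of such a ladder distribution concrete, the following minimal Python sketch samples an integer whose probability decreases geometrically with its distance from the true count, so the true count sits on the top rung. It instantiates the idea with the standard two-sided geometric mechanism and a global sensitivity parameter; the thesis's ladder framework instead tailors the rungs to the given input to maximize utility, which this sketch does not attempt. The function name ladder_release and its parameters are illustrative only.

import math
import random

def ladder_release(true_count, eps, sensitivity=1.0, rng=None):
    # Sample an integer output whose probability decays geometrically with
    # its distance from the true count (a 'ladder'-shaped distribution).
    # This is the classic two-sided geometric mechanism, eps-differentially
    # private for integer queries of the given global sensitivity; the
    # thesis's ladder framework builds its rungs from the specific input
    # graph to achieve much better utility for subgraph counts.
    rng = rng or random.Random()
    alpha = math.exp(-eps / sensitivity)          # rung-to-rung decay factor
    if rng.random() < (1.0 - alpha) / (1.0 + alpha):
        offset = 0                                # stay on the top rung
    else:
        sign = 1 if rng.random() < 0.5 else -1
        u = 1.0 - rng.random()                    # u in (0, 1]
        magnitude = 1 + math.floor(math.log(u) / math.log(alpha))
        offset = sign * magnitude
    return true_count + offset

# Toy usage: release a (hypothetical) triangle count of 42 with eps = 0.5.
print(ladder_release(42, eps=0.5))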
author2 Xiao Xiaokui
format Theses and Dissertations
author Zhang, Jun
title Algorithms for synthetic data release under differential privacy
publishDate 2016
url https://hdl.handle.net/10356/69204
school School of Computer Engineering
degree DOCTOR OF PHILOSOPHY (SCE)
date_available 2016-11-26T01:48:59Z
type Thesis
citation Zhang, J. (2016). Algorithms for synthetic data release under differential privacy. Doctoral thesis, Nanyang Technological University, Singapore.
doi 10.32657/10356/69204
extent 159 p.
mimetype application/pdf