Algorithms for synthetic data release under differential privacy

Bibliographic Details
Main Author: Zhang, Jun
Other Authors: Xiao Xiaokui
Format: Theses and Dissertations
Language: English
Published: 2016
Subjects: DRNTU::Engineering::Computer science and engineering::Information systems::Database management
Online Access:https://hdl.handle.net/10356/69204
Institution: Nanyang Technological University
Language: English
id sg-ntu-dr.10356-69204
record_format dspace
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic DRNTU::Engineering::Computer science and engineering::Information systems::Database management
description Releasing sensitive data while preserving privacy is an important problem that has attracted considerable attention in recent years. The state-of-the-art paradigm for addressing this problem is differential privacy, which offers a strong degree of privacy protection without making restrictive assumptions about the adversary. However, most existing approaches to differentially private data release are mired in complexity, overwhelm the signal with noise, and are not effective in practice. In this thesis, we introduce three novel solutions for complex data publication under differential privacy: PrivBayes, PrivTree and the ladder framework. Compared with previous work, our methods (i) enable the private release of a wide range of data types, namely multi-dimensional tabular data, spatial data, sequence data and graph data, (ii) improve the utility of the released data by introducing significantly less perturbation during data modelling, and (iii) are query-independent, so that many different queries (linear or non-linear) can be evaluated accurately on the same released data.

First, we present PrivBayes, a differentially private method for releasing multi-dimensional tabular data. Given a dataset D, PrivBayes first constructs a Bayesian network N, which (i) provides a succinct model of the correlations among the attributes in D and (ii) allows us to approximate the distribution of data in D using a set P of low-dimensional marginals of D. PrivBayes then injects noise into each marginal in P to ensure differential privacy, and uses the noisy marginals and the Bayesian network to construct an approximation of the data distribution in D. Finally, PrivBayes samples tuples from the approximate distribution to build a synthetic dataset, which it releases. Intuitively, PrivBayes circumvents the curse of dimensionality, since it injects noise into the low-dimensional marginals in P rather than into the full-dimensional dataset D. Constructing the Bayesian network privately turns out to be highly challenging, and we introduce a novel approach that uses a surrogate function for mutual information to build the model more accurately. We evaluate PrivBayes experimentally on real data and demonstrate that it significantly outperforms existing solutions in terms of accuracy.
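As a concrete illustration of the release step, the following minimal Python sketch perturbs a set of low-dimensional marginals with Laplace noise and then samples synthetic tuples along a Bayesian network. It assumes the network structure is already given (PrivBayes learns it privately, which the thesis identifies as the hard part), splits the privacy budget evenly across the marginals, and uses illustrative names (noisy_marginal, synthesize) that are not taken from the thesis.

import numpy as np

rng = np.random.default_rng(0)

def noisy_marginal(data, cols, eps, levels):
    # Contingency table over the attributes in `cols`, perturbed with Laplace
    # noise (scale 1/eps; sensitivity 1 under add/remove-one-record
    # neighbouring databases), clipped and normalised to a distribution.
    shape = tuple(levels[c] for c in cols)
    counts = np.zeros(shape)
    for row in data:
        counts[tuple(int(row[c]) for c in cols)] += 1
    counts += rng.laplace(scale=1.0 / eps, size=shape)
    counts = np.clip(counts, 0.0, None)
    total = counts.sum()
    return counts / total if total > 0 else np.full(shape, 1.0 / counts.size)

def synthesize(data, network, levels, eps, n_out):
    # `network` is a list of (attribute, parents) pairs in topological order;
    # the structure is assumed given here.  The budget eps is split evenly
    # across the k marginals (sequential composition).
    k = len(network)
    marginals = {a: noisy_marginal(data, (*parents, a), eps / k, levels)
                 for a, parents in network}
    out = np.zeros((n_out, len(levels)), dtype=int)
    for i in range(n_out):
        for a, parents in network:
            cond = marginals[a][tuple(out[i, list(parents)])]  # slice on parent values
            s = cond.sum()
            probs = cond / s if s > 0 else np.full(levels[a], 1.0 / levels[a])
            out[i, a] = rng.choice(levels[a], p=probs)
    return out

# Toy usage: three binary attributes with the chain network 0 -> 1 -> 2.
levels = [2, 2, 2]
data = rng.integers(0, 2, size=(1000, 3))
network = [(0, ()), (1, (0,)), (2, (1,))]
synthetic = synthesize(data, network, levels, eps=1.0, n_out=1000)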
Second, we introduce PrivTree, a differentially private algorithm for releasing spatial and sequential datasets. Given a set D of tuples defined on a domain Omega, PrivTree constructs a histogram over Omega to approximate the tuple distribution in D. It adopts a hierarchical decomposition approach, which recursively splits Omega into sub-domains and computes a noisy tuple count for each sub-domain, until all noisy counts fall below a certain threshold. Previous approaches based on hierarchical decomposition require that we (i) impose a limit h on the recursion depth when splitting Omega and (ii) set the noise in each count to be proportional to h. The choice of h poses a serious dilemma: a small h makes the resulting histogram too coarse-grained, while a large h leads to excessive noise in the tuple counts used to decide whether sub-domains should be split. PrivTree eliminates the dependency on a pre-defined h entirely. The key is a novel mechanism that (i) exploits a new analysis of the Laplace distribution and (ii) allows us to use only a constant amount of noise when deciding whether a sub-domain should be split, regardless of the recursion depth. We demonstrate the application of PrivTree to modelling spatial data, and show that it can be extended to handle sequence data (where the splitting decision is based not on tuple counts but on a more sophisticated measure). Our experiments on a variety of real datasets show that PrivTree considerably outperforms the state of the art in terms of data utility.

Last, we investigate the problem of protecting the privacy of individuals in graph-structured data, and propose a differentially private solution, the ladder framework. It specifies a probability distribution over possible outputs that is carefully defined to maximize utility for the given input while still providing the required privacy level. The distribution is designed to form a 'ladder', so that each output occupies a higher 'rung' (larger probability) than every less preferable output. We show how the ladder framework can be applied to counting the occurrences of subgraphs, a vital task in graph analysis, and give algorithms whose cost is comparable to that of computing the count exactly. Our experimental study confirms that our method outperforms existing methods for counting triangles and stars in terms of accuracy, and provides solutions for some problems for which no effective method was previously known. The results of our algorithms can be used to estimate the parameters of suitable graph models, allowing synthetic graphs to be sampled and released.
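To make the shape of such a ladder distribution concrete, the following minimal Python sketch samples an integer whose probability decreases geometrically with its distance from the true count, so the true count sits on the top rung. It instantiates the idea with the standard two-sided geometric mechanism and a global sensitivity parameter; the thesis's ladder framework instead tailors the rungs to the given input to maximize utility, which this sketch does not attempt. The function name ladder_release and its parameters are illustrative only.

import math
import random

def ladder_release(true_count, eps, sensitivity=1.0, rng=None):
    # Sample an integer output whose probability decays geometrically with
    # its distance from the true count (a 'ladder'-shaped distribution).
    # This is the classic two-sided geometric mechanism, eps-differentially
    # private for integer queries of the given global sensitivity; the
    # thesis's ladder framework builds its rungs from the specific input
    # graph to achieve much better utility for subgraph counts.
    rng = rng or random.Random()
    alpha = math.exp(-eps / sensitivity)          # rung-to-rung decay factor
    if rng.random() < (1.0 - alpha) / (1.0 + alpha):
        offset = 0                                # stay on the top rung
    else:
        sign = 1 if rng.random() < 0.5 else -1
        u = 1.0 - rng.random()                    # u in (0, 1]
        magnitude = 1 + math.floor(math.log(u) / math.log(alpha))
        offset = sign * magnitude
    return true_count + offset

# Toy usage: release a (hypothetical) triangle count of 42 with eps = 0.5.
print(ladder_release(42, eps=0.5))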
author2 Xiao Xiaokui
format Theses and Dissertations
author Zhang, Jun
title Algorithms for synthetic data release under differential privacy
publishDate 2016
url https://hdl.handle.net/10356/69204
school School of Computer Engineering
degree DOCTOR OF PHILOSOPHY (SCE)
date_available 2016-11-26T01:48:59Z
type Thesis
citation Zhang, J. (2016). Algorithms for synthetic data release under differential privacy. Doctoral thesis, Nanyang Technological University, Singapore.
doi 10.32657/10356/69204
extent 159 p.
mimetype application/pdf