K-means clustering with local dᵪ-privacy for privacy-preserving data analysis
Privacy-preserving data analysis is an emerging area that addresses the dilemma of performing data analysis on user data while protecting users' privacy. In this paper, we consider the problem of constructing privacy-preserving K -means clustering protocol for data analysis that provides local...
Saved in:
Main Authors: | , , |
---|---|
Other Authors: | |
Format: | Article |
Language: | English |
Published: |
2023
|
Subjects: | |
Online Access: | https://hdl.handle.net/10356/168038 |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Institution: | Nanyang Technological University |
Language: | English |
Summary: | Privacy-preserving data analysis is an emerging area that addresses the dilemma of performing data analysis on user data while protecting users' privacy. In this paper, we consider the problem of constructing privacy-preserving K -means clustering protocol for data analysis that provides local privacy to users' data. To enable a desirable degree of local privacy guarantee while maintaining high accuracy of the clustering, we adopt a generalized differential privacy definition, dχ -privacy, which quantifies the distinguishability level based on the distance between data records defined by the distance function dχ. In our work, we consider the space of data points as a metric space imbued with Euclidean distance and propose a bounded perturbation mechanism (BPM) with bounded sampling space of the perturbed data points, which is formally shown to achieve dχ -privacy. BPM perturbs the data as a whole instead of treating each dimension independently, which is desirable since the privacy budget is no longer required to be split among different dimensions. Bounded output space also means that we will not get into the case where the report or the statistical result is so far out of the data domain that it is hard to interpret. Furthermore, it can also help in limiting the amount of bandwidth needed to send such report to the server. The design of BPM is based on a probability density function which decreases exponentially as the Euclidean distance with respect to the true value grows. It is also designed with the aim of ensuring that BPM produces perturbed data that provides the claimed privacy guarantee while ensuring high utility response. To guarantee the efficiency of the perturbation method, we propose an efficient algorithm to sample from the proposed distribution and apply BPM to the design of dχ -private K -means clustering algorithms. Lastly, we analyse the privacy and utility guarantee provided by the proposed method and provide its experimental results. |
---|