Privacy preserving association rule mining

Bibliographic Details
Main Author: Suruchi Sharma
Other Authors: Ng Wee Keong
Format: Final Year Project
Language: English
Published: 2009
Online Access: http://hdl.handle.net/10356/16919
Institution: Nanyang Technological University
Description
Summary: As technology advances, the amount of data generated keeps increasing, creating a need for data mining technologies that can find valid patterns and relationships in large data sets. With this dramatic increase in data and the growing popularity of data mining, privacy preservation has become a major concern. In this report, I aim to understand privacy preserving mining of association rules and to compare and contrast two randomization approaches to privacy preservation, namely cut‐and‐paste randomization and MASK. First, I looked at the process of data mining and its main classes of tasks, such as clustering, classification, prediction and association rule mining. I then examined association rule mining in greater detail and described the Apriori algorithm for finding frequent itemsets. Following this, I studied the techniques used by the cut‐and‐paste randomization operator and the MASK scheme to protect the privacy of the data being used while still accurately mining frequent itemsets from a set of randomized transactions. I implemented cut‐and‐paste and MASK in Java, using a client‐server architecture for communication, in order to investigate their accuracy while maintaining privacy. I conducted several experiments on the two schemes and found that, at a 50% privacy level, cut‐and‐paste randomization performed slightly better than MASK. However, since the difference in the results was small, I concluded that both schemes performed comparably. I then pointed out certain limitations of the two schemes and explained the conditions under which they perform well.