Frequent Lexicographic Algorithm for Mining Association Rules
Main Author: | |
---|---|
Format: | Thesis |
Language: | English |
Published: | 2005 |
Subjects: | |
Online Access: | http://psasir.upm.edu.my/id/eprint/5857/1/FSKTM_2005_9%20IR.pdf http://psasir.upm.edu.my/id/eprint/5857/ |
Institution: | Universiti Putra Malaysia |
Summary: | Recent progress in computer storage technology has enabled many organisations to collect and store huge amounts of data, which has led to a growing demand for new
techniques that can intelligently transform massive data into useful information and knowledge. The concept of data mining has drawn the attention of the business community
to techniques that can extract nontrivial, implicit, previously unknown and potentially useful information from databases. Association rule mining is a data mining technique that discovers strong associations or correlations among
data. Association rule algorithms follow a two-phase procedure. In the first phase, all frequent patterns are found; in the second phase, these
frequent patterns are used to generate all strong rules. The measures commonly used to complete these phases are support and confidence. Intensive investigation
over the past few years has shown that the first phase is the major computational task. Although the second phase seems more straightforward,
it can be costly because the number of generated rules is normally large, whereas only a small fraction of these rules is typically useful and important. In response to these challenges, this study is devoted to finding faster methods for searching
frequent patterns and for discovering association rules in concise form. An algorithm called Flex (Frequent lexicographic patterns) is proposed to achieve good performance in searching for frequent patterns. The algorithm constructs the nodes of a lexicographic tree that represent frequent patterns. A depth-first
strategy is used to mine the frequent patterns, and a vertical counting strategy is used to compute their support. The mined frequent patterns are then used to generate association rules. Three models
were applied in this task: the traditional model, the constraint model and the representative model, which produce, respectively, all association
rules, association rules with 1-consequence, and representative rules. As an additional
utility in the representative model, this study proposes a set-theoretical intersection to
assist users in finding duplicated rules.
Experiments were conducted on four datasets, all taken from the UCI machine learning repositories and domain theories except the
pumsb dataset. The Flex algorithm and two existing
algorithms, Apriori and DIC, were tested on these datasets under the same specification,
and their extraction times for mining frequent patterns were recorded and compared. The
experimental results showed that the proposed algorithm outperformed both existing algorithms, especially in the case of long patterns. It also gave promising results in the
case of short patterns. Two of the datasets were then chosen for a further experiment on
the scalability of the algorithms, in which the number of transactions was increased up to six times.
The scale-up experiment showed that the proposed algorithm is more scalable than the other two algorithms.
The implementation of an adopted theory of the representative model showed that this model is more concise than the other two models, as indicated by the number of rules
generated by each model. Besides yielding a small set of rules, the representative model also has the lossless-information and soundness properties,
meaning that it covers all interesting association rules and forbids the derivation of weak
rules. It is theoretically proven that the proposed set-theoretical intersection is able to
assist users in identifying the duplicated rules that exist in the representative model. |
---|
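The two-phase procedure described in the summary can be illustrated with a minimal Python sketch. It is not the thesis's Flex implementation, whose details are not given in this record. It only shows the general ideas the summary names: a depth-first walk over a lexicographic tree of patterns, support counted from a vertical layout (one set of transaction ids per item), and rule generation restricted to a single-item consequent. The function names, the use of an absolute support count, and the toy data in the usage note are assumptions made for illustration.

```python
def vertical_layout(transactions):
    """Map each item to the set of transaction ids (tid-set) that contain it."""
    tidsets = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets


def mine_frequent(transactions, min_support):
    """Phase 1 (sketch): depth-first search of a lexicographic tree.

    min_support is an absolute transaction count (an assumption of this sketch).
    Returns a dict mapping each frequent pattern (sorted tuple) to its support.
    """
    tidsets = vertical_layout(transactions)
    frequent = {}

    def extend(prefix, prefix_tids, candidates):
        for i, item in enumerate(candidates):
            # Vertical counting: the support is the size of the intersected tid-set.
            tids = tidsets[item] if prefix_tids is None else prefix_tids & tidsets[item]
            if len(tids) >= min_support:
                pattern = prefix + (item,)
                frequent[pattern] = len(tids)
                # Children extend the pattern only with lexicographically later items.
                extend(pattern, tids, candidates[i + 1:])

    extend((), None, sorted(tidsets))
    return frequent


def rules_with_single_consequent(frequent, min_confidence):
    """Phase 2 (sketch): rules whose consequent is one item.

    confidence(A -> c) = support(A + c) / support(A).
    """
    rules = []
    for pattern, support in frequent.items():
        if len(pattern) < 2:
            continue
        for item in pattern:
            antecedent = tuple(x for x in pattern if x != item)
            # Every subset of a frequent pattern is frequent, so the lookup succeeds.
            confidence = support / frequent[antecedent]
            if confidence >= min_confidence:
                rules.append((antecedent, item, confidence))
    return rules
```

As a usage example, calling `mine_frequent([{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}], 2)` on this hypothetical toy dataset yields the three single items and the three pairs as frequent patterns, while the triple is pruned because only one transaction contains it.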