Efficient Singapore room rental search with data mining
Main Author: | |
---|---|
Other Authors: | |
Format: | Theses and Dissertations |
Language: | English |
Published: | 2014 |
Subjects: | |
Online Access: | http://hdl.handle.net/10356/61615 |
Institution: | Nanyang Technological University |
Summary: | The author asks how data mining techniques can be used to improve the efficiency of room rental search. The first objective of this study is to develop a clustering method for Singapore room rental listing retrieval, called Relevance-based Clustering. The proposed method adds geographical relationships to search results ranked by textual relevance.
The second objective is to develop a Rental Property Search Engine that demonstrates how Relevance-based Clustering enables efficient room rental search in Singapore. An essential part of this process is extracting geographical information from webpages. The author narrows the scope of the study to Singapore property websites, where geographical information can be easily extracted from the map latitude and longitude available on all of the major property websites in Singapore.
The rental property search engine is custom-coded by the author in Python 2.7 and deployed on the Google App Engine (GAE) cloud hosting platform.
The search engine includes a property content web crawler that crawls the rental sections of Singapore property websites and downloads the content of each URL into the Listing table. Next, a Data Pre-processing step cleanses and tokenizes the downloaded content to create and update the Inverted Index. Processed URLs are recorded in the Done-Process table to prevent duplicate effort.
When a user query is received, the Query Parsing process cleanses and tokenizes the query text and passes it to the Scoring and Ranking process, which converts it into vector form for Cosine Similarity score computation. The scores are ranked, and the top K listings form the Top K List.
The Top K List is used to compute the URL Spherical Distance Matrix, and clustering is performed on this matrix to discover geographical relationships among the top K textually relevant listings. The clustered result is converted into HTML and returned to the user.
The Information Retrieval (IR) effectiveness of the search engine is low at K = 100, with an average F-Measure of 26%, whereas at K = 20 it is better, with an average F-Measure of 78%. |
---|
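The summary describes tokenizing listings into an Inverted Index and ranking them against a query by Cosine Similarity to form the Top K List. The following is a minimal sketch of that idea in Python 3, not the thesis code: the function names (`tokenize`, `build_inverted_index`, `top_k`), the TF-IDF weighting, and the sample listings are all assumptions made for illustration, since the record does not state the exact term weighting used.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(listings):
    """Map each term to {url: term frequency} for a {url: text} dict."""
    index = defaultdict(dict)
    for url, text in listings.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][url] = tf
    return index


def top_k(query, index, n_docs, k):
    """Rank listings by TF-IDF cosine similarity to the query (assumed weighting)."""
    # Smoothed inverse document frequency for every indexed term.
    idf = {t: math.log(n_docs / len(p)) + 1.0 for t, p in index.items()}

    # Squared vector length of every listing, needed for normalisation.
    doc_norm_sq = defaultdict(float)
    for term, postings in index.items():
        for url, tf in postings.items():
            doc_norm_sq[url] += (tf * idf[term]) ** 2

    # Query vector: only terms that exist in the index can contribute.
    q_tf = Counter(t for t in tokenize(query) if t in index)
    q_weights = {t: tf * idf[t] for t, tf in q_tf.items()}
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))
    if q_norm == 0:
        return []

    # Accumulate dot products by walking the postings of the query terms.
    dots = defaultdict(float)
    for term, q_weight in q_weights.items():
        for url, tf in index[term].items():
            dots[url] += q_weight * tf * idf[term]

    scores = {
        url: dot / (q_norm * math.sqrt(doc_norm_sq[url]))
        for url, dot in dots.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Hypothetical listings keyed by a short stand-in for the URL.
listings = {
    "a": "common room near Jurong East MRT",
    "b": "master room in Tampines",
    "c": "common room Tampines near MRT",
}
index = build_inverted_index(listings)
ranked = top_k("common room Tampines", index, len(listings), k=2)
```

Walking only the postings of the query terms means listings that share no term with the query are never scored, which is the usual reason to keep an inverted index rather than comparing the query against every document.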
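The summary mentions a URL Spherical Distance Matrix computed over the Top K List and a clustering step applied to it. The sketch below (Python 3) shows one way this could look, under stated assumptions: the haversine formula stands in for the spherical distance, and a simple distance-threshold grouping stands in for the clustering algorithm, which the record does not name. The function names, the 2 km threshold, and the coordinates are illustrative only.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def distance_matrix(coords):
    """Pairwise spherical distances between all listings."""
    n = len(coords)
    return [[haversine_km(coords[i], coords[j]) for j in range(n)] for i in range(n)]


def cluster_by_threshold(matrix, max_km):
    """Assumed clustering: listings linked by hops of at most max_km share a label."""
    n = len(matrix)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:  # flood-fill over neighbours within the threshold
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and matrix[i][j] <= max_km:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Made-up coordinates: two points in the west and two in the east of Singapore.
coords = [(1.3331, 103.7423), (1.3340, 103.7430), (1.3530, 103.9450), (1.3545, 103.9440)]
matrix = distance_matrix(coords)
labels = cluster_by_threshold(matrix, max_km=2.0)
```

Because the matrix is built only from the top K listings, its size is K by K, so a smaller K keeps both the distance computation and the clustering cheap.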
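The effectiveness figures in the summary are reported as an average F-Measure. As a reminder of how such a score is typically derived for a single query, here is a small Python 3 sketch. It assumes the balanced F-Measure (the harmonic mean of precision and recall); the record does not say which variant the thesis uses, and the function name and sample sets are hypothetical.

```python
def f_measure(retrieved, relevant):
    """Balanced F-Measure (assumed F1) for one query, given two sets of URLs."""
    hits = len(retrieved & relevant)
    if hits == 0:
        return 0.0
    precision = hits / len(retrieved)  # share of returned listings that are relevant
    recall = hits / len(relevant)      # share of relevant listings that were returned
    return 2 * precision * recall / (precision + recall)


# Hypothetical judgement: 4 listings returned, 3 relevant overall, 2 in common.
score = f_measure({"a", "b", "c", "d"}, {"a", "b", "e"})
```

With a fixed set of relevant listings, returning more results can only lower precision once the relevant ones are exhausted, which is one plausible reading of why the larger K value scores worse.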