Web crawler for newspaper text

Bibliographic Details
Main Author: Phuah, Chee Chong
Other Authors: Chng Eng Siong
Format: Final Year Project
Language: English
Published: 2015
Online Access: http://hdl.handle.net/10356/62822
Institution: Nanyang Technological University
Description
Summary: There is a huge collection of news-related data available electronically today because of the World Wide Web. Web crawling provides an avenue for obtaining these data and for training language models that can be improved as more data are collected. However, every website is developed differently, and extracting specific parts of each website can require major rework of the web crawler. Many existing web crawlers do not facilitate crawling multiple websites, nor do they allow specific parts of a web page to be selected. The primary objective of this project is to develop a web crawler that can crawl multiple news websites with minimal modification whenever more websites are added. This is achieved by realising four software quality attributes: reusability, modularity, portability and scalability.

The web crawler is developed in Python with external libraries that improve the efficiency and performance of its crawling process. It is capable of crawling multiple news websites in multiple languages (e.g. English, Malay and Vietnamese) with selection policies unique to each website. The selection policies identify the specific links from which data are to be extracted and select the relevant content. Extracted data are stored in XML files with custom tags, so that regardless of how differently each website is developed, the extracted content is in a standardised format.

In addition to the web crawler, a Text Normalisation module was developed separately in this project to examine the quality of the extracted data. The module normalises text into a format that a language-modelling toolkit uses to train a language model on the normalised text. The same toolkit tests data against the trained language model, producing a perplexity value. The perplexity of each test set showed a similar pattern: as the date of the language model moves closer to the date of the test data, the perplexity gradually decreases. The overall perplexity is also lower when more data are used to train the language model. These results highlight the need for relevant and up-to-date data from news websites when training a news-domain language model.
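The record does not include the crawler's source code. As an illustration only, the sketch below shows one way the per-website selection policies described above might be structured, assuming the requests and BeautifulSoup libraries; the site name, CSS selectors and function names here are hypothetical, not taken from the project.

```python
# Illustrative sketch of per-site selection policies for a multi-site
# news crawler. The policy table, selectors and helper names are
# assumptions, not the project's actual code.
import requests
from bs4 import BeautifulSoup

# One selection policy per website: which links to follow and which
# page elements hold the article content. Adding a new site means
# adding an entry here rather than reworking the crawler itself.
SELECTION_POLICIES = {
    "example-news.com": {
        "link_selector": "a.article-link",     # links worth crawling
        "title_selector": "h1.headline",       # article title element
        "body_selector": "div.article-body p", # article text paragraphs
    },
}

def crawl_article(url, policy):
    """Fetch one article page and extract title/body per the policy."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    title = soup.select_one(policy["title_selector"])
    paragraphs = soup.select(policy["body_selector"])
    return {
        "url": url,
        "title": title.get_text(strip=True) if title else "",
        "body": "\n".join(p.get_text(strip=True) for p in paragraphs),
    }
```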
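The standardised XML output could then be produced along the lines of the following sketch, using Python's standard-library xml.etree.ElementTree. The tag names (article, title, body) are illustrative assumptions; the project's actual custom tag set is not listed in this record.

```python
# Sketch of the standardised XML storage described in the summary.
# The tag names are assumptions, not the project's documented tags.
import xml.etree.ElementTree as ET

def write_article_xml(article, out_path):
    """Serialise one extracted article into a uniform XML structure."""
    root = ET.Element("article", attrib={"url": article["url"]})
    ET.SubElement(root, "title").text = article["title"]
    ET.SubElement(root, "body").text = article["body"]
    ET.ElementTree(root).write(out_path, encoding="utf-8",
                               xml_declaration=True)
```

Because every site's content is mapped into the same structure at write time, downstream consumers such as the Text Normalisation module never need site-specific handling.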
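The record names neither the normalisation rules nor the language-modelling toolkit. As one plausible workflow under those assumptions, a normalisation pass might look like the sketch below, followed by n-gram training and perplexity evaluation with a toolkit such as SRILM (shown in comments; an assumption, not confirmed by the record).

```python
# Illustrative normalisation pass for the Text Normalisation module
# described above; the project's actual rules are not given here.
import re

def normalise(text):
    """Lowercase, strip punctuation and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)     # remove punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

# With an n-gram toolkit such as SRILM (assumed for illustration),
# training and perplexity testing on the normalised text would be:
#   ngram-count -order 3 -text train_normalised.txt -lm news.lm
#   ngram -lm news.lm -ppl test_normalised.txt
```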