Web crawler for newspaper text
There is a huge collection of news-related data available electronically today because of the World Wide Web. Web crawling has provided an avenue for those interested in obtaining these data and in training language models that can be improved as more data are collected. However, every website is...
Main Author: | Phuah, Chee Chong |
---|---|
Other Authors: | Chng Eng Siong |
Format: | Final Year Project |
Language: | English |
Published: | 2015 |
Subjects: | DRNTU::Engineering::Computer science and engineering |
Online Access: | http://hdl.handle.net/10356/62822 |
Institution: | Nanyang Technological University |
Language: | English |
id | sg-ntu-dr.10356-62822 |
---|---|
record_format | dspace |
spelling |
sg-ntu-dr.10356-628222023-03-03T20:24:31Z Web crawler for newspaper text Phuah, Chee Chong Chng Eng Siong School of Computer Engineering Emerging Research Lab DRNTU::Engineering::Computer science and engineering There is a huge collection of news-related data available electronically today because of the World Wide Web. Web crawling has provided an avenue for those interested in obtaining these data and in training language models that can be improved as more data are collected. However, every website is developed differently, and extracting specific parts of each website for information can result in major rework of the web crawler. Many existing web crawlers do not support crawling multiple websites, nor do they allow specific parts of a web page to be selected. The primary objective of this project is to develop a web crawler that can crawl multiple news websites with minimal modification whenever more websites need to be added. This is achieved by realising four software quality attributes: reusability, modularity, portability and scalability. The web crawler is developed in Python with external libraries that improve the efficiency and performance of its crawling process. The web crawler developed is capable of crawling multiple news websites in multiple languages (e.g. English, Malay and Vietnamese) with selection policies unique to each website. The selection policies are used to identify specific links (from which data are to be extracted) and to select content. Extracted data are also stored in XML files with custom tags so that, regardless of how differently each website is developed, the extracted content is in a standardised format. In addition to the web crawler, a Text Normalisation module was developed separately in this project to examine the quality of the extracted data. The Text Normalisation module normalises text into a format that a language modelling toolkit can use to train a language model based on the normalised text. The same toolkit is used to evaluate test data against the trained language model, which produces a perplexity value. The perplexity for each set of test data showed a similar pattern: as the date of the language model moves closer to the date of the test data, the perplexity gradually decreases. The overall perplexity is also lower when more data are used to train the language model. These results highlight the need for relevant and up-to-date data from news websites to train a news-domain language model. Bachelor of Engineering (Computer Science) 2015-04-29T07:44:28Z 2015-04-29T07:44:28Z 2015 2015 Final Year Project (FYP) http://hdl.handle.net/10356/62822 en Nanyang Technological University 57 p. application/pdf |
institution | Nanyang Technological University |
building | NTU Library |
continent | Asia |
country | Singapore Singapore |
content_provider | NTU Library |
collection | DR-NTU |
language | English |
topic | DRNTU::Engineering::Computer science and engineering |
spellingShingle | DRNTU::Engineering::Computer science and engineering Phuah, Chee Chong Web crawler for newspaper text |
description |
There is a huge collection of news-related data available electronically today because of the World Wide Web. Web crawling has provided an avenue for those interested in obtaining these data and in training language models that can be improved as more data are collected. However, every website is developed differently, and extracting specific parts of each website for information can result in major rework of the web crawler. Many existing web crawlers do not support crawling multiple websites, nor do they allow specific parts of a web page to be selected. The primary objective of this project is to develop a web crawler that can crawl multiple news websites with minimal modification whenever more websites need to be added. This is achieved by realising four software quality attributes: reusability, modularity, portability and scalability. The web crawler is developed in Python with external libraries that improve the efficiency and performance of its crawling process. The web crawler developed is capable of crawling multiple news websites in multiple languages (e.g. English, Malay and Vietnamese) with selection policies unique to each website. The selection policies are used to identify specific links (from which data are to be extracted) and to select content. Extracted data are also stored in XML files with custom tags so that, regardless of how differently each website is developed, the extracted content is in a standardised format. In addition to the web crawler, a Text Normalisation module was developed separately in this project to examine the quality of the extracted data. The Text Normalisation module normalises text into a format that a language modelling toolkit can use to train a language model based on the normalised text. The same toolkit is used to evaluate test data against the trained language model, which produces a perplexity value. The perplexity for each set of test data showed a similar pattern: as the date of the language model moves closer to the date of the test data, the perplexity gradually decreases. The overall perplexity is also lower when more data are used to train the language model. These results highlight the need for relevant and up-to-date data from news websites to train a news-domain language model. |
author2 | Chng Eng Siong |
author_facet | Chng Eng Siong Phuah, Chee Chong |
format | Final Year Project |
author | Phuah, Chee Chong |
author_sort | Phuah, Chee Chong |
title | Web crawler for newspaper text |
title_short | Web crawler for newspaper text |
title_full | Web crawler for newspaper text |
title_fullStr | Web crawler for newspaper text |
title_full_unstemmed | Web crawler for newspaper text |
title_sort | web crawler for newspaper text |
publishDate | 2015 |
url | http://hdl.handle.net/10356/62822 |
_version_ | 1759854587806744576 |
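The abstract describes per-website "selection policies" that identify article links, select page content, and write the result to XML with custom tags. The project's actual code is not available here, so the following is a minimal hypothetical sketch in Python (the project's language): the `POLICIES` table, the regex patterns, and the tag names are all illustrative assumptions, not the project's real configuration.

```python
# Hypothetical sketch of policy-driven extraction (names are illustrative):
# each site gets a link pattern and a content pattern, and every result is
# serialised to the same XML layout regardless of the source site's markup.
import re
import xml.etree.ElementTree as ET

# One selection policy per website (regexes stand in for real selectors).
POLICIES = {
    "example-news": {
        "link_pattern": re.compile(r'href="(/news/\d+[^"]*)"'),
        "content_pattern": re.compile(r"<article>(.*?)</article>", re.S),
    },
}

def extract(site, html):
    """Apply the site's selection policy; return a standardised XML string."""
    policy = POLICIES[site]
    links = policy["link_pattern"].findall(html)
    match = policy["content_pattern"].search(html)
    root = ET.Element("document", {"site": site})
    for link in links:
        ET.SubElement(root, "link").text = link
    ET.SubElement(root, "content").text = match.group(1).strip() if match else ""
    return ET.tostring(root, encoding="unicode")

sample = '<a href="/news/123-story">s</a><article>Breaking news text.</article>'
xml = extract("example-news", sample)
```

Adding a new website then only means adding one more policy entry, which is one plausible way to realise the reusability and scalability attributes the abstract mentions.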
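The abstract evaluates the crawled data by training a language model and measuring perplexity on held-out test text, with lower perplexity indicating a better match. As a generic illustration (not the project's toolkit, whose name the record does not give), perplexity for a unigram model is exp of the negative average log-probability of the test tokens; the add-one smoothing and the toy corpus below are assumptions for the sketch.

```python
# Minimal illustration of perplexity: PPL = exp(-(1/N) * sum(log p(w_i))).
# Lower PPL means the model predicts the test text better.
import math
from collections import Counter

def train_unigram(tokens, vocab):
    # Add-one (Laplace) smoothing so unseen words keep nonzero probability.
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def perplexity(model, tokens):
    log_prob = sum(math.log(model[w]) for w in tokens)
    return math.exp(-log_prob / len(tokens))

train = "the market rose today the market fell".split()
test = "the market rose".split()
vocab = set(train) | set(test)
model = train_unigram(train, vocab)
ppl = perplexity(model, test)
```

The trend the abstract reports follows from this definition: training text that is closer in date (and larger in volume) assigns higher probability to the test articles' vocabulary, so the average log-probability rises and the perplexity falls.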