Web crawler for newspaper text
There is a huge collection of news-related data available electronically today because of the World Wide Web. Web crawling has provided an avenue for those interested in obtaining these data and in training language models that can be improved as more data are collected. However, every website is...
Main Author: | Phuah, Chee Chong |
---|---|
Other Authors: | Chng Eng Siong |
Format: | Final Year Project |
Language: | English |
Published: | 2015 |
Subjects: | DRNTU::Engineering::Computer science and engineering |
Online Access: | http://hdl.handle.net/10356/62822 |
Institution: | Nanyang Technological University |
Language: | English |
id | sg-ntu-dr.10356-62822 |
---|---|
record_format | dspace |
spelling |
sg-ntu-dr.10356-628222023-03-03T20:24:31Z Web crawler for newspaper text Phuah, Chee Chong Chng Eng Siong School of Computer Engineering Emerging Research Lab DRNTU::Engineering::Computer science and engineering There is a huge collection of news-related data available electronically today because of the World Wide Web. Web crawling has provided an avenue for those interested in obtaining these data and in training language models that can be improved as more data are collected. However, every website is developed differently, and extracting specific parts of each website for information can result in major rework of the web crawler. Many existing web crawlers do not support crawling multiple websites, nor do they allow specific parts of a web page to be selected. The primary objective of this project is to develop a web crawler that can crawl multiple news websites with minimal modification whenever more websites need to be added. This is achieved by realising four software quality attributes: reusability, modularity, portability and scalability. The web crawler is developed in Python with external libraries that improve the efficiency and performance of its crawling process. The web crawler developed is capable of crawling multiple news websites in multiple languages (e.g. English, Malay and Vietnamese) with selection policies unique to each website. The selection policies are used to identify specific links (from which data are to be extracted) and to select content. Extracted data are also stored in XML files with custom tags so that, regardless of how differently each website is developed, the extracted content is in a standardised format. In addition to the web crawler, a Text Normalisation module was developed separately in this project to examine the quality of the extracted data. The Text Normalisation module normalises text into a format that a language modelling toolkit can use to train a language model based on the normalised text. The same toolkit is used to evaluate test data against the trained language model, which produces a perplexity value. The perplexity for each set of test data showed a similar pattern: as the date of the language model moves closer to the date of the test data, the perplexity gradually decreases. The overall perplexity is also lower when more data are used to train the language model. These results highlight the need for relevant and up-to-date data from news websites to train a news-domain language model. Bachelor of Engineering (Computer Science) 2015-04-29T07:44:28Z 2015-04-29T07:44:28Z 2015 2015 Final Year Project (FYP) http://hdl.handle.net/10356/62822 en Nanyang Technological University 57 p. application/pdf |
institution | Nanyang Technological University |
building | NTU Library |
continent | Asia |
country | Singapore Singapore |
content_provider | NTU Library |
collection | DR-NTU |
language | English |
topic | DRNTU::Engineering::Computer science and engineering |
spellingShingle | DRNTU::Engineering::Computer science and engineering Phuah, Chee Chong Web crawler for newspaper text |
description |
There is a huge collection of news-related data available electronically today because of the World Wide Web. Web crawling has provided an avenue for those interested in obtaining these data and in training language models that can be improved as more data are collected. However, every website is developed differently, and extracting specific parts of each website for information can result in major rework of the web crawler. Many existing web crawlers do not support crawling multiple websites, nor do they allow specific parts of a web page to be selected. The primary objective of this project is to develop a web crawler that can crawl multiple news websites with minimal modification whenever more websites need to be added. This is achieved by realising four software quality attributes: reusability, modularity, portability and scalability. The web crawler is developed in Python with external libraries that improve the efficiency and performance of its crawling process. The web crawler developed is capable of crawling multiple news websites in multiple languages (e.g. English, Malay and Vietnamese) with selection policies unique to each website. The selection policies are used to identify specific links (from which data are to be extracted) and to select content. Extracted data are also stored in XML files with custom tags so that, regardless of how differently each website is developed, the extracted content is in a standardised format. In addition to the web crawler, a Text Normalisation module was developed separately in this project to examine the quality of the extracted data. The Text Normalisation module normalises text into a format that a language modelling toolkit can use to train a language model based on the normalised text. The same toolkit is used to evaluate test data against the trained language model, which produces a perplexity value. The perplexity for each set of test data showed a similar pattern: as the date of the language model moves closer to the date of the test data, the perplexity gradually decreases. The overall perplexity is also lower when more data are used to train the language model. These results highlight the need for relevant and up-to-date data from news websites to train a news-domain language model. |
author2 | Chng Eng Siong |
author_facet | Chng Eng Siong Phuah, Chee Chong |
format | Final Year Project |
author | Phuah, Chee Chong |
author_sort | Phuah, Chee Chong |
title | Web crawler for newspaper text |
title_short | Web crawler for newspaper text |
title_full | Web crawler for newspaper text |
title_fullStr | Web crawler for newspaper text |
title_full_unstemmed | Web crawler for newspaper text |
title_sort | web crawler for newspaper text |
publishDate | 2015 |
url | http://hdl.handle.net/10356/62822 |
_version_ | 1759854587806744576 |
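The abstract describes per-website "selection policies" that identify article links, select page content, and write the result to XML with custom tags. The project's actual code is not available here, so the following is a minimal hypothetical sketch in Python (the project's language): the `POLICIES` table, the regex patterns, and the tag names are all illustrative assumptions, not the project's real configuration.

```python
# Hypothetical sketch of policy-driven extraction (names are illustrative):
# each site gets a link pattern and a content pattern, and every result is
# serialised to the same XML layout regardless of the source site's markup.
import re
import xml.etree.ElementTree as ET

# One selection policy per website (regexes stand in for real selectors).
POLICIES = {
    "example-news": {
        "link_pattern": re.compile(r'href="(/news/\d+[^"]*)"'),
        "content_pattern": re.compile(r"<article>(.*?)</article>", re.S),
    },
}

def extract(site, html):
    """Apply the site's selection policy; return a standardised XML string."""
    policy = POLICIES[site]
    links = policy["link_pattern"].findall(html)
    match = policy["content_pattern"].search(html)
    root = ET.Element("document", {"site": site})
    for link in links:
        ET.SubElement(root, "link").text = link
    ET.SubElement(root, "content").text = match.group(1).strip() if match else ""
    return ET.tostring(root, encoding="unicode")

sample = '<a href="/news/123-story">s</a><article>Breaking news text.</article>'
xml = extract("example-news", sample)
```

Adding a new website then only means adding one more policy entry, which is one plausible way to realise the reusability and scalability attributes the abstract mentions.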
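The abstract evaluates the crawled data by training a language model and measuring perplexity on held-out test text, with lower perplexity indicating a better match. As a generic illustration (not the project's toolkit, whose name the record does not give), perplexity for a unigram model is exp of the negative average log-probability of the test tokens; the add-one smoothing and the toy corpus below are assumptions for the sketch.

```python
# Minimal illustration of perplexity: PPL = exp(-(1/N) * sum(log p(w_i))).
# Lower PPL means the model predicts the test text better.
import math
from collections import Counter

def train_unigram(tokens, vocab):
    # Add-one (Laplace) smoothing so unseen words keep nonzero probability.
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def perplexity(model, tokens):
    log_prob = sum(math.log(model[w]) for w in tokens)
    return math.exp(-log_prob / len(tokens))

train = "the market rose today the market fell".split()
test = "the market rose".split()
vocab = set(train) | set(test)
model = train_unigram(train, vocab)
ppl = perplexity(model, test)
```

The trend the abstract reports follows from this definition: training text that is closer in date (and larger in volume) assigns higher probability to the test articles' vocabulary, so the average log-probability rises and the perplexity falls.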