Developing web crawler and categorization of newspaper text

The automated categorization (or classification) of texts into predefined categories has witnessed booming interest in the last ten years, due to the increased availability of documents in digital form on the World Wide Web, such as online newspapers, magazines, catalogues, blogs, video transcripts, etc. Exi...

Bibliographic Details
Main Author:	Singh, Rakhi
Other Authors:	Chng Eng Siong
Format:	Final Year Project
Language:	English
Published:	2015
Subjects:	DRNTU::Engineering::Computer science and engineering::Information systems::Information storage and retrieval; DRNTU::Engineering::Computer science and engineering::Computing methodologies::Document and text processing
Online Access:	http://hdl.handle.net/10356/62888
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-62888
record_format	dspace
spelling	sg-ntu-dr.10356-62888 2023-03-03T20:47:44Z Developing web crawler and categorization of newspaper text Singh, Rakhi Chng Eng Siong School of Computer Engineering Emerging Research Lab Bachelor of Engineering (Computer Science) 2015-04-30T07:23:07Z 2015-04-30T07:23:07Z 2015 2015 Final Year Project (FYP) http://hdl.handle.net/10356/62888 en Nanyang Technological University 46 p. application/pdf
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	DRNTU::Engineering::Computer science and engineering::Information systems::Information storage and retrieval DRNTU::Engineering::Computer science and engineering::Computing methodologies::Document and text processing
spellingShingle	DRNTU::Engineering::Computer science and engineering::Information systems::Information storage and retrieval DRNTU::Engineering::Computer science and engineering::Computing methodologies::Document and text processing Singh, Rakhi Developing web crawler and categorization of newspaper text
description	The automated categorization (or classification) of texts into predefined categories has attracted rapidly growing interest over the last ten years, owing to the increased availability of documents in digital form on the World Wide Web, such as online newspapers, magazines, catalogues, blogs, and video transcripts. Existing supervised machine-learning text classification models face the challenge of needing a large corpus of labelled data to train the language models. One approach to this problem is to use the already categorised news articles that are readily available on the internet. In this project, a modular English text crawler is developed that can be extended to multiple languages and is capable of automatically crawling online newspaper archives and extracting new keywords and categories. The corpus is further smoothed and transformed into human-speakable form using appropriate language-specific normalisation techniques. The crawler has mined over 1.16 GB of data covering 2006 to 2012. This normalised corpus is used to build a bigram-probability statistical language model for each category. These single-label classifiers are then combined to form a text classification model. A document can be assigned to multiple categories with a degree of ranking, but the primary focus of this project is on assigning the single most probable category to each news article, chosen as the category with the lowest perplexity value (highest similarity). The resulting classification model is reported to be more robust than most of its currently available counterparts. In perplexity tests conducted on randomly chosen articles, the system achieved an average accuracy of 99.37% and an average precision of 98.75%.
author2	Chng Eng Siong
author_facet	Chng Eng Siong Singh, Rakhi
format	Final Year Project
author	Singh, Rakhi
author_sort	Singh, Rakhi
title	Developing web crawler and categorization of newspaper text
title_short	Developing web crawler and categorization of newspaper text
title_full	Developing web crawler and categorization of newspaper text
title_fullStr	Developing web crawler and categorization of newspaper text
title_full_unstemmed	Developing web crawler and categorization of newspaper text
title_sort	developing web crawler and categorization of newspaper text
publishDate	2015
url	http://hdl.handle.net/10356/62888
_version_	1759855385079971840
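
The description above outlines classification by per-category bigram language models, with the lowest-perplexity category winning. The following is a minimal sketch of that idea only, not the project's implementation: the whitespace tokenizer, the add-one (Laplace) smoothing, the shared vocabulary size, and the names BigramModel, train, classify and TOY_CORPUS are all assumptions made for illustration, and the toy corpus is invented.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"  # sentence boundary markers (assumed convention)


def tokenize(text):
    """Lowercase, split on whitespace, and add boundary markers."""
    return [BOS] + text.lower().split() + [EOS]


class BigramModel:
    """Bigram counts for one category, scored with add-one smoothing."""

    def __init__(self, documents):
        self.history = Counter()  # count of each token used as a bigram history
        self.bigrams = Counter()  # count of each (previous, current) pair
        for doc in documents:
            tokens = tokenize(doc)
            self.history.update(tokens[:-1])
            self.bigrams.update(zip(tokens, tokens[1:]))

    def perplexity(self, text, vocab_size):
        """exp of the negative mean log-probability over the text's bigrams."""
        tokens = tokenize(text)
        pairs = list(zip(tokens, tokens[1:]))
        log_sum = 0.0
        for prev, word in pairs:
            # Add-one smoothing is an assumption; the source names no method.
            p = (self.bigrams[(prev, word)] + 1) / (self.history[prev] + vocab_size)
            log_sum += math.log(p)
        return math.exp(-log_sum / len(pairs))


def train(corpus):
    """corpus maps category name -> list of training texts."""
    vocab = {tok for docs in corpus.values() for doc in docs for tok in tokenize(doc)}
    models = {cat: BigramModel(docs) for cat, docs in corpus.items()}
    # One extra slot stands in for words never seen in training.
    return models, len(vocab) + 1


def classify(text, models, vocab_size):
    """Return (best_category, perplexity_by_category); lowest perplexity wins."""
    scores = {cat: m.perplexity(text, vocab_size) for cat, m in models.items()}
    return min(scores, key=scores.get), scores


# Invented toy data, standing in for the crawled and normalised news corpus.
TOY_CORPUS = {
    "sports": ["the team won the match", "the striker scored a late goal"],
    "finance": ["the bank raised interest rates", "shares fell as the market closed"],
}

if __name__ == "__main__":
    models, vocab_size = train(TOY_CORPUS)
    label, scores = classify("the team scored a goal", models, vocab_size)
    print(label, scores)
```

Sharing one vocabulary size across all category models keeps the perplexities comparable. Sorting the returned scores in ascending order would give the ranked multi-category assignment that the description mentions.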

Similar Items