Online text mining

Training a language model from conversational speech is difficult due to the large variation in how conversational speech is produced. Deriving conversational speech data through direct transcription is costly and impractical for a large corpus. A solution to this is to utilize the text that is easily a...


Bibliographic Details
Main Author: Tan, Abel Peng Heng
Other Authors: Chng Eng Siong
Format: Final Year Project
Language: English
Published: 2014
Subjects:
Online Access: http://hdl.handle.net/10356/61577
Institution: Nanyang Technological University
Language: English
id sg-ntu-dr.10356-61577
record_format dspace
spelling sg-ntu-dr.10356-615772023-03-03T20:41:02Z Online text mining Tan, Abel Peng Heng Chng Eng Siong School of Computer Engineering Emerging Research Lab DRNTU::Engineering::Computer science and engineering::Computer applications::Administrative data processing Training a language model from conversational speech is difficult due to the large variation in how conversational speech is produced. Deriving conversational speech data through direct transcription is costly and impractical for a large corpus. A solution to this is to utilize text that is easily available on the Internet to improve an existing language model built from broadcast news. In this project, the author developed an automated application capable of mining text from online sources, transforming the data into human-speakable form through normalization techniques, and using it to generate a language model for adaptation of the existing language model. The system mined 16 years of data amounting to 1.69 GB of news article text. Through smoothing and interpolation-weight analysis, the author improved the perplexity of the existing system built from broadcast news text significantly, by 48.5%. Previous findings had shown that a larger corpus can improve the perplexity of a language model; however, there is a constant need to find better ways of using data instead of massively crawling data across the Internet. Thus the second objective of this project is to investigate the effectiveness of improving a language model with the latest data: to find out whether constantly crawling for new data improves a language model, or whether the change in perplexity is so small as to be negligible. In this report, the author conducted 8 experiments using 10 years of past data to establish a baseline perplexity. In each experiment, new data was added to the base model before a perplexity test was carried out again. The findings showed that although the new data averaged only 0.18% of the baseline's training data, it yielded an average perplexity improvement of 1.9%. Therefore the author concludes that new data is of high importance and should always be crawled to improve a language model whose usage patterns change with new data. The final experiment of this project created language models with a range of vocabulary sizes. Tests on them revealed that an increase in vocabulary size actually increases perplexity. Bachelor of Engineering (Computer Science) 2014-06-16T01:42:55Z 2014-06-16T01:42:55Z 2014 2014 Final Year Project (FYP) http://hdl.handle.net/10356/61577 en Nanyang Technological University 54 p. application/pdf
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
Singapore
content_provider NTU Library
collection DR-NTU
language English
topic DRNTU::Engineering::Computer science and engineering::Computer applications::Administrative data processing
spellingShingle DRNTU::Engineering::Computer science and engineering::Computer applications::Administrative data processing
Tan, Abel Peng Heng
Online text mining
description Training a language model from conversational speech is difficult due to the large variation in how conversational speech is produced. Deriving conversational speech data through direct transcription is costly and impractical for a large corpus. A solution to this is to utilize text that is easily available on the Internet to improve an existing language model built from broadcast news. In this project, the author developed an automated application capable of mining text from online sources, transforming the data into human-speakable form through normalization techniques, and using it to generate a language model for adaptation of the existing language model. The system mined 16 years of data amounting to 1.69 GB of news article text. Through smoothing and interpolation-weight analysis, the author improved the perplexity of the existing system built from broadcast news text significantly, by 48.5%. Previous findings had shown that a larger corpus can improve the perplexity of a language model; however, there is a constant need to find better ways of using data instead of massively crawling data across the Internet. Thus the second objective of this project is to investigate the effectiveness of improving a language model with the latest data: to find out whether constantly crawling for new data improves a language model, or whether the change in perplexity is so small as to be negligible. In this report, the author conducted 8 experiments using 10 years of past data to establish a baseline perplexity. In each experiment, new data was added to the base model before a perplexity test was carried out again. The findings showed that although the new data averaged only 0.18% of the baseline's training data, it yielded an average perplexity improvement of 1.9%.
Therefore the author concludes that new data is of high importance and should always be crawled to improve a language model whose usage patterns change with new data. The final experiment of this project created language models with a range of vocabulary sizes. Tests on them revealed that an increase in vocabulary size actually increases perplexity.
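The adaptation step the abstract describes, combining a background broadcast-news model with a model built from crawled web text via interpolation weights and comparing perplexity, can be sketched as follows. This is a minimal unigram illustration with invented toy corpora and add-one smoothing; the actual project used standard n-gram language-model tooling, and none of these function names or data come from the thesis.

```python
import math
from collections import Counter

def unigram_probs(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def interpolate(p_base, p_adapt, lam):
    """Linear interpolation: p(w) = lam * p_base(w) + (1 - lam) * p_adapt(w)."""
    return {w: lam * p_base[w] + (1 - lam) * p_adapt[w] for w in p_base}

def perplexity(model, tokens):
    """Perplexity = exp of the average negative log-probability."""
    nll = -sum(math.log(model[w]) for w in tokens)
    return math.exp(nll / len(tokens))

# Toy data: "broadcast news" background text vs. freshly crawled web text.
base_tokens = "the market rose today the market fell".split()
web_tokens = "the market rallied online traders cheered online".split()
test_tokens = "the market rallied".split()

vocab = set(base_tokens) | set(web_tokens) | set(test_tokens)
p_base = unigram_probs(base_tokens, vocab)
p_web = unigram_probs(web_tokens, vocab)

ppl_base = perplexity(p_base, test_tokens)
ppl_mix = perplexity(interpolate(p_base, p_web, 0.5), test_tokens)
# Mixing in web-text probability mass for words the background model
# rarely saw ("rallied") lowers perplexity on the test text.
```

Sweeping the interpolation weight `lam` and picking the value that minimizes perplexity on held-out text is the usual way the interpolation-weight analysis mentioned above is carried out.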
author2 Chng Eng Siong
author_facet Chng Eng Siong
Tan, Abel Peng Heng
format Final Year Project
author Tan, Abel Peng Heng
author_sort Tan, Abel Peng Heng
title Online text mining
title_short Online text mining
title_full Online text mining
title_fullStr Online text mining
title_full_unstemmed Online text mining
title_sort online text mining
publishDate 2014
url http://hdl.handle.net/10356/61577
_version_ 1759855072331694080