Online text mining

Training a language model from conversational speech is difficult due to the large variation in how conversational speech is produced. Deriving conversational speech data through direct transcription is costly and impractical for a large corpus. A solution to this is to utilize the text that is easily a...


Bibliographic Details
Main Author: Tan, Abel Peng Heng
Other Authors: Chng Eng Siong
Format: Final Year Project
Language: English
Published: 2014
Subjects:
Online Access: http://hdl.handle.net/10356/61577
Institution: Nanyang Technological University
Language: English
id sg-ntu-dr.10356-61577
record_format dspace
spelling sg-ntu-dr.10356-615772023-03-03T20:41:02Z Online text mining Tan, Abel Peng Heng Chng Eng Siong School of Computer Engineering Emerging Research Lab DRNTU::Engineering::Computer science and engineering::Computer applications::Administrative data processing Training a language model from conversational speech is difficult due to the large variation in how conversational speech is produced. Deriving conversational speech data through direct transcription is costly and impractical for a large corpus. A solution to this is to utilize text that is easily available on the Internet to improve an existing language model built from broadcast news. In this project, the author developed an automated application capable of mining text from online sources, transforming the data into human-speakable form through normalization techniques, and using it to generate a language model for adaptation of the existing language model. The system mined 16 years of data amounting to 1.69 GB of news article text. Through smoothing and interpolation-weight analysis, the author improved the perplexity of the existing system built from broadcast news text significantly, by 48.5%. Previous findings had shown that a larger corpus can improve the perplexity of a language model; however, there is a constant need to find better ways of using data instead of massively crawling data across the Internet. Thus the second objective of this project is to investigate the effectiveness of improving a language model with the latest data: to find out whether constantly crawling for new data improves a language model, or whether the change in perplexity is so small as to be negligible. In this report, the author conducted 8 experiments using 10 years of past data to establish a baseline perplexity. In each experiment, new data was added to the base model before a perplexity test was carried out again. The findings showed that although the new data averaged only 0.18% of the baseline's training data, it yielded an average perplexity improvement of 1.9%. Therefore the author concludes that new data is of high importance and should always be crawled to improve a language model whose usage patterns change with new data. The final experiment of this project created language models with a range of vocabulary sizes. Tests on them revealed that an increase in vocabulary size actually increases perplexity. Bachelor of Engineering (Computer Science) 2014-06-16T01:42:55Z 2014-06-16T01:42:55Z 2014 2014 Final Year Project (FYP) http://hdl.handle.net/10356/61577 en Nanyang Technological University 54 p. application/pdf
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
Singapore
content_provider NTU Library
collection DR-NTU
language English
topic DRNTU::Engineering::Computer science and engineering::Computer applications::Administrative data processing
spellingShingle DRNTU::Engineering::Computer science and engineering::Computer applications::Administrative data processing
Tan, Abel Peng Heng
Online text mining
description Training a language model from conversational speech is difficult due to the large variation in how conversational speech is produced. Deriving conversational speech data through direct transcription is costly and impractical for a large corpus. A solution to this is to utilize text that is easily available on the Internet to improve an existing language model built from broadcast news. In this project, the author developed an automated application capable of mining text from online sources, transforming the data into human-speakable form through normalization techniques, and using it to generate a language model for adaptation of the existing language model. The system mined 16 years of data amounting to 1.69 GB of news article text. Through smoothing and interpolation-weight analysis, the author improved the perplexity of the existing system built from broadcast news text significantly, by 48.5%. Previous findings had shown that a larger corpus can improve the perplexity of a language model; however, there is a constant need to find better ways of using data instead of massively crawling data across the Internet. Thus the second objective of this project is to investigate the effectiveness of improving a language model with the latest data: to find out whether constantly crawling for new data improves a language model, or whether the change in perplexity is so small as to be negligible. In this report, the author conducted 8 experiments using 10 years of past data to establish a baseline perplexity. In each experiment, new data was added to the base model before a perplexity test was carried out again. The findings showed that although the new data averaged only 0.18% of the baseline's training data, it yielded an average perplexity improvement of 1.9%.
Therefore the author concludes that new data is of high importance and should always be crawled to improve a language model whose usage patterns change with new data. The final experiment of this project created language models with a range of vocabulary sizes. Tests on them revealed that an increase in vocabulary size actually increases perplexity.
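The adaptation step the abstract describes, combining a background broadcast-news model with a model built from crawled web text via interpolation weights and comparing perplexity, can be sketched as follows. This is a minimal unigram illustration with invented toy corpora and add-one smoothing; the actual project used standard n-gram language-model tooling, and none of these function names or data come from the thesis.

```python
import math
from collections import Counter

def unigram_probs(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def interpolate(p_base, p_adapt, lam):
    """Linear interpolation: p(w) = lam * p_base(w) + (1 - lam) * p_adapt(w)."""
    return {w: lam * p_base[w] + (1 - lam) * p_adapt[w] for w in p_base}

def perplexity(model, tokens):
    """Perplexity = exp of the average negative log-probability."""
    nll = -sum(math.log(model[w]) for w in tokens)
    return math.exp(nll / len(tokens))

# Toy data: "broadcast news" background text vs. freshly crawled web text.
base_tokens = "the market rose today the market fell".split()
web_tokens = "the market rallied online traders cheered online".split()
test_tokens = "the market rallied".split()

vocab = set(base_tokens) | set(web_tokens) | set(test_tokens)
p_base = unigram_probs(base_tokens, vocab)
p_web = unigram_probs(web_tokens, vocab)

ppl_base = perplexity(p_base, test_tokens)
ppl_mix = perplexity(interpolate(p_base, p_web, 0.5), test_tokens)
# Mixing in web-text probability mass for words the background model
# rarely saw ("rallied") lowers perplexity on the test text.
```

Sweeping the interpolation weight `lam` and picking the value that minimizes perplexity on held-out text is the usual way the interpolation-weight analysis mentioned above is carried out.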
author2 Chng Eng Siong
author_facet Chng Eng Siong
Tan, Abel Peng Heng
format Final Year Project
author Tan, Abel Peng Heng
author_sort Tan, Abel Peng Heng
title Online text mining
title_short Online text mining
title_full Online text mining
title_fullStr Online text mining
title_full_unstemmed Online text mining
title_sort online text mining
publishDate 2014
url http://hdl.handle.net/10356/61577
_version_ 1759855072331694080