Online text mining
Training a language model for conversational speech is difficult due to the large variation in how conversational speech is produced. Deriving conversational speech data through direct transcription is costly and impractical for a large corpus. A solution to this is to utilize the text that is easily a...
Saved in:
Main Author: Tan, Abel Peng Heng
Other Authors: Chng Eng Siong
Format: Final Year Project
Language: English
Published: 2014
Subjects: DRNTU::Engineering::Computer science and engineering::Computer applications::Administrative data processing
Online Access: http://hdl.handle.net/10356/61577
Institution: Nanyang Technological University
id: sg-ntu-dr.10356-61577
record_format: dspace
spelling: sg-ntu-dr.10356-61577 2023-03-03T20:41:02Z | Online text mining | Tan, Abel Peng Heng | Chng Eng Siong | School of Computer Engineering | Emerging Research Lab | DRNTU::Engineering::Computer science and engineering::Computer applications::Administrative data processing | (abstract: see description below) | Bachelor of Engineering (Computer Science) | 2014-06-16T01:42:55Z | 2014-06-16T01:42:55Z | 2014 | 2014 | Final Year Project (FYP) | http://hdl.handle.net/10356/61577 | en | Nanyang Technological University | 54 p. | application/pdf
institution: Nanyang Technological University
building: NTU Library
continent: Asia
country: Singapore
content_provider: NTU Library
collection: DR-NTU
language: English
topic: DRNTU::Engineering::Computer science and engineering::Computer applications::Administrative data processing
spellingShingle: DRNTU::Engineering::Computer science and engineering::Computer applications::Administrative data processing | Tan, Abel Peng Heng | Online text mining
description: Training a language model for conversational speech is difficult due to the large variation in how conversational speech is produced. Deriving conversational speech data through direct transcription is costly and impractical for a large corpus. A solution to this is to utilize the text that is easily available on the Internet to improve an existing language model built from broadcast news. In this project, the author developed an automated application capable of mining text from online sources and transforming the data into human speak-able form through normalization techniques before using it to generate a language model for adaptation, improving the existing language model. The system developed mined 16 years of data, amounting to 1.69 GB of news article text. Through smoothing techniques and interpolation weight analysis, the author improved the perplexity of the existing system built from broadcast news text significantly, by 48.5%.
The previous finding showed that a larger corpus could improve the perplexity of a language model; however, there is a constant need to find better ways to make use of data instead of massively crawling data across the Internet. Thus the second objective of this project is to investigate the effectiveness of improving a language model based on the latest data: to find out whether constantly crawling for new data is good for improving a language model, or whether the change in perplexity is so small that it is negligible. In this report, the author conducted 8 experiments using 10 years of past data to establish a baseline perplexity. In each experiment, new data was added to the base model before a perplexity test was carried out again. The findings showed that although the new data averaged only 0.18% of the baseline's training data, it yielded an average perplexity improvement of 1.9%. Therefore the author concludes that new data is of high importance and should always be crawled to improve a language model whose usage patterns change with new data. The final experiment of this project created language models using a range of vocabulary sizes. Tests on them revealed that an increase in vocabulary size actually increases the perplexity.
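The abstract mentions normalizing crawled text into "human speak-able" form before language-model training. As a rough illustration of what such normalization involves (the report's actual rules are not given in this record; the number-word tables and regexes below are assumptions), a minimal Python sketch:

```python
# Illustrative sketch, not the project's code: normalize crawled article
# text into a spoken form (lowercase, numbers spelled out, punctuation
# removed) before using it as language-model training data.
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def number_to_words(n: int) -> str:
    """Spell out 0-99; larger numbers are read digit by digit."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"
    return " ".join(ONES[int(d)] for d in str(n))

def normalize(text: str) -> str:
    text = text.lower()
    # Expand standalone integers into words.
    text = re.sub(r"\b\d+\b", lambda m: number_to_words(int(m.group())), text)
    # Strip symbols and punctuation that would never be spoken.
    text = re.sub(r"[^a-z' ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize("The index rose 48 points, or 2%, on Monday."))
# -> "the index rose forty eight points or two on monday"
```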
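The abstract reports perplexity figures and interpolation weight analysis without giving formulas. The standard definitions behind these terms are as follows, for a test sequence of N words and a linear interpolation of a crawled-text model with a broadcast-news model (the subscripts are illustrative labels, not the report's notation):

```latex
% Perplexity of a language model on a test sequence of N words:
\[
\mathrm{PP}(w_1,\dots,w_N)
  = P(w_1,\dots,w_N)^{-1/N}
  = \exp\!\Big(-\frac{1}{N}\sum_{i=1}^{N}\log P(w_i \mid w_1,\dots,w_{i-1})\Big)
\]
% Linear interpolation of the web-text and broadcast-news models,
% with interpolation weight \lambda and history h:
\[
P_{\text{interp}}(w \mid h)
  = \lambda\, P_{\text{web}}(w \mid h) + (1-\lambda)\, P_{\text{news}}(w \mid h),
  \qquad 0 \le \lambda \le 1 .
\]
```

Under the usual relative-reduction reading, a 48.5% perplexity improvement means the adapted model's perplexity is 0.515 times that of the broadcast-news baseline.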
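To make the interpolation weight analysis concrete, here is a self-contained toy sketch. It is not the project's code: it uses add-k smoothed unigram models and invented miniature corpora, whereas the project built n-gram models from 1.69 GB of crawled text, but the weight sweep mirrors the tuning the abstract describes.

```python
# Toy sketch: interpolate two smoothed unigram models and sweep the
# interpolation weight, keeping track of test-set perplexity.
from collections import Counter
import math

def unigram_probs(tokens, vocab, k=1.0):
    """Add-k smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + k * len(vocab)
    return {w: (counts[w] + k) / total for w in vocab}

def perplexity(probs, test_tokens):
    """Perplexity = exp(-1/N * sum of log P(w))."""
    log_sum = sum(math.log(probs[w]) for w in test_tokens)
    return math.exp(-log_sum / len(test_tokens))

def interpolate(p_base, p_web, lam):
    """P(w) = lam * P_web(w) + (1 - lam) * P_base(w)."""
    return {w: lam * p_web[w] + (1 - lam) * p_base[w] for w in p_base}

# Invented toy data standing in for broadcast-news and crawled web text.
base_text = "the news said the market rose today".split()
web_text = "the market fell today and traders said more news is due".split()
test_text = "the market rose and fell today".split()

vocab = set(base_text) | set(web_text) | set(test_text)
p_base = unigram_probs(base_text, vocab)
p_web = unigram_probs(web_text, vocab)

# Interpolation-weight analysis in miniature: sweep lambda and report
# the perplexity of each mixture against the held-out test text.
for lam in (0.0, 0.25, 0.5, 0.75):
    pp = perplexity(interpolate(p_base, p_web, lam), test_text)
    print(f"lambda={lam:.2f}  perplexity={pp:.2f}")
```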
author2: Chng Eng Siong
author_facet: Chng Eng Siong | Tan, Abel Peng Heng
format: Final Year Project
author: Tan, Abel Peng Heng
author_sort: Tan, Abel Peng Heng
title: Online text mining
title_short: Online text mining
title_full: Online text mining
title_fullStr: Online text mining
title_full_unstemmed: Online text mining
title_sort: online text mining
publishDate: 2014
url: http://hdl.handle.net/10356/61577
_version_: 1759855072331694080