Online text mining for conversational speech recognition
Conversational text is highly varied, and many abbreviations and short forms exist across languages. Manually entering every possible term would be difficult, and some terms would inevitably be missed, which makes the compilation of conversational texts a difficult task. ...
Saved in:
Main Author: | Thong, Kian Hoong |
---|---|
Other Authors: | School of Computer Engineering; Centre for Advanced Information Systems; Chng Eng Siong |
Format: | Final Year Project |
Language: | English |
Published: | 2013 |
Subjects: | DRNTU::Engineering::Computer science and engineering::Computing methodologies::Pattern recognition |
Online Access: | http://hdl.handle.net/10356/55014 |
Institution: | Nanyang Technological University |
id | sg-ntu-dr.10356-55014 |
---|---|
record_format | dspace |
institution | Nanyang Technological University |
building | NTU Library |
continent | Asia |
country | Singapore |
content_provider | NTU Library |
collection | DR-NTU |
language | English |
topic | DRNTU::Engineering::Computer science and engineering::Computing methodologies::Pattern recognition |
description |
Conversational text is highly varied, and many abbreviations and short forms exist across languages. Manually entering every possible term would be difficult, and some terms would inevitably be missed, which makes the compilation of conversational texts a difficult task. This project aims to use today's major search engines, such as Google and Bing, to crawl the web for conversational text to add to the language model. It also applies methods to minimize the clutter present in the final text that is fed into the language model.
Much research went into understanding the three aspects of this project, namely web crawling, normalization and language modelling. Drawing on academic literature and the Internet, the web crawler was developed to meet the need of obtaining a conversational corpus. It uses filtering and history tracking to ensure that the collected data is readable and not repeated.
At the conclusion of this project, a substantial amount of data had been collected from the Internet using a combination of normalization techniques and web-crawling APIs. The data was then used to build a language model, which was evaluated against the test data. The resulting perplexity indicates whether the crawled data achieves a lower (better) perplexity than the manually transcribed training data.
This report contains the research and data used to optimize the search engine program, as well as reflections on lessons learnt throughout the process. |
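The abstract describes the crawler's filtering and history tracking only at a high level. As an illustration of what such a pipeline can look like, here is a minimal Python sketch of duplicate suppression and a crude readability filter for crawled conversational lines; the function names, the hash-based history and the letter-ratio heuristic are illustrative assumptions, not the project's actual implementation.

```python
# Illustrative sketch only: hash-based history tracking (to skip repeated
# lines) and a crude readability filter for crawled conversational text.
# The heuristics and names here are assumptions, not the project's code.
import hashlib
import re

seen_hashes = set()  # history of lines already kept


def normalize(line: str) -> str:
    """Light normalization: lowercase and collapse whitespace."""
    return re.sub(r"\s+", " ", line.lower()).strip()


def is_readable(line: str) -> bool:
    """Keep lines that look like prose rather than markup or noise."""
    if not (3 <= len(line) <= 300):
        return False
    letters = sum(ch.isalpha() for ch in line)
    return letters / len(line) > 0.6


def keep_new_lines(raw_lines):
    """Yield normalized lines that pass the filter and were not seen before."""
    for line in raw_lines:
        norm = normalize(line)
        if not is_readable(norm):
            continue
        digest = hashlib.md5(norm.encode("utf-8")).hexdigest()
        if digest in seen_hashes:  # already collected: skip the repeat
            continue
        seen_hashes.add(digest)
        yield norm


if __name__ == "__main__":
    sample = ["Hello, how r u today?", "Hello, how r u today?", "<div>###</div>"]
    print(list(keep_new_lines(sample)))  # -> ['hello, how r u today?']
```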
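The evaluation the abstract refers to is language-model perplexity on test data: a lower perplexity means the training text predicts the conversational test set better, which is how the crawled corpus is compared with the manually transcribed one. A minimal sketch of perplexity for a unigram model with add-one smoothing is shown below; the toy corpora, the unigram order and the smoothing choice are assumptions for illustration, not the models actually built in the project.

```python
# Illustrative sketch only: perplexity of a unigram language model with
# add-one smoothing. The toy data and model order are assumptions; the
# project compares crawled vs. manually transcribed training text.
import math
from collections import Counter


def train_unigram(tokens):
    """Return (counts, total token count) for a unigram model."""
    counts = Counter(tokens)
    return counts, sum(counts.values())


def perplexity(counts, total, vocab_size, test_tokens):
    """PP = exp(-1/N * sum log P(w)), with add-one (Laplace) smoothing."""
    log_prob = 0.0
    for w in test_tokens:
        p = (counts.get(w, 0) + 1) / (total + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(test_tokens))


if __name__ == "__main__":
    train = "how r u today i am fine how are you".split()
    test = "how r u i am ok".split()
    counts, total = train_unigram(train)
    vocab = set(train) | set(test)  # closed vocabulary for the toy example
    print(round(perplexity(counts, total, len(vocab), test), 2))
```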
author2 | School of Computer Engineering |
format | Final Year Project |
author | Thong, Kian Hoong. |
title | Online text mining for conversational speech recognition |
publishDate | 2013 |
url | http://hdl.handle.net/10356/55014 |