Enhanced word length and model elimination algorithms for language identification
Language identification is the process of determining the natural language of text documents using computational methods. The quality and size of the text available for generating the necessary models have a significant impact on the performance of the algorithms used to determine the language of a tex...
Main Author: | Akosu, Nicholas Iornongu |
---|---|
Format: | Thesis |
Language: | English |
Published: | 2014 |
Subjects: | QA75 Electronic computers. Computer science |
Online Access: | http://eprints.utm.my/id/eprint/78255/1/NicholasIornouguAkosuPFC2014.pdf http://eprints.utm.my/id/eprint/78255/ http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:98107 |
Institution: | Universiti Teknologi Malaysia |
id | my.utm.78255 |
---|---|
record_format | eprints |
spelling | my.utm.78255 2018-08-03T08:49:39Z http://eprints.utm.my/id/eprint/78255/ Enhanced word length and model elimination algorithms for language identification Akosu, Nicholas Iornongu QA75 Electronic computers. Computer science 2014-10 Thesis NonPeerReviewed application/pdf en http://eprints.utm.my/id/eprint/78255/1/NicholasIornouguAkosuPFC2014.pdf Akosu, Nicholas Iornongu (2014) Enhanced word length and model elimination algorithms for language identification. PhD thesis, Universiti Teknologi Malaysia, Faculty of Computing. http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:98107 |
institution | Universiti Teknologi Malaysia |
building | UTM Library |
collection | Institutional Repository |
continent | Asia |
country | Malaysia |
content_provider | Universiti Teknologi Malaysia |
content_source | UTM Institutional Repository |
url_provider | http://eprints.utm.my/ |
language | English |
topic | QA75 Electronic computers. Computer science |
description | Language identification is the process of determining the natural language of text documents using computational methods. The quality and size of the text available for generating the necessary models have a significant impact on the performance of the algorithms used to determine the language of a text. The ability to correctly identify the language of a document is required to ensure the effectiveness of information retrieval systems in a multilingual setting. Unfortunately, existing methods used to model natural language have several limitations: they cannot produce reliable models from a small amount of training text, they handle multilingual documents inconsistently, they require long training times, and they cannot distinguish closely related languages. The spelling checker technique has been shown to distinguish closely related languages successfully, but it is often hampered by two important constraints: inefficient run time performance and the non-availability of spelling checkers for many languages. The aim of this study is to address these problems by developing improved algorithms that enhance run time performance and accuracy irrespective of the size of the available corpus. To this end, the thesis proposes three algorithms. Firstly, the word length algorithm implements the bag-of-words model using word length information. Secondly, the model elimination algorithm is designed to further improve run time performance by taking advantage of word frequency in training and testing documents. By monitoring the performance of models in the course of processing, this algorithm dynamically selects non-performing models for elimination without compromising accuracy. Thirdly, the linear combination algorithm merges the strengths of the word length and model elimination algorithms by feeding word length features into the model elimination algorithm.
Empirical results from the proposed algorithms on test collections from standard corpora are superior to those of existing methods in distinguishing closely related languages and in multilingual identification. In addition, the word length, model elimination and linear combination algorithms have better run time performance than the spelling checker method that uses a similar scoring technique, yielding average time gains of 57%, 83% and 98.4% respectively in the identification of 140-byte-long text. |
format | Thesis |
author | Akosu, Nicholas Iornongu |
title | Enhanced word length and model elimination algorithms for language identification |
publishDate | 2014 |
url | http://eprints.utm.my/id/eprint/78255/1/NicholasIornouguAkosuPFC2014.pdf http://eprints.utm.my/id/eprint/78255/ http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:98107 |
_version_ | 1643657842659426304 |
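
The abstract describes the word length and model elimination ideas only at a high level, so the following Python sketch is an illustration under stated assumptions rather than the thesis's actual algorithms. It assumes that each language model is a set of training words grouped by word length, that a test word scores one point for a language when it appears in that language's bucket for its length, and that a model is eliminated when it trails the current leader by a fixed margin at a periodic check. The names `build_model`, `identify`, `check_every` and `margin`, and the toy training sentences, are hypothetical.

```python
from collections import defaultdict


def build_model(training_text):
    """Group the distinct training words by length (assumed model layout)."""
    by_length = defaultdict(set)
    for word in training_text.lower().split():
        by_length[len(word)].add(word)
    return by_length


def identify(text, models, check_every=3, margin=2):
    """Score models word by word, dropping models that fall too far behind.

    Returns the best-scoring language and the set of models still active.
    The elimination rule (trail the leader by `margin` points) is an
    assumption made for this sketch.
    """
    scores = {lang: 0 for lang in models}
    active = set(models)
    for i, word in enumerate(text.lower().split(), start=1):
        for lang in active:
            # Only the bucket matching this word's length is consulted.
            if word in models[lang].get(len(word), ()):
                scores[lang] += 1
        # Periodic check: eliminated models are never scored again.
        if i % check_every == 0 and len(active) > 1:
            best = max(scores[lang] for lang in active)
            active = {lang for lang in active if best - scores[lang] < margin}
    winner = max(sorted(active), key=lambda lang: scores[lang])
    return winner, active


# Toy training data, far smaller than any realistic corpus.
corpora = {
    "en": "the cat sat on the mat and the dog ran",
    "fr": "le chat est assis sur le tapis et le chien court",
    "ms": "kucing itu duduk di atas tikar dan anjing itu lari",
}
models = {lang: build_model(text) for lang, text in corpora.items()}
print(identify("the dog sat on the mat", models))  # ('en', {'en'})
```

In this sketch the saving comes from the shrinking `active` set: once a model is dropped, no further lookups are made against it for the rest of the text. Whether that matches the source of the time gains reported in the abstract is not something the record states.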