AutoCor: Automatic acquisition of corpora of closely-related languages

AutoCor is a method for the automatic acquisition and classification of corpora of documents in closely related languages. It is an extension and enhancement of CorpusBuilder, a system that automatically builds corpora of a specific minority language from a closed corpus (Ghani et al., 2001a). AutoCor used odds ratio, the query generation method reported to produce the best results in CorpusBuilder. It considered closely related languages rather than a single minority language and introduced common word pruning of the language models of the closely related languages, which was found to improve the precision of the system. The method was implemented in PHP and Perl and tested on three of the most closely related languages in the Philippines, namely Bicolano, Cebuano, and Tagalog (Fortunato, 1993). Each target language was tested at query lengths 1 to 5, with 100 generated queries per query length, both with and without pruning. Precision and recall were computed per query, and average precision was computed per query length. The results show that common word pruning improved the precision of the system, with the largest gains of 52.96% for Bicolano at query length 4, 18.00% for Cebuano at query length 1, and 19.78% for Tagalog at query length 2.
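
The abstract names the two core steps, odds-ratio query generation and common word pruning, but does not spell them out. The Python sketch below illustrates one way the two steps could fit together; the unigram models, add-one smoothing, pooling of the sibling languages as the non-relevant class, and the toy sentences are assumptions made for illustration, not the thesis's PHP/Perl implementation.

import math
from collections import Counter

def unigram_model(docs):
    # Word counts and total token count for one language's document sample.
    counts = Counter(w for doc in docs for w in doc.lower().split())
    return counts, sum(counts.values())

def smoothed_prob(word, counts, total, vocab_size):
    # Add-one smoothing keeps every probability strictly between 0 and 1.
    return (counts[word] + 1) / (total + vocab_size)

def prune_common_words(models, vocab):
    # Common word pruning: drop words that occur in the models of two or
    # more of the closely related languages.
    return {w for w in vocab
            if sum(1 for counts, _ in models.values() if counts[w] > 0) <= 1}

def odds_ratio_query(target_lang, models, query_length, prune=True):
    # Rank candidate query terms for the target language by odds ratio,
    # treating the pooled sibling languages as the non-relevant class.
    vocab = {w for counts, _ in models.values() for w in counts}
    if prune:
        vocab = prune_common_words(models, vocab)
    rel_counts, rel_total = models[target_lang]
    non_counts = Counter()
    for lang, (counts, _) in models.items():
        if lang != target_lang:
            non_counts.update(counts)
    non_total = sum(non_counts.values())
    vocab_size = len(vocab)

    def odds_ratio(w):
        p_rel = smoothed_prob(w, rel_counts, rel_total, vocab_size)
        p_non = smoothed_prob(w, non_counts, non_total, vocab_size)
        return math.log((p_rel * (1 - p_non)) / ((1 - p_rel) * p_non))

    return sorted(vocab, key=odds_ratio, reverse=True)[:query_length]

# Toy usage with invented one-line "documents" per language.
models = {
    "tagalog":  unigram_model(["ang bata ay kumakain ng kanin"]),
    "cebuano":  unigram_model(["ang bata nagakaon og kan-on"]),
    "bicolano": unigram_model(["an aki nagkakakan nin maluto"]),
}
print(odds_ratio_query("tagalog", models, query_length=3))

Calling odds_ratio_query with prune=False would correspond to the no-pruning baseline described in the abstract, so per-query precision could be compared with and without pruning at each query length, as in the reported experiments.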

Bibliographic Details
Main Author: Dimalen, Davis Muhajereen D.
Format: text (Master's thesis)
Language: English
Published: Animo Repository 2004
Subjects: Query languages (Computer science); Corpora (Linguistics); Machine translating; Computational linguistics; QUERY (Information retrieval system); Language and languages; Computer Sciences
Online Access: https://animorepository.dlsu.edu.ph/etd_masteral/3185
https://animorepository.dlsu.edu.ph/context/etd_masteral/article/10023/viewcontent/CDTG003719_P__1_.pdf
Institution: De La Salle University