Autolex: An automatic lexicon builder for minority languages using an open corpus
The aim of this study is to build natural language resources for minority languages and other languages with limited resources. Building these resources manually is tedious and costly. These natural language resources, such as language corpora and lexicons, will be used for natural language processing resea...
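Main Authors: | Buhay, Evan Liz C.; Evardone, Marie Joy P.; Nocon, Hansel B.; Dimalen, Davis Muhajereen D.; Roxas, Rachel Edita O. |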
---|---|
Format: | text |
Published: |
Animo Repository
2010
|
Subjects: | Lexicography—Data processing; Computational linguistics; Computer Sciences |
Online Access: | https://animorepository.dlsu.edu.ph/faculty_research/4041 |
Institution: | De La Salle University |
id |
oai:animorepository.dlsu.edu.ph:faculty_research-4944 |
---|---|
record_format |
eprints |
spelling |
oai:animorepository.dlsu.edu.ph:faculty_research-49442021-08-13T00:23:17Z Autolex: An automatic lexicon builder for minority languages using an open corpus Buhay, Evan Liz C. Evardone, Marie Joy P. Nocon, Hansel B. Dimalen, Davis Muhajereen D. Roxas, Rachel Edita O. The aim of this study is to build natural language resources for languages with limited resources or minority languages. Manually building these resources is tedious and costly. These natural language resources such as a language corpora and lexicon will be used for natural language processing research and system development. Tagalog, a minority language was considered in this study as a test bed. This study exploited the use of the WWW to retrieve documents that are written in a minority language. We employed a frequency-based algorithm to build the lexicon. For our evaluation, we considered 260 Tagalog documents extracted from the web as our corpus. From the corpus, the system automatically selected 1,386 candidate unique words based on the threshold (with value of 10) as the lexical entries. Each lexical entry is validated by a language expert. Our evaluation shows an accuracy of 97.84% and only 2.16% error rate. The error was based on incorrectly spelled words or words that are not Tagalog. 2010-12-01T08:00:00Z text https://animorepository.dlsu.edu.ph/faculty_research/4041 Faculty Research Work Animo Repository Lexicography—Data processing Computational linguistics Computer Sciences |
institution |
De La Salle University |
building |
De La Salle University Library |
continent |
Asia |
country |
Philippines |
content_provider |
De La Salle University Library |
collection |
DLSU Institutional Repository |
topic |
Lexicography—Data processing Computational linguistics Computer Sciences |
spellingShingle |
Lexicography—Data processing Computational linguistics Computer Sciences Buhay, Evan Liz C. Evardone, Marie Joy P. Nocon, Hansel B. Dimalen, Davis Muhajereen D. Roxas, Rachel Edita O. Autolex: An automatic lexicon builder for minority languages using an open corpus |
description |
The aim of this study is to build natural language resources for minority languages and other languages with limited resources. Building these resources manually is tedious and costly. The resources, such as language corpora and lexicons, will be used for natural language processing research and system development. Tagalog, a minority language, was used as the test bed. The study used the World Wide Web to retrieve documents written in a minority language and employed a frequency-based algorithm to build the lexicon. For the evaluation, we used 260 Tagalog documents extracted from the web as the corpus. From this corpus, the system automatically selected 1,386 unique candidate words as lexical entries, based on a threshold value of 10. Each lexical entry was validated by a language expert. The evaluation shows an accuracy of 97.84% and an error rate of 2.16%. The errors were misspelled words or words that are not Tagalog. |
format |
text |
author |
Buhay, Evan Liz C. Evardone, Marie Joy P. Nocon, Hansel B. Dimalen, Davis Muhajereen D. Roxas, Rachel Edita O. |
author_facet |
Buhay, Evan Liz C. Evardone, Marie Joy P. Nocon, Hansel B. Dimalen, Davis Muhajereen D. Roxas, Rachel Edita O. |
author_sort |
Buhay, Evan Liz C. |
title |
Autolex: An automatic lexicon builder for minority languages using an open corpus |
title_short |
Autolex: An automatic lexicon builder for minority languages using an open corpus |
title_full |
Autolex: An automatic lexicon builder for minority languages using an open corpus |
title_fullStr |
Autolex: An automatic lexicon builder for minority languages using an open corpus |
title_full_unstemmed |
Autolex: An automatic lexicon builder for minority languages using an open corpus |
title_sort |
autolex: an automatic lexicon builder for minority languages using an open corpus |
publisher |
Animo Repository |
publishDate |
2010 |
url |
https://animorepository.dlsu.edu.ph/faculty_research/4041 |
_version_ |
1767196014491467776 |
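
The description mentions a frequency-based algorithm that selects candidate words using a threshold of 10. The record does not give the algorithm itself, so the following is only a minimal sketch of one way such a selection could work, not the authors' implementation. The function name `build_lexicon`, the tokenization rule, and the choice to keep words whose count is at least the threshold (rather than strictly above it) are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed tokenization: runs of letters, optionally joined by hyphens
# (e.g. "mag-aral"). The record does not say how Autolex tokenizes text.
WORD_PATTERN = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def build_lexicon(documents, threshold=10):
    """Return the sorted unique words whose corpus frequency meets the threshold.

    `documents` is an iterable of plain-text strings. Treating the threshold
    as inclusive (count >= threshold) is an assumption.
    """
    counts = Counter()
    for text in documents:
        counts.update(WORD_PATTERN.findall(text.lower()))
    return sorted(word for word, count in counts.items() if count >= threshold)


if __name__ == "__main__":
    # Toy corpus: one sentence repeated ten times plus a rare word.
    corpus = ["Ang bata ay mag-aral"] * 10 + ["aso"]
    print(build_lexicon(corpus))
```

In this sketch a word that appears fewer than `threshold` times across the whole corpus is dropped, which is one plausible way a frequency cutoff could filter out rare misspellings before expert validation.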