Autolex: An automatic lexicon builder for minority languages using an open corpus

The aim of this study is to build natural language resources for minority languages and other languages with limited resources. Building these resources manually is tedious and costly. Resources such as language corpora and lexicons will be used for natural language processing resea...

Bibliographic Details
Main Authors: Buhay, Evan Liz C., Evardone, Marie Joy P., Nocon, Hansel B., Dimalen, Davis Muhajereen D., Roxas, Rachel Edita O.
Format: text
Published: Animo Repository 2010
Subjects: Lexicography—Data processing; Computational linguistics; Computer Sciences
Online Access: https://animorepository.dlsu.edu.ph/faculty_research/4041
Institution: De La Salle University
id oai:animorepository.dlsu.edu.ph:faculty_research-4944
record_format eprints
spelling oai:animorepository.dlsu.edu.ph:faculty_research-4944; 2021-08-13T00:23:17Z; Autolex: An automatic lexicon builder for minority languages using an open corpus; Buhay, Evan Liz C.; Evardone, Marie Joy P.; Nocon, Hansel B.; Dimalen, Davis Muhajereen D.; Roxas, Rachel Edita O.; 2010-12-01T08:00:00Z; text; https://animorepository.dlsu.edu.ph/faculty_research/4041; Faculty Research Work; Animo Repository; Lexicography—Data processing; Computational linguistics; Computer Sciences
institution De La Salle University
building De La Salle University Library
continent Asia
country Philippines
content_provider De La Salle University Library
collection DLSU Institutional Repository
topic Lexicography—Data processing
Computational linguistics
Computer Sciences
description The aim of this study is to build natural language resources for minority languages and other languages with limited resources. Building these resources manually is tedious and costly. Resources such as language corpora and lexicons will be used for natural language processing research and system development. Tagalog, a minority language, served as the test bed for this study. The study used the WWW to retrieve documents written in a minority language, and a frequency-based algorithm to build the lexicon. For the evaluation, the corpus consisted of 260 Tagalog documents extracted from the web. From this corpus, the system automatically selected 1,386 unique candidate words as lexical entries, based on a threshold value of 10. Each lexical entry was validated by a language expert. The evaluation shows an accuracy of 97.84% and an error rate of 2.16%. The errors were incorrectly spelled words or words that are not Tagalog.
format text
author Buhay, Evan Liz C.
Evardone, Marie Joy P.
Nocon, Hansel B.
Dimalen, Davis Muhajereen D.
Roxas, Rachel Edita O.
title Autolex: An automatic lexicon builder for minority languages using an open corpus
publisher Animo Repository
publishDate 2010
url https://animorepository.dlsu.edu.ph/faculty_research/4041
_version_ 1767196014491467776
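
The frequency-based selection mentioned in the description can be illustrated with a minimal sketch. This is not the authors' implementation: the record does not say how text was tokenized or exactly how the threshold was applied, so the tokenizer pattern, the function names build_lexicon and accuracy, and the reading of the threshold as "keep words that occur at least 10 times across the corpus" are all assumptions made here for illustration.

```python
import re
from collections import Counter

# Hypothetical tokenizer (an assumption, not taken from the paper): runs of
# letters, optionally joined by an internal hyphen or apostrophe, so that a
# hyphenated form such as "mag-aral" stays a single token.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")


def build_lexicon(documents, threshold=10):
    """Return the sorted unique words occurring at least `threshold` times.

    Assumption: the threshold is a minimum corpus-wide frequency. The record
    only states that a threshold with value 10 was used.
    """
    counts = Counter()
    for text in documents:
        # Lowercase so that "Ang" and "ang" count as the same word.
        counts.update(TOKEN_RE.findall(text.lower()))
    return sorted(word for word, count in counts.items() if count >= threshold)


def accuracy(valid_entries, total_entries):
    """Percentage of candidate entries accepted by the validating expert."""
    return 100.0 * valid_entries / total_entries


if __name__ == "__main__":
    # Toy corpus, invented for this example only.
    docs = ["ang bata ang aso", "Ang pusa"]
    print(build_lexicon(docs, threshold=3))
```

With a low threshold, rare tokens such as misspellings and stray foreign words enter the lexicon; raising it trades coverage for cleaner entries, which matches the error sources the description reports. As a consistency check on the reported figures, accuracy(1356, 1386) rounds to 97.84; the count of 1,356 accepted entries is back-calculated here from the reported percentages and is not stated in the record.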