Automatic bilingual lexicon extraction for a minority target language
Main Authors: | Eileen Pamela Tiu, Rachel Edita O. Roxas |
---|---|
Format: | text |
Published: | Animo Repository 2008 |
Subjects: | |
Online Access: | https://animorepository.dlsu.edu.ph/faculty_research/4040 |
Institution: | De La Salle University |
Summary: | An automated approach to extracting a bilingual lexicon from comparable, nonparallel corpora was developed for a target language with limited linguistic resources. We combined approaches from previous research, each of which concentrated on only one of context extraction, clustering techniques, or the use of part-of-speech tags to define the different senses of a word. The domain-specific corpus for the source language contains 381,553 English words, while the corpus for the target language, which has minimal language resources, contains 92,610 Tagalog words, with 4,817 and 3,421 distinct root words, respectively. Despite the limited size of the corpora (400k vs. Sadat's (2003) 39M-word corpora) and of the seed lexicon (9,026 entries vs. Rapp's (1999) 16,380 entries), the evaluation yielded promising results. The 50 high-frequency and 50 low-frequency words yielded recall values of 50.29% and 31.37% and precision values of 56.12% and 21.98%, respectively, which are within the range of values from previous studies, 39-84.45% (Koehn et al., 2002 and Zhou et al., 2001). Ranking improved the overall F-measure from 7.32% to 10.65%. © 2007 by Eileen Pamela Tiu and Rachel Edita O. Roxas. |