Automatic lexicon extraction from comparable, non-parallel corpora

Bibliographic Details
Main Author:	Tiu, Eileen Pamela K.
Format:	text
Language:	English
Published:	Animo Repository 2004
Subjects:	Lexicology > Data processing; Computational linguistics; Algorithms
Online Access:	https://animorepository.dlsu.edu.ph/etd_masteral/3173

Description
Summary:	An automated approach to extracting a bilingual lexicon (or dictionary) from comparable, non-parallel corpora is developed, implemented, and tested. The corpora used are drawn from bilingual domains and contain 381,553 English and 92,610 Tagalog terms, with 4,817 and 3,421 distinct root words, respectively. The terms in the resulting lexicon are grouped by sense. For the 100 test words (50 high-frequency words, HFW, and 50 low-frequency words, LFW), 50.29% (HFW) and 31.37% (LFW) of the expected translations across all clusters were generated (the recall test), and 56.12% (HFW) and 21.98% (LFW) of the expected translations within clusters were generated (the precision test). The overall result, expressed as the F-measure (a combination of recall and precision), shows that 10.65% of the expected translations for the 100 test words were generated. Including additional natural language resources (e.g., lexicon expansion to cover alternate senses, word-for-word lexicon translation, larger comparable corpora), improving preprocessing techniques (e.g., stemming and part-of-speech tagging for Tagalog), and adding other enhancements (e.g., smoothing of sparse data and disambiguation techniques) would improve the overall performance of the system.
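
The summary describes the F-measure only as "a combination of recall and precision" and does not give the formula used. As a minimal sketch, assuming the conventional weighted harmonic mean (F-beta, with beta = 1 by default), the combination could be computed as below. The function name `f_measure` and the way the per-group figures are plugged in are illustrative assumptions, not the thesis's implementation, and this sketch is not expected to reproduce the reported 10.65%, which may have been aggregated differently.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (conventional F-beta).

    Assumption: this is the textbook definition; the record does not state
    which variant the thesis used.
    """
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        # Both scores are zero, so the harmonic mean is defined as zero.
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


# Per-group figures quoted in the summary, expressed as fractions.
hfw_score = f_measure(precision=0.5612, recall=0.5029)  # high-frequency words
lfw_score = f_measure(precision=0.2198, recall=0.3137)  # low-frequency words

print(f"HFW F1: {hfw_score:.4f}")
print(f"LFW F1: {lfw_score:.4f}")
```

The harmonic mean is pulled toward the lower of the two scores, so a system cannot score well by maximizing recall alone or precision alone.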
