Automatic lexicon extraction from comparable, non-parallel corpora

An automated approach to extracting a bilingual lexicon (or dictionary) from comparable, non-parallel corpora is developed, implemented, and tested. The corpora used are of bilingual domains containing 381,553 English and 92,610 Tagalog terms, with corresponding 4,817 and 3,421 distinct root words, res...

Bibliographic Details
Main Author:	Tiu, Eileen Pamela K.
Format:	text
Language:	English
Published:	Animo Repository 2004
Subjects:	Lexicology > Data processing; Computational linguistics; Algorithms
Online Access:	https://animorepository.dlsu.edu.ph/etd_masteral/3173
Institution:	De La Salle University

id	oai:animorepository.dlsu.edu.ph:etd_masteral-10011
record_format	eprints
spelling	oai:animorepository.dlsu.edu.ph:etd_masteral-100112023-05-24T03:38:00Z Automatic lexicon extraction from comparable, non-parallel corpora Tiu, Eileen Pamela K. An automated approach of extracting bilingual lexicon (or dictionary) from comparable, non-parallel corpora is developed, implemented and tested. The corpora used are of bilingual domains containing 381,553 English and 92,610 Tagalog terms, with corresponding 4,817 and 3,421 distinct root words, respectively. The terms in the resulting lexicon are grouped into their respective senses. For the 100 test words (50 high frequency words, HFW and 50 low frequency words, LFW), 50.29% (HFW) and 31.37% (LFW) of the expected translations in all clusters were generated (called recall test). 56.12% (HFW) and 21.98% (LFW) of the expected translations within clusters were generated (called precision test). The overall results represented by the F-measure (a combination of recall and precision), show that 10.65% of the expected translations for the 100 test words were generated. Inclusion of several natural language resources (e.g. lexicon expansion to include alternate senses, word per word lexicon translation, larger comparable corpora), improvement of preprocessing techniques (e.g. stemming and part of speech tagging for Tagalog), and other enhancements (e.g. smoothing of sparse data and disambiguation techniques) would improve the overall performance of the system. 2004-01-01T08:00:00Z text https://animorepository.dlsu.edu.ph/etd_masteral/3173 Master's Theses English Animo Repository Lexicology--Data processing Computational linguistics Algorithms
institution	De La Salle University
building	De La Salle University Library
continent	Asia
country	Philippines
content_provider	De La Salle University Library
collection	DLSU Institutional Repository
language	English
topic	Lexicology--Data processing Computational linguistics Algorithms
spellingShingle	Lexicology--Data processing Computational linguistics Algorithms Tiu, Eileen Pamela K. Automatic lexicon extraction from comparable, non-parallel corpora
description	An automated approach to extracting a bilingual lexicon (or dictionary) from comparable, non-parallel corpora is developed, implemented, and tested. The corpora used are of bilingual domains containing 381,553 English and 92,610 Tagalog terms, with corresponding 4,817 and 3,421 distinct root words, respectively. The terms in the resulting lexicon are grouped into their respective senses. For the 100 test words (50 high-frequency words, HFW, and 50 low-frequency words, LFW), 50.29% (HFW) and 31.37% (LFW) of the expected translations in all clusters were generated (the recall test), and 56.12% (HFW) and 21.98% (LFW) of the expected translations within clusters were generated (the precision test). The overall results, represented by the F-measure (a combination of recall and precision), show that 10.65% of the expected translations for the 100 test words were generated. Inclusion of several natural language resources (e.g. lexicon expansion to include alternate senses, word-for-word lexicon translation, larger comparable corpora), improvement of preprocessing techniques (e.g. stemming and part-of-speech tagging for Tagalog), and other enhancements (e.g. smoothing of sparse data and disambiguation techniques) would improve the overall performance of the system.
format	text
author	Tiu, Eileen Pamela K.
author_facet	Tiu, Eileen Pamela K.
author_sort	Tiu, Eileen Pamela K.
title	Automatic lexicon extraction from comparable, non-parallel corpora
title_short	Automatic lexicon extraction from comparable, non-parallel corpora
title_full	Automatic lexicon extraction from comparable, non-parallel corpora
title_fullStr	Automatic lexicon extraction from comparable, non-parallel corpora
title_full_unstemmed	Automatic lexicon extraction from comparable, non-parallel corpora
title_sort	automatic lexicon extraction from comparable, non-parallel corpora
publisher	Animo Repository
publishDate	2004
url	https://animorepository.dlsu.edu.ph/etd_masteral/3173
_version_	1767197069388283904
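
The description above evaluates the extracted lexicon with recall, precision, and an F-measure that combines the two. As a reading aid, the sketch below shows the common set-based definitions and the balanced F-measure (the harmonic mean of precision and recall). It is an assumption that these textbook definitions are close to what the thesis used: the record does not give the formulas, and the thesis computes its scores over sense clusters, so this sketch is not expected to reproduce the reported 10.65%. The function name `precision_recall_f1` and the sample Tagalog candidates are hypothetical and are not taken from the thesis data.

```python
def precision_recall_f1(expected, generated):
    """Score a set of generated translations against the expected ones.

    Textbook definitions (assumed, not confirmed by the record):
      precision = correct generated translations / all generated translations
      recall    = correct generated translations / all expected translations
      F1        = harmonic mean of precision and recall
    """
    expected, generated = set(expected), set(generated)
    hits = len(expected & generated)

    precision = hits / len(generated) if generated else 0.0
    recall = hits / len(expected) if expected else 0.0

    # Avoid dividing by zero when nothing correct was generated.
    if precision + recall == 0:
        return precision, recall, 0.0

    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical test word with made-up expected and generated translations.
expected = {"bahay", "tahanan"}
generated = {"bahay", "tirahan", "gusali"}

p, r, f = precision_recall_f1(expected, generated)
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, a system cannot earn a high F-measure by generating many candidates (high recall) if most of them are wrong (low precision).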

Similar Items