Dice's coefficient on trigram profiles as metric for language similarity
In this study, we present Dice's coefficient on trigram profiles as a metric for language similarity. As a testbed, we focused on eight Philippine languages. No known language similarity value for these languages exists. Documents containing transcribed audio recordings, news articles, religious an...
Main Authors: | Oco, Nathaniel; Syliongka, Leif Romeritch; Roxas, Rachel Edita; Ilao, Joel P. |
---|---|
Format: | text |
Published: | Animo Repository, 2013 |
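Subjects: | Computational linguistics; Similarity (Language learning); Philippine languages—Data processing; Computer Sciences |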
Online Access: | https://animorepository.dlsu.edu.ph/faculty_research/2737 |
Institution: | De La Salle University |
id |
oai:animorepository.dlsu.edu.ph:faculty_research-3736 |
---|---|
record_format |
eprints |
spelling |
oai:animorepository.dlsu.edu.ph:faculty_research-3736 2022-07-20T02:59:55Z Dice's coefficient on trigram profiles as metric for language similarity Oco, Nathaniel Syliongka, Leif Romeritch Roxas, Rachel Edita Ilao, Joel P. In this study, we present Dice's coefficient on trigram profiles as a metric for language similarity. As a testbed, we focused on eight Philippine languages. No known language similarity value for these languages exists. Documents containing transcribed audio recordings, news articles, and religious and literary texts were taken from an online corpus and used as training data. Character trigram profiles were then generated using an n-gram generator, and language similarity was computed. The results were matched against those reported in the literature and against the language family tree. To evaluate the metric, it was applied to five languages with known similarity values. The results were then compared with an existing lexical similarity metric. The average difference is 27%. Analyses of the results reveal that phonetic spelling plays an important role in language similarity. As future work, the metric can be used on phonetic transcriptions. © 2013 IEEE. 2013-12-01T08:00:00Z text https://animorepository.dlsu.edu.ph/faculty_research/2737 Faculty Research Work Animo Repository Computational linguistics Similarity (Language learning) Philippine languages—Data processing Computer Sciences |
institution |
De La Salle University |
building |
De La Salle University Library |
continent |
Asia |
country |
Philippines |
content_provider |
De La Salle University Library |
collection |
DLSU Institutional Repository |
topic |
Computational linguistics Similarity (Language learning) Philippine languages—Data processing Computer Sciences |
description |
In this study, we present Dice's coefficient on trigram profiles as a metric for language similarity. As a testbed, we focused on eight Philippine languages. No known language similarity value for these languages exists. Documents containing transcribed audio recordings, news articles, and religious and literary texts were taken from an online corpus and used as training data. Character trigram profiles were then generated using an n-gram generator, and language similarity was computed. The results were matched against those reported in the literature and against the language family tree. To evaluate the metric, it was applied to five languages with known similarity values. The results were then compared with an existing lexical similarity metric. The average difference is 27%. Analyses of the results reveal that phonetic spelling plays an important role in language similarity. As future work, the metric can be used on phonetic transcriptions. © 2013 IEEE. |
format |
text |
author |
Oco, Nathaniel Syliongka, Leif Romeritch Roxas, Rachel Edita Ilao, Joel P. |
author_sort |
Oco, Nathaniel |
title |
Dice's coefficient on trigram profiles as metric for language similarity |
publisher |
Animo Repository |
publishDate |
2013 |
url |
https://animorepository.dlsu.edu.ph/faculty_research/2737 |
_version_ |
1738854831935717376 |
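
The description in this record outlines a two-step method: build a character trigram profile for each language, then compare profiles with Dice's coefficient, 2|A ∩ B| / (|A| + |B|). The sketch below illustrates that idea only. The record does not say how the authors built or truncated their profiles, so the function names `trigram_profile` and `dice_coefficient`, the optional `top_n` cutoff, and the lowercase and whitespace normalization are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter


def trigram_profile(text, top_n=None):
    """Return the set of character trigrams found in `text`.

    Assumption: the text is lowercased and runs of whitespace are
    collapsed to single spaces before trigrams are extracted.
    If `top_n` is given, only the `top_n` most frequent trigrams are kept.
    """
    normalized = " ".join(text.lower().split())
    counts = Counter(
        normalized[i:i + 3] for i in range(len(normalized) - 2)
    )
    if top_n is None:
        return set(counts)
    return {gram for gram, _ in counts.most_common(top_n)}


def dice_coefficient(profile_a, profile_b):
    """Dice's coefficient between two sets: 2|A & B| / (|A| + |B|)."""
    if not profile_a and not profile_b:
        return 0.0  # convention chosen here for two empty profiles
    shared = len(profile_a & profile_b)
    return 2 * shared / (len(profile_a) + len(profile_b))


# Toy strings, not real language data: "abcd" yields {"abc", "bcd"} and
# "bcde" yields {"bcd", "cde"}, so one of the four trigrams is shared.
print(dice_coefficient(trigram_profile("abcd"), trigram_profile("bcde")))  # 0.5
```

In practice each profile would be built from a full training corpus per language rather than a short string, and a frequency cutoff such as `top_n` is one plausible way to keep profiles comparable in size.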