Use of word and character N-grams for low-resourced local languages
Language identification is a text classification task for identifying the language of a given text. Several works use this as a preprocessing technique prior to sentiment analysis, mood analysis, and named entity recognition among others. Thus, building an accurate language identification engine is...
Main Authors: | Regalado, Ralph Vincent; Agarap, Abien Fred; Baliber, Renz Iver; Yambao, Arian; Cheng, Charibeth |
---|---|
Format: | text |
Published: | Animo Repository, 2019 |
Subjects: | Natural language processing (Computer science); Machine learning; Computer Sciences |
Online Access: | https://animorepository.dlsu.edu.ph/faculty_research/3924 |
Institution: | De La Salle University |
id |
oai:animorepository.dlsu.edu.ph:faculty_research-4907 |
---|---|
record_format |
eprints |
spelling |
oai:animorepository.dlsu.edu.ph:faculty_research-4907 2021-07-29T02:49:24Z Use of word and character N-grams for low-resourced local languages Regalado, Ralph Vincent Agarap, Abien Fred Baliber, Renz Iver Yambao, Arian Cheng, Charibeth Language identification is a text classification task for identifying the language of a given text. Several works use this as a preprocessing technique prior to sentiment analysis, mood analysis, and named entity recognition among others. Thus, building an accurate language identification engine is important given that the Philippines is home to more than 170 languages, and is scarce of language documents and resources. We compare machine learning algorithms such as Naive Bayes, Linear Support Vector Machines (SVM), and Random Forest for classification of Philippine languages. Results show that the Linear SVM model had the best performance with 0.97 F1-score. © 2018 IEEE. 2019-01-28T08:00:00Z text https://animorepository.dlsu.edu.ph/faculty_research/3924 info:doi/10.1109/IALP.2018.8629235 Faculty Research Work Animo Repository Natural language processing (Computer science) Machine learning Computer Sciences |
institution |
De La Salle University |
building |
De La Salle University Library |
continent |
Asia |
country |
Philippines |
content_provider |
De La Salle University Library |
collection |
DLSU Institutional Repository |
topic |
Natural language processing (Computer science); Machine learning; Computer Sciences |
spellingShingle |
Natural language processing (Computer science) Machine learning Computer Sciences Regalado, Ralph Vincent Agarap, Abien Fred Baliber, Renz Iver Yambao, Arian Cheng, Charibeth Use of word and character N-grams for low-resourced local languages |
description |
Language identification is a text classification task that determines the language of a given text. Several works use it as a preprocessing step before sentiment analysis, mood analysis, and named entity recognition, among others. Building an accurate language identification engine is therefore important, given that the Philippines is home to more than 170 languages and that language documents and resources for them are scarce. We compare machine learning algorithms such as Naive Bayes, Linear Support Vector Machines (SVM), and Random Forest for classification of Philippine languages. Results show that the Linear SVM model performed best, with an F1-score of 0.97. © 2018 IEEE. |
format |
text |
author |
Regalado, Ralph Vincent; Agarap, Abien Fred; Baliber, Renz Iver; Yambao, Arian; Cheng, Charibeth |
author_facet |
Regalado, Ralph Vincent; Agarap, Abien Fred; Baliber, Renz Iver; Yambao, Arian; Cheng, Charibeth |
author_sort |
Regalado, Ralph Vincent |
title |
Use of word and character N-grams for low-resourced local languages |
title_short |
Use of word and character N-grams for low-resourced local languages |
title_full |
Use of word and character N-grams for low-resourced local languages |
title_fullStr |
Use of word and character N-grams for low-resourced local languages |
title_full_unstemmed |
Use of word and character N-grams for low-resourced local languages |
title_sort |
use of word and character n-grams for low-resourced local languages |
publisher |
Animo Repository |
publishDate |
2019 |
url |
https://animorepository.dlsu.edu.ph/faculty_research/3924 |
_version_ |
1767196004236394496 |
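
The description above names word and character n-grams as features and Naive Bayes as one of the compared classifiers. The record does not give the n-gram ranges, preprocessing, corpus, or hyperparameters, so the following is only a minimal sketch under assumptions: the function names, the n-gram ranges, the add-one smoothing, and the four toy sentences are all illustrative and are not taken from the paper. Naive Bayes is shown only because it can be written with the Python standard library; the record reports Linear SVM as the best-performing model.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """Character n-grams of length n_min..n_max, padded with one space per side."""
    padded = f" {text.lower().strip()} "
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]


def word_ngrams(text, n_min=1, n_max=2):
    """Word n-grams of length n_min..n_max over whitespace-separated tokens."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]


class NaiveBayesLanguageID:
    """Multinomial Naive Bayes over n-gram counts (hypothetical helper class)."""

    def __init__(self, featurize=char_ngrams, alpha=1.0):
        self.featurize = featurize  # function mapping text -> list of features
        self.alpha = alpha          # additive smoothing constant (assumed value)

    def fit(self, texts, labels):
        self.feature_counts = defaultdict(Counter)  # label -> feature counts
        self.doc_counts = Counter(labels)           # label -> number of documents
        self.vocab = set()
        for text, label in zip(texts, labels):
            feats = self.featurize(text)
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, text):
        # Features never seen in training are ignored.
        feats = [f for f in self.featurize(text) if f in self.vocab]
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, counts in self.feature_counts.items():
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(self.doc_counts[label] / total_docs)  # log prior
            for f in feats:
                score += math.log((counts[f] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data for illustration only; not the corpus used in the paper.
train_texts = [
    "magandang umaga sa inyong lahat",
    "ano ang pangalan mo",
    "maayong buntag kaninyong tanan",
    "unsa imong ngalan",
]
train_labels = ["tagalog", "tagalog", "cebuano", "cebuano"]

model = NaiveBayesLanguageID(featurize=char_ngrams).fit(train_texts, train_labels)
print(model.predict("magandang umaga"))
print(model.predict("maayong buntag"))
```

Passing `featurize=word_ngrams` switches the same classifier to word-level features. Character n-grams are often the more forgiving choice on short texts because they still match on partial words, although which feature type works better for Philippine languages is something the paper itself evaluates.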