Use of word and character N-grams for low-resourced local languages

Language identification is a text classification task for identifying the language of a given text. Several works use this as a preprocessing technique prior to sentiment analysis, mood analysis, and named entity recognition among others. Thus, building an accurate language identification engine is...

Bibliographic Details
Main Authors: Regalado, Ralph Vincent; Agarap, Abien Fred; Baliber, Renz Iver; Yambao, Arian; Cheng, Charibeth
Format: text
Published: Animo Repository 2019
Subjects: Natural language processing (Computer science); Machine learning; Computer Sciences
Online Access: https://animorepository.dlsu.edu.ph/faculty_research/3924
Institution: De La Salle University
id oai:animorepository.dlsu.edu.ph:faculty_research-4907
record_format eprints
spelling oai:animorepository.dlsu.edu.ph:faculty_research-4907 2021-07-29T02:49:24Z Use of word and character N-grams for low-resourced local languages Regalado, Ralph Vincent Agarap, Abien Fred Baliber, Renz Iver Yambao, Arian Cheng, Charibeth Language identification is a text classification task for identifying the language of a given text. Several works use it as a preprocessing step before sentiment analysis, mood analysis, and named entity recognition, among others. Building an accurate language identification engine is therefore important, given that the Philippines is home to more than 170 languages and has scarce language documents and resources. We compare machine learning algorithms such as Naive Bayes, Linear Support Vector Machines (SVM), and Random Forest for the classification of Philippine languages. Results show that the Linear SVM model performed best, with an F1-score of 0.97. © 2018 IEEE. 2019-01-28T08:00:00Z text https://animorepository.dlsu.edu.ph/faculty_research/3924 info:doi/10.1109/IALP.2018.8629235 Faculty Research Work Animo Repository Natural language processing (Computer science) Machine learning Computer Sciences
institution De La Salle University
building De La Salle University Library
continent Asia
country Philippines
content_provider De La Salle University Library
collection DLSU Institutional Repository
topic Natural language processing (Computer science)
Machine learning
Computer Sciences
description Language identification is a text classification task for identifying the language of a given text. Several works use it as a preprocessing step before sentiment analysis, mood analysis, and named entity recognition, among others. Building an accurate language identification engine is therefore important, given that the Philippines is home to more than 170 languages and has scarce language documents and resources. We compare machine learning algorithms such as Naive Bayes, Linear Support Vector Machines (SVM), and Random Forest for the classification of Philippine languages. Results show that the Linear SVM model performed best, with an F1-score of 0.97. © 2018 IEEE.
format text
author Regalado, Ralph Vincent
Agarap, Abien Fred
Baliber, Renz Iver
Yambao, Arian
Cheng, Charibeth
title Use of word and character N-grams for low-resourced local languages
publisher Animo Repository
publishDate 2019
url https://animorepository.dlsu.edu.ph/faculty_research/3924
_version_ 1767196004236394496