Use of word and character N-grams for low-resourced local languages

Bibliographic Details
Main Authors:	Regalado, Ralph Vincent; Agarap, Abien Fred; Baliber, Renz Iver; Yambao, Arian; Cheng, Charibeth
Format:	text
Published:	Animo Repository 2019
Subjects:	Natural language processing (Computer science); Machine learning; Computer Sciences
Online Access:	https://animorepository.dlsu.edu.ph/faculty_research/3924
Institution:	De La Salle University

Description
Summary:	Language identification is a text classification task that determines the language of a given text. Several works use it as a preprocessing step before sentiment analysis, mood analysis, and named entity recognition, among others. Building an accurate language identification engine is therefore important, given that the Philippines is home to more than 170 languages yet has few language documents and resources. We compare machine learning algorithms such as Naive Bayes, Linear Support Vector Machines (SVM), and Random Forest for the classification of Philippine languages. Results show that the Linear SVM model performed best, with an F1-score of 0.97. © 2018 IEEE.
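
Illustration:	The summary names character N-grams and Naive Bayes but gives no implementation details. The following is a minimal pure-Python sketch of that general idea, not the authors' code: a multinomial Naive Bayes classifier with add-one smoothing over character 1- to 3-grams. The names char_ngrams and NgramNaiveBayes, the N-gram range, and the handful of Tagalog and Cebuano phrases are assumptions chosen for illustration only; they are not the paper's features, settings, or dataset.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """Return character n-grams of length n_min..n_max from lowercased text.

    The text is padded with one space on each side so that word-boundary
    n-grams (e.g. " m", "g ") are captured as well.
    """
    padded = f" {text.lower()} "
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


class NgramNaiveBayes:
    """Hypothetical multinomial Naive Bayes over character n-grams."""

    def fit(self, texts, labels):
        self.ngram_counts = defaultdict(Counter)  # label -> n-gram counts
        self.doc_counts = Counter(labels)         # label -> number of texts
        for text, label in zip(texts, labels):
            self.ngram_counts[label].update(char_ngrams(text))
        self.vocab = {g for c in self.ngram_counts.values() for g in c}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        # N-grams never seen in training carry no evidence, so skip them.
        grams = [g for g in char_ngrams(text) if g in self.vocab]
        best_label, best_score = None, -math.inf
        for label, counts in self.ngram_counts.items():
            total = sum(counts.values())
            # Log prior plus add-one (Laplace) smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            for g in grams:
                score += math.log((counts[g] + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training phrases (illustrative only, not the paper's data).
train_texts = [
    "magandang umaga", "maraming salamat", "magandang hapon",  # Tagalog
    "maayong buntag", "daghang salamat", "maayong hapon",      # Cebuano
]
train_labels = ["tagalog"] * 3 + ["cebuano"] * 3

model = NgramNaiveBayes().fit(train_texts, train_labels)
print(model.predict("magandang gabi"))
print(model.predict("maayong gabii"))
```

Character N-grams let the classifier pick up sub-word cues such as "aay" or "aga" that separate closely related languages even in very short texts, which is one plausible reason to pair them with word N-grams when training data is scarce.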