Use of word and character N-grams for low-resourced local languages
Language identification is a text classification task for identifying the language of a given text. Several works use this as a preprocessing technique prior to sentiment analysis, mood analysis, and named entity recognition among others. Thus, building an accurate language identification engine is...
Saved in:
Main Authors: | , , , , |
---|---|
Format: | text |
Published: |
Animo Repository
2019
|
Subjects: | |
Online Access: | https://animorepository.dlsu.edu.ph/faculty_research/3924 |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Institution: | De La Salle University |
Summary: | Language identification is a text classification task for identifying the language of a given text. Several works use this as a preprocessing technique prior to sentiment analysis, mood analysis, and named entity recognition among others. Thus, building an accurate language identification engine is important given that the Philippines is home to more than 170 languages, and is scarce of language documents and resources. We compare machine learning algorithms such as Naive Bayes, Linear Support Vector Machines (SVM), and Random Forest for classification of Philippine languages. Results show that the Linear SVM model had the best performance with 0.97 Fl-score. © 2018 IEEE. |
---|