HATPOST: Hybrid approach to Tagalog part of speech tagging


Bibliographic Details
Main Authors: Ciego, Richelle Aileen C., Uy, Zheng Zhong, Huang, Juanito, Gracia, Patricia T., Torres, Maria Francesca R.
Format: text
Language: English
Published: Animo Repository 2007
Online Access: https://animorepository.dlsu.edu.ph/etd_bachelors/11243
Institution: De La Salle University
Description
Summary: Part-of-speech (POS) tagging is the process of identifying the part of speech of each word in a text. It is used in many Natural Language Processing (NLP) applications and attempts to solve the problem of language ambiguity, the state wherein a word may have more than one meaning. Several linguistic paradigms are employed to perform POS tagging, the most common of which are the rule-based and statistical approaches. The rule-based approach tags words following the Simple Rule-Based Tagger (Brill, 1992), which makes use of patches. The statistical approach checks the context of the sentence by looking at the relation of one tag to another, using computed probability values of the possible tag sequences. The combination of two or more approaches, or the hybrid approach, allows the approaches to complement one another. The hybrid approach is implemented in Tagalog part-of-speech tagging to address the issue of language ambiguity. Because it combines the rule-based and statistical approaches, HATPOST requires large training data to generate the patches and tag sequences that aid in tagging a text. Five testing methods were conducted on HATPOST: (1) testing for every genre; (2) testing incrementally; (3) testing for every two corpora of different genres; (4) testing on test data that is not part of the training data but belongs to the same genre; and (5) testing on test data of which 95% is also training data. The corresponding accuracies are 92.47%, 76.46%, 52.58%, 61.86%, and 92.75%, respectively, for the rule-based approach and 92.62%, 78.46%, 55.42%, 64.66%, and 93.16%, respectively, after applying the statistical approach, that is, for the hybrid approach. Subtracting the rule-based results from the hybrid results gives average improvements of 0.15%, 2.00%, 2.84%, 2.80%, and 0.41%, respectively.
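The improvement figures quoted in the summary can be checked with a short calculation. The sketch below only re-derives them from the ten accuracies given in the text; the variable names are illustrative and not taken from the thesis.

```python
# Accuracies (%) as reported in the summary, in the order the five
# testing methods are listed.
rule_based = [92.47, 76.46, 52.58, 61.86, 92.75]
hybrid = [92.62, 78.46, 55.42, 64.66, 93.16]

# Improvement = hybrid accuracy minus rule-based accuracy,
# rounded to two decimals to absorb floating-point noise.
improvements = [round(h - r, 2) for h, r in zip(hybrid, rule_based)]

for method, gain in enumerate(improvements, start=1):
    print(f"testing method {method}: +{gain:.2f} percentage points")
```

The differences match the reported values (0.15, 2.00, 2.84, 2.80, 0.41), which confirms that each improvement is the hybrid figure minus the rule-based figure.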
In the hybrid approach, the first and last testing methods have the highest accuracy, while the third testing method has the lowest. High accuracy is attained, first, when the tagging data is similar to the training data in terms of content or when they belong to the same genre and, second, when the training data is larger than or about the same size as the tagging data, even if there are many unknown words. Conversely, low accuracy results when the training and the tagging data differ in size and content. HATPOST's drawback is that it cannot tag all types of named entities and cannot handle a few punctuation marks.
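To make the two stages described in the summary more concrete, here is a minimal toy sketch of a rule-based pass (initial tags corrected by a Brill-style patch) followed by a statistical pass that scores the possible tag sequences. It is not HATPOST's implementation: the tag set, lexicon, patch, probability values, and the way the two stages are combined are all assumptions made for illustration.

```python
from itertools import product
from math import prod

# Hypothetical lexicon: each word lists its possible tags, first tag = initial guess.
LEXICON = {
    "ang": ["DET"],
    "bata": ["ADJ", "NOUN"],  # "bata" can mean "young" or "child"
    "ay": ["LINK"],
    "tumakbo": ["VERB"],
}

# One Brill-style patch (assumed): change from_tag to to_tag when the previous tag matches.
PATCHES = [("ADJ", "NOUN", "DET")]

# Invented bigram probabilities P(tag | previous tag); "<s>" marks the sentence start.
TRANSITIONS = {
    ("<s>", "DET"): 0.8,
    ("DET", "NOUN"): 0.7,
    ("DET", "ADJ"): 0.2,
    ("NOUN", "LINK"): 0.6,
    ("ADJ", "LINK"): 0.1,
    ("LINK", "VERB"): 0.7,
}
UNSEEN = 1e-6  # small probability for tag pairs missing from the table


def rule_based_tag(words):
    """Assign initial tags, then apply each patch left to right."""
    tags = [LEXICON.get(w.lower(), ["NOUN"])[0] for w in words]  # unknown -> NOUN (assumption)
    for from_tag, to_tag, prev_tag in PATCHES:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return tags


def sequence_probability(tags):
    """Product of bigram transition probabilities along a tag sequence."""
    path = ["<s>"] + list(tags)
    return prod(TRANSITIONS.get(pair, UNSEEN) for pair in zip(path, path[1:]))


def hybrid_tag(words):
    """Rule-based pass first, then keep the most probable candidate tag sequence."""
    initial = rule_based_tag(words)
    # Candidates per word: its lexicon tags plus the tag chosen by the rule-based pass.
    options = [sorted(set(LEXICON.get(w.lower(), [t]) + [t]))
               for w, t in zip(words, initial)]
    return list(max(product(*options), key=sequence_probability))


sentence = ["ang", "bata", "ay", "tumakbo"]
print(rule_based_tag(sentence))
print(hybrid_tag(sentence))
```

Enumerating every candidate sequence with `itertools.product` keeps the sketch short; a practical tagger would use dynamic programming (for example, the Viterbi algorithm) rather than brute force.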