HATPOST: Hybrid approach to Tagalog part of speech tagging

Part of speech (POS) tagging is a process of identifying the part of speech of a word in a text. It is used in many Natural Language Processing (NLP) applications. It attempts to solve the problem of language ambiguity, the state wherein a word may have more than one meaning. There are linguistic pa...

Bibliographic Details
Main Authors: Ciego, Richelle Aileen C., Uy, Zheng Zhong, Huang, Juanito, Gracia, Patricia T., Torres, Maria Francesca R.
Format: text
Language: English
Published: Animo Repository 2007
Subjects: Natural language processing (Computer science); Computer Sciences
Online Access: https://animorepository.dlsu.edu.ph/etd_bachelors/11243
Institution: De La Salle University
id oai:animorepository.dlsu.edu.ph:etd_bachelors-11888
record_format eprints
spelling oai:animorepository.dlsu.edu.ph:etd_bachelors-11888 2022-03-04T02:48:20Z HATPOST: Hybrid approach to Tagalog part of speech tagging Ciego, Richelle Aileen C. Uy, Zheng Zhong Huang, Juanito Gracia, Patricia T. Torres, Maria Francesca R. 2007-01-01T08:00:00Z text https://animorepository.dlsu.edu.ph/etd_bachelors/11243 Bachelor's Theses English Animo Repository Natural language processing (Computer science) Computer Sciences
institution De La Salle University
building De La Salle University Library
continent Asia
country Philippines
content_provider De La Salle University Library
collection DLSU Institutional Repository
language English
topic Natural language processing (Computer science)
Computer Sciences
description Part-of-speech (POS) tagging is the process of identifying the part of speech of each word in a text. It is used in many Natural Language Processing (NLP) applications and attempts to resolve language ambiguity, the state in which a word may have more than one meaning. Several linguistic paradigms are employed for POS tagging, the most common being the rule-based and statistical approaches. The rule-based approach tags words following the Simple Rule-Based Tagger (Brill, 1992), which makes use of patches. The statistical approach checks the context of the sentence by looking at the relation of one tag to another, using computed probability values of the possible tag sequences. Combining two or more approaches, the hybrid approach, allows them to complement one another. HATPOST applies the hybrid approach to Tagalog part-of-speech tagging to address language ambiguity. Because it combines the rule-based and statistical approaches, HATPOST requires large training data to generate the patches and tag sequences that aid in tagging a text. Five testing methods were conducted on HATPOST: testing (1) for every genre, (2) incrementally, (3) for every two corpora of different genres, (4) for every test data set that is not part of the training data but is under the same genre, and (5) for every test data set of which 95% is the training data. The corresponding results are 92.47%, 76.46%, 52.58%, 61.86%, and 92.75% for the rule-based approach, and 92.62%, 78.46%, 55.42%, 64.66%, and 93.16% after applying the statistical approach, that is, for the hybrid approach. Subtracting the rule-based results from the hybrid results gives average improvements of 0.15, 2.00, 2.84, 2.80, and 0.41 percentage points, respectively.
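The improvement figures can be checked directly from the accuracies reported above. The short Python sketch below only redoes that subtraction; the variable names are illustrative and are not taken from the thesis.

```python
# Accuracy (percent) reported in the abstract, one value per testing method.
rule_based = [92.47, 76.46, 52.58, 61.86, 92.75]
hybrid = [92.62, 78.46, 55.42, 64.66, 93.16]

# Improvement = hybrid accuracy minus rule-based accuracy, in percentage points.
# Rounding to two decimals hides binary floating-point noise.
improvements = [round(h - r, 2) for h, r in zip(hybrid, rule_based)]

for method, gain in enumerate(improvements, start=1):
    print(f"testing method {method}: +{gain:.2f} percentage points")
```

The computed gains match the reported 0.15, 2.00, 2.84, 2.80, and 0.41, which confirms that the subtraction runs from hybrid to rule-based.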
In the hybrid approach, the first and last testing methods have the highest accuracy, while the third testing method has the lowest. High accuracy is attained in two cases. First is when the tagging data agrees with the training data in terms of content or when they belong to the same genre. Second is when the training data is larger than or about the same size as the tagging data, even if there are many unknown words. Conversely, low accuracy results when the training and tagging data differ in size and content. HATPOST's drawback is that it cannot tag all types of named entities and cannot handle a few punctuation marks.
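To make the rule-based-then-statistical idea concrete, here is a minimal Python sketch of a hybrid tagger. It is not HATPOST's implementation: the toy corpus, the tag names, the single patch, and the choice to let the statistical pass re-score only words missing from the lexicon are all assumptions made for illustration, since the abstract does not describe how the two stages are combined.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged toy corpus; the tagset is illustrative only.
TRAIN = [
    [("kumain", "VERB"), ("ang", "DET"), ("bata", "NOUN")],
    [("tumakbo", "VERB"), ("ang", "DET"), ("baka", "NOUN")],
    [("baka", "ADV"), ("kumain", "VERB"), ("ang", "DET"), ("bata", "NOUN")],
    [("baka", "ADV"), ("tumakbo", "VERB"), ("ang", "DET"), ("bata", "NOUN")],
]

# One Brill-style patch (assumed): change from_tag to to_tag when the
# previous tag is prev_tag.
PATCHES = [("ADV", "NOUN", "DET")]


def train(sentences):
    """Collect word->tag counts and tag-bigram counts from tagged sentences."""
    lexicon = defaultdict(Counter)
    bigrams = defaultdict(Counter)
    for sent in sentences:
        prev = "<s>"
        for word, tag in sent:
            lexicon[word][tag] += 1
            bigrams[prev][tag] += 1
            prev = tag
        bigrams[prev]["</s>"] += 1
    return lexicon, bigrams


def transition_prob(bigrams, prev, tag):
    """Relative-frequency estimate of P(tag | prev); 0.0 if prev is unseen."""
    total = sum(bigrams[prev].values())
    return bigrams[prev][tag] / total if total else 0.0


def rule_based_tag(words, lexicon, patches, default="NOUN"):
    """Most-frequent-tag baseline, then contextual patches applied in order."""
    tags = [lexicon[w].most_common(1)[0][0] if w in lexicon else default
            for w in words]
    for from_tag, to_tag, prev_tag in patches:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return tags


def hybrid_tag(words, lexicon, bigrams, patches):
    """Rule-based pass, then re-score unknown words with tag-bigram probabilities."""
    tags = rule_based_tag(words, lexicon, patches)
    tagset = sorted({t for counts in lexicon.values() for t in counts})
    for i, word in enumerate(words):
        if word in lexicon:
            continue  # known word: keep the rule-based decision
        prev = tags[i - 1] if i > 0 else "<s>"
        nxt = tags[i + 1] if i + 1 < len(tags) else "</s>"

        def score(tag):
            return (transition_prob(bigrams, prev, tag)
                    * transition_prob(bigrams, tag, nxt))

        best = max(tagset, key=score)
        if score(best) > 0:
            tags[i] = best
    return tags


lexicon, bigrams = train(TRAIN)
print(rule_based_tag("kumain ang baka".split(), lexicon, PATCHES))
# -> ['VERB', 'DET', 'NOUN']  (the patch corrects 'baka' from ADV to NOUN)
print(hybrid_tag("baka natulog ang bata".split(), lexicon, bigrams, PATCHES))
# -> ['ADV', 'VERB', 'DET', 'NOUN']  (unknown 'natulog' re-scored as VERB)
```

In this toy setup the patch handles the ambiguous word "baka" (noun "cow" or adverb "maybe"), while the tag-bigram pass recovers a tag for the unseen word "natulog", which the rule-based pass alone would leave at the default tag.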
format text
author Ciego, Richelle Aileen C.
Uy, Zheng Zhong
Huang, Juanito
Gracia, Patricia T.
Torres, Maria Francesca R.
title HATPOST: Hybrid approach to Tagalog part of speech tagging
publisher Animo Repository
publishDate 2007
url https://animorepository.dlsu.edu.ph/etd_bachelors/11243
_version_ 1728621055393660928