Statistics-based rule generation for Filipino style and grammar checking

Current research works in the area of corpus and computational linguistics are now data-driven. When dealing with data, there is a need to check sentences for variations and inconsistencies. Style and grammar checkers can be used for this purpose. However, recent technologies rely on manually develo...

Bibliographic Details
Main Author: Oco, Nathaniel A.
Format: text
Language: English
Published: Animo Repository 2014
Online Access: https://animorepository.dlsu.edu.ph/etd_masteral/4610
Institution: De La Salle University
id oai:animorepository.dlsu.edu.ph:etd_masteral-11448
record_format eprints
spelling oai:animorepository.dlsu.edu.ph:etd_masteral-11448 2024-04-17T02:12:25Z Statistics-based rule generation for Filipino style and grammar checking Oco, Nathaniel A.
2014-01-01T08:00:00Z text https://animorepository.dlsu.edu.ph/etd_masteral/4610 Master's Theses English Animo Repository
institution De La Salle University
building De La Salle University Library
continent Asia
country Philippines
content_provider De La Salle University Library
collection DLSU Institutional Repository
language English
description Current research in corpus and computational linguistics is now data-driven. When dealing with data, there is a need to check sentences for variations and inconsistencies. Style and grammar checkers can be used for this purpose. However, recent technologies rely on manually developed rules, and writing them is a time-consuming, herculean task. This paper presents a statistics-based rule generation framework that can be used to learn spelling variations, affix usage, and common mistakes. As its domain, this research focuses on Filipino, a language with a high degree of inflection. Monolingual corpora, annotated documents, and tagged data were collected. The monolingual corpus was modeled, and machine learning was used to aid in detecting spelling variations; the tagged data was processed, and data association was applied to determine affix usage; and a subset of the annotated documents was digitized and used as training data for a statistical machine translation engine to determine common mistakes. A total of 396 variant pairs, 16 affix usages, and 22 phrase pairs were generated and transformed into rules. A subset of these linguistic phenomena has been reported in the literature, an indication that the framework can be used to automate linguistic tasks. The proposed variant scoring matches the style proposed by the Sentro ng Wikang Filipino (SWF) with 30% recall and the style proposed by the Komisyon sa Wikang Filipino (KWF) with 60% recall, an indication that the KWF style is more closely aligned with the variant scoring. As future work, a policy paper could be drafted in coordination with experts in language planning.
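The recall figures quoted in the description can be illustrated with a minimal sketch. It assumes, purely for illustration, that variant scoring prefers whichever spelling is more frequent in the corpus and that recall is the share of a style guide's choices the scoring reproduces; the record does not state the thesis's actual scoring method. The function names, the toy token list, and the guide choices below are all hypothetical and are not taken from the thesis, SWF, or KWF.

```python
from collections import Counter


def preferred_variants(variant_pairs, counts):
    """Pick, for each variant pair, the spelling seen more often in the corpus.

    Assumption: frequency-based scoring, used here only as a stand-in
    for the (unspecified) variant scoring described in the abstract.
    """
    preferred = {}
    for a, b in variant_pairs:
        # Ties fall back to the first member of the pair.
        preferred[frozenset((a, b))] = a if counts[a] >= counts[b] else b
    return preferred


def recall(preferred, guide_choices):
    """Share of the style guide's choices that the scoring reproduces."""
    if not guide_choices:
        return 0.0
    hits = sum(
        1 for pair, choice in guide_choices.items()
        if preferred.get(pair) == choice
    )
    return hits / len(guide_choices)


# Toy corpus and hypothetical guide choices (illustrative data only).
tokens = "politika politika pulitika kompanya kumpanya kumpanya".split()
counts = Counter(tokens)
pairs = [("politika", "pulitika"), ("kompanya", "kumpanya")]
guide = {
    frozenset(("politika", "pulitika")): "politika",
    frozenset(("kompanya", "kumpanya")): "kompanya",
}

preferred = preferred_variants(pairs, counts)
score = recall(preferred, guide)
print(f"recall = {score:.2f}")
```

On this toy data the scoring agrees with the hypothetical guide on one of the two pairs, so the recall is 0.50. A higher recall against one guide than another would be read, as in the description, as that guide's style being closer to the scoring.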
format text
author Oco, Nathaniel A.
spellingShingle Oco, Nathaniel A.
Statistics-based rule generation for Filipino style and grammar checking
author_facet Oco, Nathaniel A.
author_sort Oco, Nathaniel A.
title Statistics-based rule generation for Filipino style and grammar checking
title_short Statistics-based rule generation for Filipino style and grammar checking
title_full Statistics-based rule generation for Filipino style and grammar checking
title_fullStr Statistics-based rule generation for Filipino style and grammar checking
title_full_unstemmed Statistics-based rule generation for Filipino style and grammar checking
title_sort statistics-based rule generation for filipino style and grammar checking
publisher Animo Repository
publishDate 2014
url https://animorepository.dlsu.edu.ph/etd_masteral/4610
_version_ 1797546143977046016