Statistics-based rule generation for Filipino style and grammar checking
Current research works in the area of corpus and computational linguistics are now data-driven. When dealing with data, there is a need to check sentences for variations and inconsistencies. Style and grammar checkers can be used for this purpose. However, recent technologies rely on manually develo...
Main Author: | Oco, Nathaniel A. |
---|---|
Format: | text |
Language: | English |
Published: | Animo Repository, 2014 |
Online Access: | https://animorepository.dlsu.edu.ph/etd_masteral/4610 |
Institution: | De La Salle University |
id |
oai:animorepository.dlsu.edu.ph:etd_masteral-11448 |
---|---|
record_format |
eprints |
spelling |
oai:animorepository.dlsu.edu.ph:etd_masteral-11448 2024-04-17T02:12:25Z Statistics-based rule generation for Filipino style and grammar checking Oco, Nathaniel A. 2014-01-01T08:00:00Z text https://animorepository.dlsu.edu.ph/etd_masteral/4610 Master's Theses English Animo Repository |
institution |
De La Salle University |
building |
De La Salle University Library |
continent |
Asia |
country |
Philippines |
content_provider |
De La Salle University Library |
collection |
DLSU Institutional Repository |
language |
English |
description |
Current research in corpus and computational linguistics is now data-driven. When dealing with data, sentences need to be checked for variations and inconsistencies, and style and grammar checkers can be used for this purpose. However, recent technologies rely on manually developed rules, which are time-consuming and laborious to write. This paper presents a statistics-based rule generation framework that can be used to learn spelling variations, affix usage, and common mistakes. The research focuses on Filipino, a language with a high degree of inflection. Monolingual corpora, annotated documents, and tagged data were collected. The monolingual corpus was modeled, and machine learning was used to aid in detecting spelling variations; the tagged data was processed, and data association was applied to determine affix usage; and a subset of the annotated documents was digitized and used as training data for a statistical machine translation engine to determine common mistakes. A total of 396 variant pairs, 16 affix usages, and 22 phrase pairs were generated and transformed into rules. A subset of these linguistic phenomena has been reported in the literature, an indication that the framework can be used to automate linguistic tasks. The proposed variant scoring matches the style proposed by the Sentro ng Wikang Filipino (SWF) with 30% recall and the style proposed by the Komisyon sa Wikang Filipino (KWF) with 60% recall, an indication that the KWF style is more closely aligned with the variant scoring. As future work, a policy paper could be drafted in coordination with experts in language planning. |
format |
text |
author |
Oco, Nathaniel A. |
title |
Statistics-based rule generation for Filipino style and grammar checking |
publisher |
Animo Repository |
publishDate |
2014 |
url |
https://animorepository.dlsu.edu.ph/etd_masteral/4610 |