Statistics-based rule generation for Filipino style and grammar checking

Current research works in the area of corpus and computational linguistics are now data-driven. When dealing with data, there is a need to check sentences for variations and inconsistencies. Style and grammar checkers can be used for this purpose. However, recent technologies rely on manually develo...

Bibliographic Details
Main Author:	Oco, Nathaniel A.
Format:	text
Language:	English
Published:	Animo Repository 2014
Online Access:	https://animorepository.dlsu.edu.ph/etd_masteral/4610
Institution:	De La Salle University

id	oai:animorepository.dlsu.edu.ph:etd_masteral-11448
record_format	eprints
spelling	oai:animorepository.dlsu.edu.ph:etd_masteral-11448 2024-04-17T02:12:25Z Statistics-based rule generation for Filipino style and grammar checking Oco, Nathaniel A. 2014-01-01T08:00:00Z text https://animorepository.dlsu.edu.ph/etd_masteral/4610 Master's Theses English Animo Repository
institution	De La Salle University
building	De La Salle University Library
continent	Asia
country	Philippines
content_provider	De La Salle University Library
collection	DLSU Institutional Repository
language	English
description	Current research in corpus and computational linguistics is now data-driven. When dealing with data, there is a need to check sentences for variations and inconsistencies, and style and grammar checkers can be used for this purpose. However, recent technologies rely on manually developed rules, which are time-consuming and laborious to produce. This paper presents a statistics-based rule generation framework that can be used to learn spelling variations, affix usage, and common mistakes. The research focuses on the Filipino language, which is characterized by a high degree of inflection. Monolingual corpora, annotated documents, and tagged data were collected. The monolingual corpus was modeled, and machine learning was used to aid in detecting spelling variations; the tagged data was processed, and data association was applied to determine affix usage; and a subset of the annotated documents was digitized and used as training data for a statistical machine translation engine to determine common mistakes. A total of 396 variant pairs, 16 affix usages, and 22 phrase pairs were generated and transformed into rules. A subset of these linguistic phenomena has been reported in the literature, an indication that the framework can be used to automate linguistic tasks. The proposed variant scoring matches the style proposed by the Sentro ng Wikang Filipino (SWF) with 30% recall and the style proposed by the Komisyon sa Wikang Filipino (KWF) with 60% recall, an indication that the KWF style is more closely aligned with the variant scoring. As future work, a policy paper could be drafted in coordination with experts in language planning.
format	text
author	Oco, Nathaniel A.
title	Statistics-based rule generation for Filipino style and grammar checking
publisher	Animo Repository
publishDate	2014
url	https://animorepository.dlsu.edu.ph/etd_masteral/4610

