A STATISTICAL AND RULE-BASED GRAMMAR CHECKER FOR INDONESIAN TEXT
Spelling and grammar errors in Indonesian text are not an uncommon occurrence, even in formal contexts such as academic or bureaucratic documents. Meanwhile, the use of proper language is essential for expressing ideas and thoughts clearly in written text. Spelling and grammar checkers are widely-us...
Main Author: | FAHDA (NIM : 13513079), ASANILTA |
---|---|
Format: | Final Project |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/21279 |
Institution: | Institut Teknologi Bandung |
id |
id-itb.:21279 |
---|---|
spelling |
id-itb.:21279 2017-10-09T10:28:07Z A STATISTICAL AND RULE-BASED GRAMMAR CHECKER FOR INDONESIAN TEXT FAHDA (NIM : 13513079), ASANILTA Indonesia Final Project INSTITUT TEKNOLOGI BANDUNG https://digilib.itb.ac.id/gdl/view/21279 text |
institution |
Institut Teknologi Bandung |
building |
Institut Teknologi Bandung Library |
continent |
Asia |
country |
Indonesia |
content_provider |
Institut Teknologi Bandung |
collection |
Digital ITB |
language |
Indonesia |
description |
Spelling and grammar errors are common in Indonesian text, even in formal contexts such as academic or bureaucratic documents, yet proper language use is essential for expressing ideas clearly in writing. Spelling and grammar checkers are widely used tools that help detect and correct writing errors. However, there is currently no proofreading system capable of checking both spelling and grammar errors in Indonesian text. This study therefore proposes a prototype Indonesian spelling and grammar checker that combines rules with statistical methods.
The prototype currently has 38 rules, built from regular expressions, which detect, correct, and explain common errors in punctuation, word choice, and spelling. The spelling checker then examines every word, using a dictionary trie to find misspellings, Damerau-Levenshtein distance neighbors as correction candidates, and morphological analysis to process certain word forms. A Hidden Markov Model based on either bigrams or co-occurrence is used to rank and select the candidates. The grammar checker uses a trigram language model over tokens, POS tags, or phrase chunks to identify sentences with incorrect structure, according to an empirically chosen threshold value.
In experiments, the co-occurrence HMM with an emission probability weight coefficient of 0.95 and a transition probability weight coefficient of 0.05 was selected as the most suitable model for the spelling checker. For the grammar checker, the phrase chunk model that normalizes by chunk length and uses a threshold score of -0.4 gave the best results. The best-performing parameter values are applied in the final system. A document-level evaluation of the system showed an overall accuracy of 83.18%. The prototype is implemented as a web application. |
format |
Final Project |
author |
FAHDA (NIM : 13513079), ASANILTA |
title |
A STATISTICAL AND RULE-BASED GRAMMAR CHECKER FOR INDONESIAN TEXT |
title_sort |
statistical and rule-based grammar checker for indonesian text |
url |
https://digilib.itb.ac.id/gdl/view/21279 |
_version_ |
1822019454515544064 |
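
The description above names three mechanisms: Damerau-Levenshtein neighbors as correction candidates, a weighted combination of emission and transition probabilities (0.95 and 0.05), and a length-normalized score compared against a threshold of -0.4. The following is a minimal Python sketch of those ideas under stated assumptions, not the thesis implementation. All function names are hypothetical. The distance function is the restricted (optimal string alignment) variant of Damerau-Levenshtein, the candidate search is a linear scan rather than the dictionary trie the record mentions, and the record does not say how the weights are combined or how the score is normalized, so the weighted sum of log-probabilities and the mean log-probability are both assumptions.

```python
import math


def damerau_levenshtein(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ak" <-> "ka".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def candidates(word, vocabulary, max_distance=1):
    """Dictionary words within max_distance edits (linear scan, not a trie)."""
    return sorted(w for w in vocabulary if damerau_levenshtein(word, w) <= max_distance)


def weighted_score(emission_p, transition_p, w_emission=0.95, w_transition=0.05):
    """ASSUMPTION: the two weights combine log-probabilities linearly."""
    return w_emission * math.log(emission_p) + w_transition * math.log(transition_p)


def is_suspicious(log_probs, threshold=-0.4):
    """ASSUMPTION: the score is the mean log-probability of a non-empty
    sequence; the sequence is flagged when the score falls below threshold."""
    return sum(log_probs) / len(log_probs) < threshold
```

With a toy vocabulary such as `{"makan", "makna", "minum"}`, `candidates("maakn", vocabulary)` keeps only the words one edit away from the misspelling; each candidate could then be ranked by `weighted_score` using probabilities estimated from a corpus, which this sketch does not include.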