An Indonesian resource grammar (INDRA) : and its application to a treebank (JATI)

This dissertation describes the creation and the development of an open-source, broad-coverage Indonesian computational grammar, called Indonesian Resource Grammar (INDRA), within the framework of Head-Driven Phrase Structure Grammar (HPSG) (Pollard & Sag, 1994; Sag et al., 2003) and Minimal Recu...

Bibliographic Details
Main Author: Moeljadi, David
Other Authors: Francis Bond
Format: Theses and Dissertations
Language: English
Published: 2018
Subjects:
Online Access: https://hdl.handle.net/10356/82540
http://hdl.handle.net/10220/46580
Institution: Nanyang Technological University
id sg-ntu-dr.10356-82540
record_format dspace
institution Nanyang Technological University
building NTU Library
country Singapore
collection DR-NTU
language English
topic DRNTU::Humanities::Linguistics
spellingShingle DRNTU::Humanities::Linguistics
Moeljadi, David
An Indonesian resource grammar (INDRA) : and its application to a treebank (JATI)
description This dissertation describes the creation and development of an open-source, broad-coverage Indonesian computational grammar, called the Indonesian Resource Grammar (INDRA), within the framework of Head-Driven Phrase Structure Grammar (HPSG) (Pollard & Sag, 1994; Sag et al., 2003) and Minimal Recursion Semantics (MRS) (Copestake et al., 2005), using computational tools and resources developed by the DEep Linguistic Processing with HPSG INitiative (DELPH-IN) research consortium. As a resource grammar, INDRA was employed to build an open-source treebank, called JATI. The research on INDRA and its application to JATI was conducted over four years, from January 2014 to January 2018, during my PhD candidature. Previous work on the computational grammar of Indonesian has mainly been done in the framework of Lexical-Functional Grammar (LFG) (Kaplan & Bresnan, 1982; Dalrymple, 2001), such as Arka (2010a) and Mistica (2013). A computational grammar of Indonesian called IndoGram (Arka, 2012) was developed within the LFG-based Parallel Grammar (ParGram) framework, using the Xerox Linguistic Environment (XLE) parser. To the best of my knowledge, no work on Indonesian HPSG has been done. Thus, the development of INDRA can also function as an investigation of the cross-linguistic potency of HPSG and MRS. The approach taken is corpus-driven. The scope covers the analysis and computational implementation of some basic Indonesian constructions and some phenomena found in Indonesian text drawn from two sources: the Nanyang Technological University Multilingual Corpus (NTU-MC) (Tan & Bond, 2012) and definition sentences in the fifth edition of Kamus Besar Bahasa Indonesia (KBBI) (Amalia, 2016). The latter contains 2,003 sentences and was treebanked as JATI.
The lexicon was semi-automatically acquired from various sources: the English Resource Grammar (ERG) (Copestake & Flickinger, 2000) via Wordnet Bahasa (Nurril Hirfana Mohamed Noor et al., 2011; Bond et al., 2014), the NTU-MC, and the KBBI definition sentence corpus. Coverage, i.e. the quality and quantity of the corpus sentences parsed by the grammar, is evaluated using test suites. INDRA can parse and generate complex noun phrases with clitics, determiners, numerals, classifiers, and defining relative clauses; verb phrases with auxiliaries and voice markers; major copula constructions; compounds; coordination of words and phrases with the same part-of-speech; and subordination. However, at the time of submission, INDRA still cannot handle phenomena such as equative, comparative, and superlative adjective phrases; coordination of words and phrases of different parts-of-speech; possessor topic-comment relative clauses with more than one comment; imperatives; and constructions with wh-question words. These are left for future work. Despite its limitations, compared with IndoGram, INDRA gives more precise analyses for some phenomena and has fifteen times more sentences in its open-source treebank. In addition, INDRA has the potential to be used in various applications such as multilingual machine translation and computer-assisted language learning. Since INDRA is developed in the DELPH-IN community along with other grammars such as the English Resource Grammar (ERG) (Flickinger et al., 2010) using the same semantics (MRS), a semantic-transfer-based machine translation system can be easily built. In summary, INDRA serves as the first open-source computational grammar for Indonesian that covers most of the important constructions. INDRA has reached a stage at which it can be applied to various applications such as treebanking, machine translation, and computer-assisted language learning.
author2 Francis Bond
author_facet Francis Bond
Moeljadi, David
format Theses and Dissertations
author Moeljadi, David
author_sort Moeljadi, David
title An Indonesian resource grammar (INDRA) : and its application to a treebank (JATI)
title_sort indonesian resource grammar (indra) : and its application to a treebank (jati)
publishDate 2018
url https://hdl.handle.net/10356/82540
http://hdl.handle.net/10220/46580
_version_ 1681056986673709056
spelling sg-ntu-dr.10356-82540 2020-10-15T06:30:44Z An Indonesian resource grammar (INDRA) : and its application to a treebank (JATI) Moeljadi, David Francis Bond School of Humanities DRNTU::Humanities::Linguistics
Doctor of Philosophy 2018-11-07T13:14:55Z 2019-12-06T14:57:36Z 2018-11-07T13:14:55Z 2019-12-06T14:57:36Z 2018 Thesis Moeljadi, D. (2018). An Indonesian resource grammar (INDRA) : and its application to a treebank (JATI). Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/82540 http://hdl.handle.net/10220/46580 10.32657/10220/46580 en 293 p. application/pdf