INDONESIAN SHIFT-REDUCE CONSTITUENCY PARSER USING FEATURE TEMPLATES & BEAM SEARCH STRATEGY

In natural language processing, the syntactic analysis stage is a basic step aimed at understanding the context of a sentence in a natural language. One way to analyze syntax is to decompose a sentence into its constituents through constituency parsing. Constituency parsing is...

Bibliographic Details
Main Author: SEBASTIAN HERLIM, ROBERT (NIM: 13514061)
Format: Final Project
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/30649
Institution: Institut Teknologi Bandung
id id-itb.:30649
spelling id-itb.:30649 2018-07-02T08:48:25Z INDONESIAN SHIFT-REDUCE CONSTITUENCY PARSER USING FEATURE TEMPLATES & BEAM SEARCH STRATEGY SEBASTIAN HERLIM, ROBERT (NIM: 13514061) Indonesia Final Project INSTITUT TEKNOLOGI BANDUNG https://digilib.itb.ac.id/gdl/view/30649 text

In natural language processing, the syntactic analysis stage is a basic step aimed at understanding the context of a sentence in a natural language. One way to analyze syntax is constituency parsing, which decomposes a sentence into its constituents and extracts the phrase information it contains. In this undergraduate thesis, a constituency parser is built with the shift-reduce technique, using the new INACL treebank (11,356 constituent trees for training and 4,457 for evaluation). Because the INACL treebank still contains a number of part-of-speech labeling errors, the corpus is also corrected using variation detection techniques. The thesis introduces two modifications to the constituency parsing technique: one in the binarization process and one in the form of a feature multiplication factor. The binarization modification is inspired by chunking variations in named-entity recognition, while the feature multiplication factor is an attempt to improve the parser's scoring system.

The parsing technique used is transition-based constituency parsing with feature templates and a beam search strategy. In this technique, parsing is treated as a search for the state with the best score. To keep the search efficient, the search space is limited by holding only the k best states on an agenda. Each state is scored with a structured model trained with the perceptron algorithm. Features for learning are extracted from the state using a number of feature templates. The output of the learned model is a sequence of shift-reduce actions that can be converted into a constituent tree.

The methods undertaken in this thesis include preparation and correction of the corpus, solution design, implementation, experimentation, and analysis of parsing results. The experimental stage aims to find the best setting for each of 8 configurations that affect parser performance: training corpus used, number of iterations, binarization method, n-best parse, Zhu additional features, feature multiplication factor, head word selection rule, and amount of training data. The error analysis stage identifies weaknesses of the parser so that it can be improved in further research.

From the experiments conducted, the best configuration obtained was: the part-of-speech-corrected corpus, 10 learning iterations, the Inner-Outer-End (IOE) binarization technique, 1-best parse, a multiplication factor for the bigram and trigram feature templates, the always-right-node rule for head word selection, and the smaller rather than the larger training set. Evaluation on the INACL treebank gave an F1-score of 50.3%, lower than the 57.5% obtained by the Stanford Parser. Evaluation on the IDN-Treebank gave an F1-score of 74.0%, also lower than the 74.91% obtained by the Trance Parser.
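The search procedure described in the abstract (an agenda holding the k best states, each scored by a linear model over feature templates, producing a sequence of shift-reduce actions) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the three feature templates, the function names, the example sentence, and the weights are all hypothetical, and the weights are set by hand here rather than learned with the perceptron algorithm.

```python
# Illustrative sketch only: names, feature templates, and weights are
# hypothetical and are not taken from the thesis.

def features(stack, queue):
    """Three toy feature templates over the stack top (s0, s1) and queue front (q0)."""
    s0 = stack[-1][0] if stack else "<none>"
    s1 = stack[-2][0] if len(stack) > 1 else "<none>"
    q0 = queue[0][0] if queue else "<none>"
    return [f"s0={s0}", f"s0={s0}|s1={s1}", f"s0={s0}|q0={q0}"]


def action_score(weights, feats, action):
    """Linear model: sum the weights of the (feature, action) pairs that fire."""
    return sum(weights.get((f, action), 0.0) for f in feats)


def beam_parse(tagged, weights, labels, k=4):
    """Beam search over shift-reduce states; returns (score, tree, actions) or None."""
    agenda = [(0.0, (), 0, ())]  # (score, stack, next token index, actions so far)
    finished = []
    while agenda:
        candidates = []
        for score, stack, i, actions in agenda:
            if i == len(tagged) and len(stack) == 1:
                finished.append((score, stack[0], actions))  # complete parse
                continue
            feats = features(stack, tagged[i:])
            if i < len(tagged):  # SHIFT: move the next (tag, word) onto the stack
                s = score + action_score(weights, feats, "SHIFT")
                candidates.append((s, stack + (tagged[i],), i + 1, actions + ("SHIFT",)))
            if len(stack) >= 2:  # binary REDUCE-X: join the top two nodes under label X
                for label in labels:
                    act = f"REDUCE-{label}"
                    s = score + action_score(weights, feats, act)
                    node = (label, stack[-2], stack[-1])
                    candidates.append((s, stack[:-2] + (node,), i, actions + (act,)))
        # Keep only the k best-scoring states on the agenda.
        agenda = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
    return max(finished, key=lambda f: f[0]) if finished else None


def to_bracket(node):
    """Render a tree as a bracketed string; leaves are (tag, word) pairs."""
    if isinstance(node[1], str):
        return f"({node[0]} {node[1]})"
    return "(" + node[0] + " " + " ".join(to_bracket(c) for c in node[1:]) + ")"


# Hypothetical tagged sentence and hand-set weights (assumptions for the demo).
sentence = [("NNP", "Budi"), ("VB", "membaca"), ("NN", "buku")]
weights = {
    ("s0=VB|q0=NN", "SHIFT"): 1.0,
    ("s0=NN|s1=VB", "REDUCE-VP"): 1.0,
    ("s0=VP|s1=NNP", "REDUCE-S"): 1.0,
}
score, tree, actions = beam_parse(sentence, weights, labels=["S", "VP"], k=4)
print(to_bracket(tree))
print(actions)
```

With these hand-set weights the best-scoring action sequence is SHIFT, SHIFT, SHIFT, REDUCE-VP, REDUCE-S, which converts to the tree `(S (NNP Budi) (VP (VB membaca) (NN buku)))`. In a trained parser the weights would instead come from perceptron updates over treebank action sequences, and the binary reductions would be undone by reversing whatever binarization scheme was applied to the training trees.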
institution Institut Teknologi Bandung
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
language Indonesia
format Final Project
author SEBASTIAN HERLIM, ROBERT (NIM: 13514061)
title INDONESIAN SHIFT-REDUCE CONSTITUENCY PARSER USING FEATURE TEMPLATES & BEAM SEARCH STRATEGY
title_sort indonesian shift-reduce constituency parser using feature templates & beam search strategy
url https://digilib.itb.ac.id/gdl/view/30649
_version_ 1822923335181271040