Korean jamo-level byte-pair encoding for neural machine translation

Tokenization is the very first step in most Natural Language Processing tasks, and is essential in addressing the fundamental out-of-vocabulary problem, as well as in changing the linguistic understanding. To exploit the characteristics of the Korean language for a more parameter-efficient tokenizat...

Bibliographic Details
Main Author: Lee, Junyoung
Other Authors: Wang Lipo
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2023
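Subjects: Engineering::Computer science and engineering::Computing methodologies::Document and text processing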
Online Access: https://hdl.handle.net/10356/172737
Institution: Nanyang Technological University
id sg-ntu-dr.10356-172737
record_format dspace
spelling sg-ntu-dr.10356-172737 2023-12-22T15:42:12Z Korean jamo-level byte-pair encoding for neural machine translation Lee, Junyoung Wang Lipo School of Electrical and Electronic Engineering Tokyo Institute of Technology ELPWang@ntu.edu.sg Engineering::Computer science and engineering::Computing methodologies::Document and text processing Tokenization is the very first step in most Natural Language Processing tasks, and is essential in addressing the fundamental out-of-vocabulary problem, as well as in changing the linguistic understanding. To exploit the characteristics of the Korean language for a more parameter-efficient tokenization strategy in a Neural Machine Translation pipeline, this project considers the compositional nature of Korean syllables. An alphabet-level tokenization is introduced in combination with Byte-Pair Encoding, together with a mitigation strategy to address potential invalidities in the generated sequence. Experimental results demonstrate that the proposed tokenization method shows improvements in both BLEU and chrF over syllable-based baselines on an English-to-Korean translation task. The codebase for this project is available at https://github.com/jylee-k/joeynmt/tree/ token masking. Bachelor of Engineering (Electrical and Electronic Engineering) 2023-12-19T07:01:11Z 2023-12-19T07:01:11Z 2023 Final Year Project (FYP) Lee, J. (2023). Korean jamo-level byte-pair encoding for neural machine translation. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/172737 https://hdl.handle.net/10356/172737 en application/pdf Nanyang Technological University
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering::Computing methodologies::Document and text processing
description Tokenization is the very first step in most Natural Language Processing tasks, and is essential in addressing the fundamental out-of-vocabulary problem, as well as in changing the linguistic understanding. To exploit the characteristics of the Korean language for a more parameter-efficient tokenization strategy in a Neural Machine Translation pipeline, this project considers the compositional nature of Korean syllables. An alphabet-level tokenization is introduced in combination with Byte-Pair Encoding, together with a mitigation strategy to address potential invalidities in the generated sequence. Experimental results demonstrate that the proposed tokenization method shows improvements in both BLEU and chrF over syllable-based baselines on an English-to-Korean translation task. The codebase for this project is available at https://github.com/jylee-k/joeynmt/tree/ token masking.
author2 Wang Lipo
author_facet Wang Lipo
Lee, Junyoung
format Final Year Project
author Lee, Junyoung
author_sort Lee, Junyoung
title Korean jamo-level byte-pair encoding for neural machine translation
publisher Nanyang Technological University
publishDate 2023
url https://hdl.handle.net/10356/172737
_version_ 1787136457042821120