Korean jamo-level byte-pair encoding for neural machine translation
Tokenization is the very first step in most Natural Language Processing tasks, and is essential in addressing the fundamental out-of-vocabulary problem, as well as in changing the linguistic understanding. To exploit the characteristics of the Korean language for a more parameter-efficient tokenizat...
Main Author: | Lee, Junyoung |
---|---|
Other Authors: | Wang Lipo |
Format: | Final Year Project |
Language: | English |
Published: | Nanyang Technological University, 2023 |
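Subjects: | Engineering::Computer science and engineering::Computing methodologies::Document and text processing |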
Online Access: | https://hdl.handle.net/10356/172737 |
Institution: | Nanyang Technological University |
id |
sg-ntu-dr.10356-172737 |
---|---|
record_format |
dspace |
spelling |
sg-ntu-dr.10356-1727372023-12-22T15:42:12Z Korean jamo-level byte-pair encoding for neural machine translation Lee, Junyoung Wang Lipo School of Electrical and Electronic Engineering Tokyo Institute of Technology ELPWang@ntu.edu.sg Engineering::Computer science and engineering::Computing methodologies::Document and text processing Tokenization is the very first step in most Natural Language Processing tasks, and is essential in addressing the fundamental out-of-vocabulary problem, as well as in changing the linguistic understanding. To exploit the characteristics of the Korean language for a more parameter-efficient tokenization strategy in a Neural Machine Translation pipeline, this project considers the compositional nature of Korean syllables. An alphabet-level tokenization is introduced in combination with Byte-Pair Encoding, together with a mitigation strategy to address potential invalidities in the generated sequence. Experimental results demonstrate that the proposed tokenization method shows improvements in both BLEU and chrF compared to syllable-based baselines in an English-to-Korean translation task. The codebase for this project is available on https://github.com/jylee-k/joeynmt/tree/ token masking. Bachelor of Engineering (Electrical and Electronic Engineering) 2023-12-19T07:01:11Z 2023-12-19T07:01:11Z 2023 Final Year Project (FYP) Lee, J. (2023). Korean jamo-level byte-pair encoding for neural machine translation. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/172737 https://hdl.handle.net/10356/172737 en application/pdf Nanyang Technological University |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
Engineering::Computer science and engineering::Computing methodologies::Document and text processing |
spellingShingle |
Engineering::Computer science and engineering::Computing methodologies::Document and text processing Lee, Junyoung Korean jamo-level byte-pair encoding for neural machine translation |
description |
Tokenization is the very first step in most Natural Language Processing tasks, and is essential in addressing the fundamental out-of-vocabulary problem, as well as in changing the linguistic understanding. To exploit the characteristics of the Korean language for a more parameter-efficient tokenization strategy in a Neural Machine Translation pipeline, this project considers the compositional nature of Korean syllables. An alphabet-level tokenization is introduced in combination with Byte-Pair Encoding, together with a mitigation strategy to address potential invalidities in the generated sequence. Experimental results demonstrate that the proposed tokenization method shows improvements in both BLEU and chrF compared to syllable-based baselines in an English-to-Korean translation task.
The codebase for this project is available on https://github.com/jylee-k/joeynmt/tree/ token masking. |
author2 |
Wang Lipo |
author_facet |
Wang Lipo Lee, Junyoung |
format |
Final Year Project |
author |
Lee, Junyoung |
author_sort |
Lee, Junyoung |
title |
Korean jamo-level byte-pair encoding for neural machine translation |
title_sort |
korean jamo-level byte-pair encoding for neural machine translation |
publisher |
Nanyang Technological University |
publishDate |
2023 |
url |
https://hdl.handle.net/10356/172737 |
_version_ |
1787136457042821120 |
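
The description above refers to the compositional nature of Korean syllables and to invalid sequences that a jamo-level model may generate. The following is a minimal sketch of those two ideas only, not the project's implementation: it uses the Hangul syllable arithmetic defined in the Unicode Standard, and the function names `decompose`, `compose`, and `is_valid_jamo_sequence` are hypothetical, chosen here for illustration. Byte-Pair Encoding merges would then be learned over the decomposed sequence, a step this sketch does not cover.

```python
import unicodedata

# Constants from the Unicode Standard's Hangul syllable algorithm
# (an assumption of this sketch; not taken from the project's codebase).
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 vowel/final combinations per initial consonant
S_COUNT = L_COUNT * N_COUNT   # 11172 precomposed syllables


def decompose(text: str) -> str:
    """Split each precomposed Hangul syllable into its conjoining jamo."""
    out = []
    for ch in text:
        s_index = ord(ch) - S_BASE
        if 0 <= s_index < S_COUNT:
            out.append(chr(L_BASE + s_index // N_COUNT))               # initial consonant
            out.append(chr(V_BASE + (s_index % N_COUNT) // T_COUNT))   # vowel
            t_index = s_index % T_COUNT
            if t_index:                                                # optional final consonant
                out.append(chr(T_BASE + t_index))
        else:
            out.append(ch)  # non-Hangul characters pass through unchanged
    return "".join(out)


def compose(jamo: str) -> str:
    """Recombine conjoining jamo into syllables via NFC normalization."""
    return unicodedata.normalize("NFC", jamo)


def is_valid_jamo_sequence(jamo: str) -> bool:
    """Hypothetical check: valid if no conjoining jamo is left over after composition."""
    return all(not (0x1100 <= ord(c) <= 0x11FF) for c in compose(jamo))


if __name__ == "__main__":
    jamo = decompose("한국어")
    print([f"U+{ord(c):04X}" for c in jamo])
    print(compose(jamo))
```

A sequence such as initial consonant, vowel, vowel leaves a stray vowel after composition, so the hypothetical check flags it. This is the kind of invalid output the description mentions; the project's own mitigation strategy may work differently.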