Korean jamo-level byte-pair encoding for neural machine translation

Bibliographic Details
Main Author: Lee, Junyoung
Other Authors: Wang, Lipo
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2023
Subjects:
Online Access:https://hdl.handle.net/10356/172737
Institution: Nanyang Technological University
Description
Summary: Tokenization is the first step in most Natural Language Processing tasks and is essential both for addressing the fundamental out-of-vocabulary problem and for shaping a model's linguistic understanding. To exploit the characteristics of the Korean language for a more parameter-efficient tokenization strategy in a Neural Machine Translation pipeline, this project considers the compositional nature of Korean syllables. An alphabet-level (jamo) tokenization is introduced in combination with Byte-Pair Encoding, together with a mitigation strategy to address potential invalidities in the generated sequences. Experimental results demonstrate that the proposed tokenization method improves both BLEU and chrF over syllable-based baselines on the English-to-Korean translation task. The codebase for this project is available at https://github.com/jylee-k/joeynmt/tree/token_masking.
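The compositional structure the abstract exploits can be illustrated with a short sketch (not the project's actual code): every precomposed Hangul syllable in U+AC00..U+D7A3 encodes a lead consonant, a vowel, and an optional tail jamo arithmetically, so alphabet-level tokenization reduces each syllable to its jamo before BPE merges are learned.

```python
def decompose(syllable: str) -> str:
    """Split one precomposed Hangul syllable into its jamo.

    Non-Hangul characters are returned unchanged, so the function
    can be mapped over mixed text before applying BPE.
    """
    LEADS = [chr(0x1100 + i) for i in range(19)]        # choseong
    VOWELS = [chr(0x1161 + i) for i in range(21)]       # jungseong
    TAILS = [""] + [chr(0x11A8 + i) for i in range(27)]  # jongseong (optional)

    code = ord(syllable) - 0xAC00
    if not 0 <= code <= 0xD7A3 - 0xAC00:
        return syllable  # not a composed Hangul syllable
    lead, rest = divmod(code, 21 * 28)
    vowel, tail = divmod(rest, 28)
    return LEADS[lead] + VOWELS[vowel] + TAILS[tail]


# '한' (U+D55C) decomposes into three jamo: ㅎ + ㅏ + ㄴ
print([hex(ord(j)) for j in decompose("한")])
```

This matches Unicode NFD normalization of Hangul syllables; a real pipeline would also need the inverse (recomposition) step, which is where the invalid-sequence mitigation mentioned in the abstract comes in, since a generated jamo sequence may not map back to valid syllables.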