Synthetic word embedding generation for downstream NLP task

Distributional word representations such as GloVe and BERT have garnered immense popularity and research interest in recent years due to their success in many downstream NLP applications. However, a major limitation of word embeddings is their inability to handle unknown words. To make sense of words that were not present in training, current NLP models use sub-word embeddings (obtained via sub-word segmentation algorithms). However, this approach often fails to capture the semantic sense of a word, because words are broken down syntactically rather than semantically. Other approaches tackle embeddings of unknown words using ConceptNet and Recursive Neural Networks, but these did not enjoy much usage owing to their design complexity. This report presents a novel solution for generating embeddings of out-of-vocabulary (OOV) words using a neural rather than symbolic approach. The approach capitalizes on the semantics already captured in known words' embeddings and trains a simple feed-forward neural network to capture the compositionality function of embeddings in their latent space. Linguistic studies have shown that the compositionality function is broad and varied, so this report presents a preliminary study into the compositionality of nouns, with a focus on certain named entities. The trained network generates an embedding for an unknown word based on its context words, which can be obtained by crawling web data. This synthetic embedding can then be incorporated into the embedding matrix of an existing application. From experiments on GloVe and RoBERTa embeddings, it can be concluded that synthetic embeddings are a feasible lightweight option that can supplement many downstream NLP applications, owing to their ease of synthesis, their quality relative to subword tokens, and the short time needed to generate an embedding for an unknown word.
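The approach the report describes, a feed-forward network that maps the embeddings of a word's context words to an embedding for the word itself, can be sketched roughly as follows. This is a minimal NumPy illustration with small random vectors standing in for GloVe embeddings; the network sizes, learning rate, and training setup are illustrative assumptions, not the report's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "known" embedding matrix: 50 words, 16-dim vectors (stand-ins for GloVe).
vocab_size, dim, hidden = 50, 16, 32
emb = rng.normal(size=(vocab_size, dim))

# One-hidden-layer feed-forward net: mean(context embeddings) -> target embedding.
W1 = rng.normal(scale=0.1, size=(dim, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.1, size=(hidden, dim))
b2 = np.zeros(dim)

def forward(ctx_mean):
    h = np.tanh(ctx_mean @ W1 + b1)
    return h @ W2 + b2

# Training pairs: for each known word, pretend its "context" is a few random
# other words (a real system would draw contexts from crawled web text).
contexts = [rng.choice(vocab_size, size=5, replace=False) for _ in range(vocab_size)]

lr = 0.01
for epoch in range(100):
    for w, ctx in enumerate(contexts):
        x = emb[ctx].mean(axis=0)
        h = np.tanh(x @ W1 + b1)
        pred = h @ W2 + b2
        err = pred - emb[w]                 # gradient of squared error
        dW2 = np.outer(h, err); db2 = err
        dh = (err @ W2.T) * (1 - h ** 2)    # backprop through tanh
        dW1 = np.outer(x, dh); db1 = dh
        W2 -= lr * dW2; b2 -= lr * db2
        W1 -= lr * dW1; b1 -= lr * db1

# Synthesize an embedding for an unseen word from its (toy) context words;
# this vector could then be appended to an application's embedding matrix.
oov_context = emb[[1, 4, 9]].mean(axis=0)
synthetic = forward(oov_context)
print(synthetic.shape)  # (16,)
```

The design choice worth noting is that the network is trained only on known words, for which both contexts and target embeddings exist, and is applied unchanged to OOV words, whose context embeddings are all it needs at inference time.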

Bibliographic Details
Main Author: Hoang, Viet
Other Authors: Chng Eng Siong
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2021
Subjects: Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Online Access: https://hdl.handle.net/10356/153201
Institution: Nanyang Technological University
id sg-ntu-dr.10356-153201
record_format dspace
spelling sg-ntu-dr.10356-153201 2021-11-16T05:17:58Z Synthetic word embedding generation for downstream NLP task Hoang, Viet Chng Eng Siong School of Computer Science and Engineering ASESChng@ntu.edu.sg Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence Distributional word representations such as GloVe and BERT have garnered immense popularity and research interest in recent years due to their success in many downstream NLP applications. However, a major limitation of word embeddings is their inability to handle unknown words. To make sense of words that were not present in training, current NLP models use sub-word embeddings (obtained via sub-word segmentation algorithms). However, this approach often fails to capture the semantic sense of a word, because words are broken down syntactically rather than semantically. Other approaches tackle embeddings of unknown words using ConceptNet and Recursive Neural Networks, but these did not enjoy much usage owing to their design complexity. This report presents a novel solution for generating embeddings of out-of-vocabulary (OOV) words using a neural rather than symbolic approach. The approach capitalizes on the semantics already captured in known words' embeddings and trains a simple feed-forward neural network to capture the compositionality function of embeddings in their latent space. Linguistic studies have shown that the compositionality function is broad and varied, so this report presents a preliminary study into the compositionality of nouns, with a focus on certain named entities. The trained network generates an embedding for an unknown word based on its context words, which can be obtained by crawling web data. This synthetic embedding can then be incorporated into the embedding matrix of an existing application.
From experiments on GloVe and RoBERTa embeddings, it can be concluded that synthetic embeddings are a feasible lightweight option that can supplement many downstream NLP applications, owing to their ease of synthesis, their quality relative to subword tokens, and the short time needed to generate an embedding for an unknown word. Bachelor of Engineering (Computer Engineering) 2021-11-16T02:43:17Z 2021-11-16T02:43:17Z 2021 Final Year Project (FYP) Hoang, V. (2021). Synthetic word embedding generation for downstream NLP task. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/153201 en SCSE20-0856 application/pdf Nanyang Technological University
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
spellingShingle Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Hoang, Viet
Synthetic word embedding generation for downstream NLP task
description Distributional word representations such as GloVe and BERT have garnered immense popularity and research interest in recent years due to their success in many downstream NLP applications. However, a major limitation of word embeddings is their inability to handle unknown words. To make sense of words that were not present in training, current NLP models use sub-word embeddings (obtained via sub-word segmentation algorithms). However, this approach often fails to capture the semantic sense of a word, because words are broken down syntactically rather than semantically. Other approaches tackle embeddings of unknown words using ConceptNet and Recursive Neural Networks, but these did not enjoy much usage owing to their design complexity. This report presents a novel solution for generating embeddings of out-of-vocabulary (OOV) words using a neural rather than symbolic approach. The approach capitalizes on the semantics already captured in known words' embeddings and trains a simple feed-forward neural network to capture the compositionality function of embeddings in their latent space. Linguistic studies have shown that the compositionality function is broad and varied, so this report presents a preliminary study into the compositionality of nouns, with a focus on certain named entities. The trained network generates an embedding for an unknown word based on its context words, which can be obtained by crawling web data. This synthetic embedding can then be incorporated into the embedding matrix of an existing application. From experiments on GloVe and RoBERTa embeddings, it can be concluded that synthetic embeddings are a feasible lightweight option that can supplement many downstream NLP applications, owing to their ease of synthesis, their quality relative to subword tokens, and the short time needed to generate an embedding for an unknown word.
author2 Chng Eng Siong
author_facet Chng Eng Siong
Hoang, Viet
format Final Year Project
author Hoang, Viet
author_sort Hoang, Viet
title Synthetic word embedding generation for downstream NLP task
title_short Synthetic word embedding generation for downstream NLP task
title_full Synthetic word embedding generation for downstream NLP task
title_fullStr Synthetic word embedding generation for downstream NLP task
title_full_unstemmed Synthetic word embedding generation for downstream NLP task
title_sort synthetic word embedding generation for downstream nlp task
publisher Nanyang Technological University
publishDate 2021
url https://hdl.handle.net/10356/153201
_version_ 1718368070429310976