Synthetic word embedding generation for downstream NLP tasks

Bibliographic Details
Main Author: Hoang, Viet
Other Authors: Chng Eng Siong
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2021
Subjects:
Online Access: https://hdl.handle.net/10356/153201
Institution: Nanyang Technological University
Description
Summary: Distributional word representations such as GloVe and BERT have garnered immense popularity and research interest in recent years due to their success in many downstream NLP applications. However, a major limitation of word embeddings is their inability to handle unknown words. To make sense of words that were not present during training, current NLP models use sub-word embeddings (obtained via sub-word segmentation algorithms). However, this approach often fails to capture the semantic sense of a word, because words are broken down in a syntactic rather than semantic manner. Other approaches have tackled the embedding of unknown words using ConceptNet and recursive neural networks, but these have not seen much usage due to the complexity of their designs. This report presents a novel solution for generating embeddings for out-of-vocabulary (OOV) words using a neural rather than symbolic approach. The approach capitalizes on the semantics already captured in the embeddings of known words and trains a simple feed-forward neural network to learn the compositionality function of embeddings in their latent space. Linguistic studies have shown that the compositionality function is broad and varied, so this report presents a preliminary study of the compositionality of nouns, with a focus on certain named entities. The trained network can generate an embedding for an unknown word from its context words, which can be obtained by crawling web data. The resulting synthetic embedding can then be incorporated into the embedding matrix of an existing application. Experiments on GloVe and RoBERTa embeddings show that synthetic embeddings are a feasible, lightweight option for supplementing many downstream NLP applications, owing to their ease of synthesis, their quality relative to sub-word tokens, and the short time needed to generate an embedding for an unknown word.
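The core mechanism described in the abstract can be sketched in a few lines: pool the embeddings of an unknown word's context words and pass the result through a small feed-forward network trained to reproduce target embeddings. The sketch below is a hypothetical illustration, not the report's actual implementation; the embedding dimension, network size, toy training data, and the `synth_embedding` helper are all assumptions, and a real system would train on genuine (context, target) pairs drawn from GloVe or RoBERTa vectors.

```python
import numpy as np

# Toy stand-ins for pretrained embeddings (assumed sizes, not from the report).
rng = np.random.default_rng(0)
dim, hidden, n_words, n_ctx = 16, 32, 200, 5
E = rng.normal(size=(n_words, dim))  # pretend this is a GloVe matrix

# Toy training pairs: mean-pooled "context" embeddings -> target embedding.
# In the real setting, contexts would come from corpus/web-crawl sentences.
X = np.stack([E[rng.choice(n_words, n_ctx)].mean(axis=0) for _ in range(n_words)])
Y = E

# One-hidden-layer feed-forward network, trained with plain gradient descent
# on mean squared error (a minimal stand-in for the report's training setup).
W1 = rng.normal(scale=0.1, size=(dim, hidden))
W2 = rng.normal(scale=0.1, size=(hidden, dim))
lr = 0.05
for _ in range(300):
    H = np.tanh(X @ W1)           # hidden activations
    P = H @ W2                    # predicted embeddings
    G = 2.0 * (P - Y) / len(X)    # MSE gradient w.r.t. predictions
    W2 -= lr * (H.T @ G)
    W1 -= lr * (X.T @ ((G @ W2.T) * (1.0 - H**2)))

def synth_embedding(context_vectors):
    """Generate a synthetic embedding for an unseen word from its context words."""
    x = np.asarray(context_vectors).mean(axis=0)
    return np.tanh(x @ W1) @ W2

# The synthetic vector has the same dimensionality as the pretrained embeddings,
# so it can be appended directly to an application's embedding matrix.
vec = synth_embedding(E[:n_ctx])
```

The design choice worth noting is that the network operates entirely in the embedding space: it never sees characters or sub-word units, which is how this approach sidesteps the syntactic segmentation problem the abstract attributes to sub-word tokenizers.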