Synthetic word embedding generation for downstream NLP task
Format: Final Year Project
Language: English
Published: Nanyang Technological University, 2021
Online Access: https://hdl.handle.net/10356/153201
Institution: Nanyang Technological University
Summary: Distributional word representations such as GloVe and BERT have garnered immense popularity and research interest in recent years due to their success in many downstream NLP applications. However, a major limitation of word embeddings is their inability to handle unknown words. To make sense of words that were not present during training, current NLP models use sub-word embeddings (obtained via sub-word segmentation algorithms). However, this approach often fails to capture the semantic sense of a word, because words are broken down syntactically rather than semantically. Other approaches have tackled the embedding of unknown words using ConceptNet and Recursive Neural Networks, but these did not enjoy much usage due to their design complexity. This report presents a novel solution for generating embeddings for out-of-vocabulary (OOV) words using a neural rather than symbolic approach. The approach capitalizes on the semantics already captured in known words' embeddings and trains a simple feed-forward neural network to learn the compositionality function of embeddings in their latent space. Linguistic studies have shown that the compositionality function is broad and varied, so this report introduces a preliminary study into the compositionality of nouns, with a focus on certain named entities. The trained network can generate an embedding for an unknown word from its context words, which can be obtained by crawling web data. This synthetic embedding can then be incorporated into the embedding matrix of an existing application. From the experiments on GloVe and RoBERTa embeddings, it can be concluded that synthetic embeddings are a feasible lightweight option that can supplement many downstream NLP applications, given their ease of synthesis, their quality relative to sub-word tokens, and the short time needed to generate embeddings for unknown words.
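The pipeline the summary describes — pooling the embeddings of an unknown word's context words and passing them through a small feed-forward network to produce a synthetic embedding — can be sketched as below. This is a minimal illustration under assumed dimensions and an assumed one-hidden-layer architecture with random (untrained) weights; it is not the report's actual model, training procedure, or hyperparameters.

```python
import numpy as np

EMB_DIM = 50   # embedding dimensionality (e.g. GloVe-50); assumption
HIDDEN = 128   # hidden layer width; assumption

# Random, untrained weights stand in for the trained compositionality network.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((EMB_DIM, HIDDEN)) * 0.1
b1 = np.zeros(HIDDEN)
W2 = rng.standard_normal((HIDDEN, EMB_DIM)) * 0.1
b2 = np.zeros(EMB_DIM)

def synthesize_embedding(context_vectors: np.ndarray) -> np.ndarray:
    """Compose context-word embeddings (n x EMB_DIM) into one synthetic
    embedding for the out-of-vocabulary word."""
    pooled = context_vectors.mean(axis=0)        # simple mean pooling
    hidden = np.maximum(0.0, pooled @ W1 + b1)   # ReLU hidden layer
    return hidden @ W2 + b2                      # synthetic OOV embedding

# Usage: pretend these rows are embeddings of context words crawled from
# the web; the result could be appended to an existing embedding matrix.
context = rng.standard_normal((5, EMB_DIM))
vec = synthesize_embedding(context)
print(vec.shape)  # (50,)
```

In a trained version, the network weights would be fit so that the synthesized vector for a held-out known word approximates that word's true embedding given its contexts; the sketch only shows the forward pass.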