Natural language processing as autoregressive generation

Full description

Advances in deep learning have led to great achievements in many Natural Language Processing (NLP) tasks. Because language is inherently sequential, most NLP tasks, such as text generation, can be framed as sequence learning problems. As one of the most important foundations of modern NLP, autoregressive generation models have achieved dominant performance across a wide range of NLP tasks. This thesis therefore focuses on improving autoregressive generation models for different NLP tasks.

While many tasks fit naturally into the sequence learning framework, others, e.g., building a discourse parse tree, require careful design to fit into neural models. This thesis therefore first presents a novel unified framework for discourse parsing that builds a discourse tree in a top-down, depth-first manner, framing the task as autoregressive generation in which each step predicts a node position given a piece of text. Extensive empirical experiments show that the proposed approach is effective. In addition, I extend this framework with a hierarchical decoder that leverages information from the parents and siblings of the nodes currently being processed. The proposed decoder exploits the tree structure and further improves performance on both discourse parsing and dependency parsing tasks.
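To make the autoregressive framing concrete, the following is a minimal, generic sketch of step-by-step next-token generation, where each prediction is conditioned on the tokens produced so far. The tiny GRU decoder, vocabulary size, and greedy decoding are illustrative assumptions only, not the models studied in the thesis.

```python
# Minimal, generic sketch of autoregressive generation (illustrative only; the
# GRU decoder, sizes, and greedy decoding are assumptions, not the thesis's models).
import torch
import torch.nn as nn

VOCAB_SIZE, EMB_DIM, HID_DIM, BOS, EOS = 100, 32, 64, 1, 2

class TinyDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMB_DIM)
        self.rnn = nn.GRU(EMB_DIM, HID_DIM, batch_first=True)
        self.out = nn.Linear(HID_DIM, VOCAB_SIZE)

    def step(self, token, hidden):
        # One autoregressive step: logits for p(y_t | y_<t) via the recurrent state.
        emb = self.embed(token).unsqueeze(1)        # (B, 1, E)
        out, hidden = self.rnn(emb, hidden)         # (B, 1, H)
        return self.out(out.squeeze(1)), hidden     # logits: (B, V)

@torch.no_grad()
def greedy_generate(model, max_len=20):
    token = torch.tensor([BOS])
    hidden, generated = None, []
    for _ in range(max_len):
        logits, hidden = model.step(token, hidden)
        token = logits.argmax(dim=-1)               # feed back the model's own prediction
        if token.item() == EOS:
            break
        generated.append(token.item())
    return generated

print(greedy_generate(TinyDecoder()))
```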

On the other hand, the de facto strategies for training autoregressive generation models, i.e., cross-entropy loss and teacher forcing, have been shown to be problematic in certain respects. Cross-entropy loss, one of the most widely used training objectives, often leads to text degeneration, while teacher forcing suffers from the exposure bias problem, a mismatch between the training and testing setups. For text degeneration, I introduce a class of diminishing attentions that enforces the submodularity of the coverage computed from cross-attention in sequence-to-sequence models. The proposed diminishing attentions achieve notable improvements on several neural text generation tasks, including text summarization, machine translation, and image paragraph generation. Further, I propose a novel training objective, ScaleGrad, to replace cross-entropy; it significantly reduces degeneration across different text generation tasks. ScaleGrad also extends beyond degeneration: by directly modifying the gradient information at the output layer, it offers wide flexibility to inject different inductive biases into a text generation model.

For the exposure bias problem, this thesis introduces a novel form of scheduled sampling based on training accuracy, which requires only minimal hyper-parameter tuning compared with existing scheduled sampling methods. Additionally, a novel imitation loss is proposed to further encourage the model's free-running generative behavior to match its teacher-forced behavior. Finally, this thesis demonstrates that reducing exposure bias improves the robustness of language models against repetition and toxicity errors.
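As a rough illustration of the training-side issues above, the sketch below contrasts teacher forcing (always feeding the gold token) with a scheduled-sampling step that sometimes feeds the model's own prediction, with the mixing probability meant to be driven by training accuracy. This is a hedged reading of the general idea in the abstract; the thesis's exact schedule, its imitation loss, and ScaleGrad are not reproduced here. TinyDecoder comes from the earlier sketch, and train_step and sampling_prob are hypothetical names.

```python
# Generic sketch of teacher forcing vs. accuracy-driven scheduled sampling
# (illustrative only; not the thesis's exact formulation). Reuses TinyDecoder
# and VOCAB_SIZE from the previous sketch.
import torch
import torch.nn.functional as F

def train_step(model, target, optimizer, sampling_prob):
    """target: (B, T) gold token ids; sampling_prob in [0, 1]."""
    hidden, token = None, target[:, 0]              # start from the gold BOS token
    loss, correct = 0.0, 0
    for t in range(1, target.size(1)):
        logits, hidden = model.step(token, hidden)
        loss = loss + F.cross_entropy(logits, target[:, t])
        pred = logits.argmax(dim=-1)
        correct += (pred == target[:, t]).sum().item()
        # Teacher forcing feeds the gold token; scheduled sampling sometimes
        # feeds the model's own prediction to reduce the train/test mismatch.
        token = pred.detach() if torch.rand(()) < sampling_prob else target[:, t]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return correct / (target.size(0) * (target.size(1) - 1))   # training accuracy

if __name__ == "__main__":
    model = TinyDecoder()                                        # toy decoder from the earlier sketch
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    batch = torch.randint(0, VOCAB_SIZE, (4, 12))                # fake gold sequences
    print(train_step(model, batch, opt, sampling_prob=0.1))
```

A training loop could, for instance, set sampling_prob from the running accuracy returned by train_step, so the model is exposed to its own predictions more often as it becomes more reliable; again, this is only one plausible reading of the accuracy-based schedule described in the abstract.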

Bibliographic Details
Main Author: Lin, Xiang
Other Authors: Joty, Shafiq Rayhan (School of Computer Science and Engineering)
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University, 2023
Subjects: Engineering::Computer science and engineering::Computing methodologies::Document and text processing
Online Access: https://hdl.handle.net/10356/168487
DOI: 10.32657/10356/168487
Citation: Lin, X. (2023). Natural language processing as autoregressive generation. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/168487
Rights: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
Institution: Nanyang Technological University