Natural language processing as autoregressive generation

Full description

Advances in deep learning have led to great achievements in many Natural Language Processing (NLP) tasks. Because language is inherently sequential, most NLP tasks, such as text generation, can be framed as sequence learning problems. As one of the most important foundations of modern NLP, autoregressive generation models have achieved dominant performance across a wide range of NLP tasks. This thesis therefore focuses on improving autoregressive generation models for different NLP tasks.

While many tasks fit naturally into the sequence learning framework, others, e.g., building a discourse parse tree, require careful design to fit into neural models. This thesis therefore first presents a novel unified framework for discourse parsing that builds a discourse tree in a top-down, depth-first manner, framing the task as autoregressive generation in which each step predicts a node position given a piece of text. Extensive empirical experiments show that the proposed approach is effective. In addition, I extend this framework with a hierarchical decoder that leverages information from the parents and siblings of the nodes currently being processed. The proposed decoder exploits the tree structure and further improves performance on both discourse parsing and dependency parsing tasks.
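To make the autoregressive framing concrete, the following is a minimal, generic sketch of step-by-step next-token generation, where each prediction is conditioned on the tokens produced so far. The tiny GRU decoder, vocabulary size, and greedy decoding are illustrative assumptions only, not the models studied in the thesis.

```python
# Minimal, generic sketch of autoregressive generation (illustrative only; the
# GRU decoder, sizes, and greedy decoding are assumptions, not the thesis's models).
import torch
import torch.nn as nn

VOCAB_SIZE, EMB_DIM, HID_DIM, BOS, EOS = 100, 32, 64, 1, 2

class TinyDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMB_DIM)
        self.rnn = nn.GRU(EMB_DIM, HID_DIM, batch_first=True)
        self.out = nn.Linear(HID_DIM, VOCAB_SIZE)

    def step(self, token, hidden):
        # One autoregressive step: logits for p(y_t | y_<t) via the recurrent state.
        emb = self.embed(token).unsqueeze(1)        # (B, 1, E)
        out, hidden = self.rnn(emb, hidden)         # (B, 1, H)
        return self.out(out.squeeze(1)), hidden     # logits: (B, V)

@torch.no_grad()
def greedy_generate(model, max_len=20):
    token = torch.tensor([BOS])
    hidden, generated = None, []
    for _ in range(max_len):
        logits, hidden = model.step(token, hidden)
        token = logits.argmax(dim=-1)               # feed back the model's own prediction
        if token.item() == EOS:
            break
        generated.append(token.item())
    return generated

print(greedy_generate(TinyDecoder()))
```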

On the other hand, the de facto strategies for training autoregressive generation models, i.e., cross-entropy loss and teacher forcing, have been shown to be problematic in certain respects. Cross-entropy loss, one of the most widely used training objectives, often leads to text degeneration, while teacher forcing suffers from the exposure bias problem, a mismatch between the training and testing setups. For text degeneration, I introduce a class of diminishing attentions that enforces the submodularity of the coverage computed from cross-attention in sequence-to-sequence models. The proposed diminishing attentions achieve notable improvements on several neural text generation tasks, including text summarization, machine translation, and image paragraph generation. Further, I propose a novel training objective, ScaleGrad, to replace cross-entropy; it significantly reduces degeneration across different text generation tasks. ScaleGrad also extends beyond degeneration: by directly modifying the gradient information at the output layer, it offers wide flexibility to inject different inductive biases into a text generation model.

For the exposure bias problem, this thesis introduces a novel form of scheduled sampling based on training accuracy, which requires only minimal hyper-parameter tuning compared with existing scheduled sampling methods. Additionally, a novel imitation loss is proposed to further encourage the model's free-running generative behavior to match its teacher-forced behavior. Finally, this thesis demonstrates that reducing exposure bias improves the robustness of language models against repetition and toxicity errors.
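As a rough illustration of the training-side issues above, the sketch below contrasts teacher forcing (always feeding the gold token) with a scheduled-sampling step that sometimes feeds the model's own prediction, with the mixing probability meant to be driven by training accuracy. This is a hedged reading of the general idea in the abstract; the thesis's exact schedule, its imitation loss, and ScaleGrad are not reproduced here. TinyDecoder comes from the earlier sketch, and train_step and sampling_prob are hypothetical names.

```python
# Generic sketch of teacher forcing vs. accuracy-driven scheduled sampling
# (illustrative only; not the thesis's exact formulation). Reuses TinyDecoder
# and VOCAB_SIZE from the previous sketch.
import torch
import torch.nn.functional as F

def train_step(model, target, optimizer, sampling_prob):
    """target: (B, T) gold token ids; sampling_prob in [0, 1]."""
    hidden, token = None, target[:, 0]              # start from the gold BOS token
    loss, correct = 0.0, 0
    for t in range(1, target.size(1)):
        logits, hidden = model.step(token, hidden)
        loss = loss + F.cross_entropy(logits, target[:, t])
        pred = logits.argmax(dim=-1)
        correct += (pred == target[:, t]).sum().item()
        # Teacher forcing feeds the gold token; scheduled sampling sometimes
        # feeds the model's own prediction to reduce the train/test mismatch.
        token = pred.detach() if torch.rand(()) < sampling_prob else target[:, t]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return correct / (target.size(0) * (target.size(1) - 1))   # training accuracy

if __name__ == "__main__":
    model = TinyDecoder()                                        # toy decoder from the earlier sketch
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    batch = torch.randint(0, VOCAB_SIZE, (4, 12))                # fake gold sequences
    print(train_step(model, batch, opt, sampling_prob=0.1))
```

A training loop could, for instance, set sampling_prob from the running accuracy returned by train_step, so the model is exposed to its own predictions more often as it becomes more reliable; again, this is only one plausible reading of the accuracy-based schedule described in the abstract.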

Bibliographic Details
Main Author: Lin, Xiang
Other Authors: Joty, Shafiq Rayhan (School of Computer Science and Engineering)
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University, 2023
Subjects: Engineering::Computer science and engineering::Computing methodologies::Document and text processing
Online Access: https://hdl.handle.net/10356/168487
DOI: 10.32657/10356/168487
Citation: Lin, X. (2023). Natural language processing as autoregressive generation. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/168487
Rights: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
Institution: Nanyang Technological University