Data-driven and NLP for long document learning representation


Bibliographic Details
Main Author: Ko, Seoyoon
Other Authors: Lihui CHEN
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2021
Subjects:
Online Access:https://hdl.handle.net/10356/150258
Institution: Nanyang Technological University
id sg-ntu-dr.10356-150258
record_format dspace
spelling sg-ntu-dr.10356-1502582023-07-07T18:20:22Z Data-driven and NLP for long document learning representation Ko, Seoyoon Lihui CHEN School of Electrical and Electronic Engineering ELHCHEN@ntu.edu.sg Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence Engineering::Electrical and electronic engineering Natural language processing (NLP) has been advancing at an incredible pace. However, long-document representation has not been deeply explored despite its importance. Semantic matching for long documents has many applications, including citation and article recommendation. In an increasingly data-driven world, these applications are becoming an integral part of our society. However, due to the length of long documents, capturing their semantics remains a challenge. Recently, the Siamese multi-depth attention-based hierarchical (SMASH) model was proposed, which uses document structure to capture the semantics of a long document. In this project, a novel Siamese Hierarchical Weight Sharing Transformer (SHWEST) and a Siamese Hierarchical Transformer (SHT) are proposed based on SMASH. These models aim to improve long-document representations using the state-of-the-art transformer encoder architecture. Three document representations are explored in this report, namely paragraph and sentence level (P+S), paragraph level (P), and sentence level (S). The report aims to determine how effective the different hierarchical document representations are for both SHT and SHWEST in capturing the semantics of a long document. Experimental studies were conducted to compare SHWEST and SHT against SMASH and RNN models on the AAN benchmark dataset. Experiments showed that the SHT and SHWEST models outperform all baseline models, including SMASH, for all three representations, and are more efficient, requiring less time for all three combinations.
Generally, the P+S and S representations perform better than the P representation. In particular, SHWEST (S) achieves 13.87% higher accuracy than the RNN model, while SHWEST (P+S) achieves 12.89% higher accuracy. Moreover, SHWEST outperforms SHT in all aspects. SHT performs only slightly better than SMASH but is more efficient, with P+S being at least 2.8 times faster. Furthermore, both SHWEST and SHT have the potential to be further optimized when more computing resources are available. Bachelor of Engineering (Information Engineering and Media) 2021-06-13T07:44:53Z 2021-06-13T07:44:53Z 2021 Final Year Project (FYP) Ko, S. (2021). Data-driven and NLP for long document learning representation. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/150258 https://hdl.handle.net/10356/150258 en A3047-201 application/pdf Nanyang Technological University
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Engineering::Electrical and electronic engineering
spellingShingle Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Engineering::Electrical and electronic engineering
Ko, Seoyoon
Data-driven and NLP for long document learning representation
description Natural language processing (NLP) has been advancing at an incredible pace. However, long-document representation has not been deeply explored despite its importance. Semantic matching for long documents has many applications, including citation and article recommendation. In an increasingly data-driven world, these applications are becoming an integral part of our society. However, due to the length of long documents, capturing their semantics remains a challenge. Recently, the Siamese multi-depth attention-based hierarchical (SMASH) model was proposed, which uses document structure to capture the semantics of a long document. In this project, a novel Siamese Hierarchical Weight Sharing Transformer (SHWEST) and a Siamese Hierarchical Transformer (SHT) are proposed based on SMASH. These models aim to improve long-document representations using the state-of-the-art transformer encoder architecture. Three document representations are explored in this report, namely paragraph and sentence level (P+S), paragraph level (P), and sentence level (S). The report aims to determine how effective the different hierarchical document representations are for both SHT and SHWEST in capturing the semantics of a long document. Experimental studies were conducted to compare SHWEST and SHT against SMASH and RNN models on the AAN benchmark dataset. Experiments showed that the SHT and SHWEST models outperform all baseline models, including SMASH, for all three representations, and are more efficient, requiring less time for all three combinations. Generally, the P+S and S representations perform better than the P representation. In particular, SHWEST (S) achieves 13.87% higher accuracy than the RNN model, while SHWEST (P+S) achieves 12.89% higher accuracy. Moreover, SHWEST outperforms SHT in all aspects.
SHT performs only slightly better than SMASH but is more efficient, with P+S being at least 2.8 times faster. Furthermore, both SHWEST and SHT have the potential to be further optimized when more computing resources are available.
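The abstract above describes a Siamese, weight-sharing design that builds a document vector hierarchically (sentences pooled into paragraphs, paragraphs pooled into a document) before comparing two documents. The following is a minimal illustrative sketch of that matching pattern only, not the report's actual SHWEST/SHT code: the report uses transformer encoders, whereas here simple mean-pooling stands in for the learned encoders, and all names (`encode_document`, `siamese_match`) are hypothetical.

```python
# Illustrative sketch of Siamese hierarchical matching (P+S style).
# Mean-pooling stands in for the report's transformer encoders; the key
# ideas shown are (1) sentence -> paragraph -> document aggregation and
# (2) weight sharing: the SAME encoder is applied to both documents.
import math
from typing import List

Vector = List[float]
Document = List[List[Vector]]  # paragraphs -> sentences -> embedding


def mean(vectors: List[Vector]) -> Vector:
    """Average equal-length vectors (a stand-in for a learned encoder)."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def encode_document(doc: Document) -> Vector:
    """Hierarchical encoding: pool sentence vectors into paragraph
    vectors, then paragraph vectors into one document vector."""
    paragraph_vecs = [mean(sentences) for sentences in doc]
    return mean(paragraph_vecs)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def siamese_match(doc_a: Document, doc_b: Document) -> float:
    # Weight sharing: both inputs pass through the same encoder.
    return cosine(encode_document(doc_a), encode_document(doc_b))


# Two toy documents with 2-dimensional "sentence embeddings".
doc1 = [[[1.0, 0.0], [0.8, 0.2]], [[0.9, 0.1]]]
doc2 = [[[1.0, 0.0]], [[0.7, 0.3], [0.9, 0.1]]]
print(siamese_match(doc1, doc2))
```

In this sketch, the paragraph-level (P) and sentence-level (S) variants from the report would correspond to pooling over only one level of the hierarchy rather than both.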
author2 Lihui CHEN
author_facet Lihui CHEN
Ko, Seoyoon
format Final Year Project
author Ko, Seoyoon
author_sort Ko, Seoyoon
title Data-driven and NLP for long document learning representation
title_short Data-driven and NLP for long document learning representation
title_full Data-driven and NLP for long document learning representation
title_fullStr Data-driven and NLP for long document learning representation
title_full_unstemmed Data-driven and NLP for long document learning representation
title_sort data-driven and nlp for long document learning representation
publisher Nanyang Technological University
publishDate 2021
url https://hdl.handle.net/10356/150258
_version_ 1772828115836338176