Deep metric based feature engineering to Improve document-level representation for document clustering

Document-level representation is attracting increasing research attention. Recent Transformer-based pretrained language models (PLMs) such as BERT learn powerful textual representations, but they are originally and inherently designed for word-level tasks, which limits their maximum input length. Current document-level approaches accommodate this limitation in various ways. Some take only the concatenation of the title and the abstract as input to the PLM, which neglects the rich semantic information in the main body of the document. Others obtain document-level representations by encoding multiple sentences in a document and concatenating them directly; however, the resulting representation may be highly redundant, and training and inference become computationally heavy for real-world applications. To alleviate these two drawbacks, we decompose the word-level-to-document-level process into two-stage feature engineering. In the first stage, the sentence-level representation of each sentence in a document is extracted by a PLM from word-level tokens, and these representations are concatenated into a document matrix. In the second stage, document matrices carrying the semantic information of all text within a document are fed into a CNN model to obtain document-level representations whose dimension is reduced 24 times. The model is optimized with a deep metric representation learning objective. Extensive experiments are conducted for hyper-parameter tuning and model design, and for the comparison among different deep metric representation learning objectives.

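The record does not include the thesis's implementation, but the two-stage pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the author's code: the choice of bert-base-uncased as the PLM, the 32-sentence cap, the 256-channel Conv1d, and triplet margin loss as the deep metric objective are assumptions made for the example; the 1024-dimensional output merely mirrors the stated 24x reduction relative to a flattened 32x768 document matrix.

```python
# Hedged sketch of the abstract's two-stage feature engineering.
# Stage 1: a PLM encodes each sentence; the sentence vectors are stacked into a document matrix.
# Stage 2: a small CNN compresses the document matrix into a compact document-level vector,
#          trained with a deep metric learning objective (triplet loss is one common choice).
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # assumed PLM
encoder = AutoModel.from_pretrained("bert-base-uncased")

def document_matrix(sentences, max_sentences=32):
    """Stage 1: encode each sentence with the PLM and stack the [CLS] vectors."""
    batch = tokenizer(sentences[:max_sentences], padding=True,
                      truncation=True, max_length=128, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch).last_hidden_state[:, 0, :]        # (n_sent, 768)
    # Pad to a fixed number of rows so every document matrix has the same shape.
    pad = torch.zeros(max_sentences - out.size(0), out.size(1))
    return torch.cat([out, pad], dim=0)                          # (max_sentences, 768)

class DocCNN(nn.Module):
    """Stage 2: CNN mapping a document matrix to a compact document vector."""
    def __init__(self, hidden=768, out_dim=32 * 768 // 24):      # ~24x smaller than 32*768
        super().__init__()
        self.conv = nn.Conv1d(hidden, 256, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveMaxPool1d(1)
        self.proj = nn.Linear(256, out_dim)

    def forward(self, doc_mat):                   # doc_mat: (batch, n_sent, hidden)
        x = self.conv(doc_mat.transpose(1, 2))    # (batch, 256, n_sent)
        x = self.pool(x).squeeze(-1)              # (batch, 256)
        return self.proj(x)                       # (batch, out_dim)

# Deep metric objective: triplet margin loss over (anchor, positive, negative)
# documents, where positives come from the same cluster/topic as the anchor.
model = DocCNN()
criterion = nn.TripletMarginLoss(margin=1.0)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

anchor = document_matrix(["A sentence about topic A.", "More on topic A."]).unsqueeze(0)
positive = document_matrix(["Another document on topic A."]).unsqueeze(0)
negative = document_matrix(["Something about an unrelated topic B."]).unsqueeze(0)

optimizer.zero_grad()
loss = criterion(model(anchor), model(positive), model(negative))
loss.backward()
optimizer.step()
```

In practice the CNN would be trained over mini-batches of such triplets mined from cluster or label information; the abstract notes that several deep metric objectives were compared, of which triplet loss is only one possibility.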

Bibliographic Details
Main Author: Xu, Liwen
Other Authors: Lihui Chen
Format: Thesis-Master by Coursework
Language: English
Published: Nanyang Technological University 2022
Subjects:
Online Access:https://hdl.handle.net/10356/163261
Institution: Nanyang Technological University
id sg-ntu-dr.10356-163261
record_format dspace
spelling sg-ntu-dr.10356-1632612022-11-30T02:23:43Z Deep metric based feature engineering to Improve document-level representation for document clustering Xu, Liwen Lihui Chen School of Electrical and Electronic Engineering ELHCHEN@ntu.edu.sg Engineering::Computer science and engineering::Computing methodologies::Document and text processing Document-level representation is attracting increasing research attention. Recent Transformer-based pretrained language models (PLMs) such as BERT learn powerful textual representations, but they are originally and inherently designed for word-level tasks, which limits their maximum input length. Current document-level approaches accommodate this limitation in various ways. Some take only the concatenation of the title and the abstract as input to the PLM, which neglects the rich semantic information in the main body of the document. Others obtain document-level representations by encoding multiple sentences in a document and concatenating them directly; however, the resulting representation may be highly redundant, and training and inference become computationally heavy for real-world applications. To alleviate these two drawbacks, we decompose the word-level-to-document-level process into two-stage feature engineering. In the first stage, the sentence-level representation of each sentence in a document is extracted by a PLM from word-level tokens, and these representations are concatenated into a document matrix. In the second stage, document matrices carrying the semantic information of all text within a document are fed into a CNN model to obtain document-level representations whose dimension is reduced 24 times. The model is optimized with a deep metric representation learning objective. Extensive experiments are conducted for hyper-parameter tuning and model design, and for the comparison among different deep metric representation learning objectives. Master of Science (Signal Processing) 2022-11-30T02:23:42Z 2022-11-30T02:23:42Z 2022 Thesis-Master by Coursework Xu, L. (2022). Deep metric based feature engineering to Improve document-level representation for document clustering. Master's thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/163261 https://hdl.handle.net/10356/163261 en application/pdf Nanyang Technological University
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering::Computing methodologies::Document and text processing
spellingShingle Engineering::Computer science and engineering::Computing methodologies::Document and text processing
Xu, Liwen
Deep metric based feature engineering to Improve document-level representation for document clustering
description Document-level representation is attracting increasing research attention. Recent Transformer-based pretrained language models (PLMs) such as BERT learn powerful textual representations, but they are originally and inherently designed for word-level tasks, which limits their maximum input length. Current document-level approaches accommodate this limitation in various ways. Some take only the concatenation of the title and the abstract as input to the PLM, which neglects the rich semantic information in the main body of the document. Others obtain document-level representations by encoding multiple sentences in a document and concatenating them directly; however, the resulting representation may be highly redundant, and training and inference become computationally heavy for real-world applications. To alleviate these two drawbacks, we decompose the word-level-to-document-level process into two-stage feature engineering. In the first stage, the sentence-level representation of each sentence in a document is extracted by a PLM from word-level tokens, and these representations are concatenated into a document matrix. In the second stage, document matrices carrying the semantic information of all text within a document are fed into a CNN model to obtain document-level representations whose dimension is reduced 24 times. The model is optimized with a deep metric representation learning objective. Extensive experiments are conducted for hyper-parameter tuning and model design, and for the comparison among different deep metric representation learning objectives.
author2 Lihui Chen
author_facet Lihui Chen
Xu, Liwen
format Thesis-Master by Coursework
author Xu, Liwen
author_sort Xu, Liwen
title Deep metric based feature engineering to Improve document-level representation for document clustering
title_short Deep metric based feature engineering to Improve document-level representation for document clustering
title_full Deep metric based feature engineering to Improve document-level representation for document clustering
title_fullStr Deep metric based feature engineering to Improve document-level representation for document clustering
title_full_unstemmed Deep metric based feature engineering to Improve document-level representation for document clustering
title_sort deep metric based feature engineering to improve document-level representation for document clustering
publisher Nanyang Technological University
publishDate 2022
url https://hdl.handle.net/10356/163261
_version_ 1751548517975851008