DOCUMENT SIMILARITY USING DOCUMENT LEVEL ABSTRACT MEANING REPRESENTATION GRAPH

Abstract Meaning Representation (AMR) is a semantic representation of a single sentence. Document Abstract Meaning Representation (DocAMR) expands the AMR function so that it can represent many sentences or a single document. DocAMR is obtained from a graph that combines the AMR of individual sen...

Bibliographic Details
Main Author:	Widianto, Adi
Format:	Theses
Language:	Indonesia
Online Access:	https://digilib.itb.ac.id/gdl/view/80131
Institution:	Institut Teknologi Bandung

id	id-itb.:80131
spelling	id-itb.:80131; 2024-01-18T15:47:10Z; Widianto, Adi; Indonesia; Theses; Document Similarity, Abstract Meaning Representation, DocAMR, Transition-based Neural Parser, DocSmatch; INSTITUT TEKNOLOGI BANDUNG; https://digilib.itb.ac.id/gdl/view/80131; text
institution	Institut Teknologi Bandung
building	Institut Teknologi Bandung Library
continent	Asia
country	Indonesia
content_provider	Institut Teknologi Bandung
collection	Digital ITB
language	Indonesia
description	Abstract Meaning Representation (AMR) is a semantic representation of a single sentence. Document Abstract Meaning Representation (DocAMR) extends AMR so that it can represent multiple sentences or an entire document. A DocAMR graph is obtained by combining the AMR graphs of individual sentences with coreference annotations between sentences. DocAMR generators have been developed with various machine learning methods. However, DocAMR and other document-level AMR graphs have not yet been applied in practice to natural language processing tasks such as document similarity. Document similarity (or the distance between documents) is a natural language processing task, used especially in information retrieval, that calculates how similar a document is to other documents. Its applications include dataset exploration and document recommendation. Document representation techniques can be word-based (lexical) or semantic-based. Document similarity has been studied with word-based representations such as bag of words, Latent Dirichlet Allocation (LDA), and paragraph vectors, but no research has explained the influence of a semantic representation based on document-level AMR graphs on document similarity. This research designs a document similarity model that uses document-level AMR graphs as the document representation. The test data is the hand-built Wikipedia triplet sub-dataset of the Document Similarity Triplets dataset v1.0, prepared through text downloading, text cutting, text cleaning, and sentence segmentation. The AMR graph of each sentence is generated with a pretrained Transition-based Neural Parser model. The sentence-level AMR graphs are combined into a document-level AMR graph using three methods: sentence conjunction, concept merging, and DocAMR. The similarity between documents is the Smatch score produced by the DocSmatch process.
The document similarity model using the DocAMR graph representation reached an accuracy of 65.6976%, higher than the 65.1162% of the baseline, which combines the per-sentence AMR graphs by sentence conjunction. However, the concept merging method used for comparison outperformed DocAMR with an accuracy of 77.9069%. Prediction errors occur because the pieces of text taken from the articles do not contain enough concepts or information to represent the topic and proximity of the document. The significant difference in prediction accuracy between tests indicates that the Smatch values of the document pairs being compared differ only slightly, so further research is recommended to use another graph-based document similarity calculation, such as Graph Edit Distance or Jaccard similarity for graphs.
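The triplet-based accuracy reported in the description, together with the Jaccard similarity it recommends for further research, can be illustrated with a minimal Python sketch. This is not the thesis implementation and it does not compute Smatch: it assumes each document graph is already reduced to a set of (source concept, relation, target concept) triples, and that each test triplet is ordered as (anchor, similar document, less similar document). The names Triple, Graph, jaccard, and triplet_accuracy, as well as the toy graphs, are hypothetical.

```python
from typing import Callable, FrozenSet, Iterable, Tuple

# Hypothetical representation: a document graph as a set of
# (source concept, relation, target concept) triples.
Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]


def jaccard(a: Graph, b: Graph) -> float:
    """Jaccard similarity of two triple sets: |A & B| / |A | B|."""
    union = a | b
    if not union:
        return 1.0  # two empty graphs are treated as identical
    return len(a & b) / len(union)


def triplet_accuracy(
    triplets: Iterable[Tuple[Graph, Graph, Graph]],
    sim: Callable[[Graph, Graph], float] = jaccard,
) -> float:
    """Fraction of (anchor, similar, dissimilar) triplets for which the
    anchor scores strictly higher against the similar document."""
    items = list(triplets)
    if not items:
        raise ValueError("at least one triplet is required")
    correct = sum(1 for anchor, pos, neg in items if sim(anchor, pos) > sim(anchor, neg))
    return correct / len(items)


# Toy graphs (assumed data, not taken from the dataset).
anchor = frozenset({("eat-01", ":ARG0", "cat"), ("eat-01", ":ARG1", "fish")})
similar = frozenset({("eat-01", ":ARG0", "cat"), ("eat-01", ":ARG1", "mouse")})
different = frozenset({("drive-01", ":ARG0", "person"), ("drive-01", ":ARG1", "car")})

print(jaccard(anchor, similar))    # one shared triple out of three distinct ones
print(jaccard(anchor, different))  # no shared triples
print(triplet_accuracy([(anchor, similar, different)]))
```

Because the similarity function is passed in as a parameter, a Smatch-based score could be substituted for jaccard without changing the accuracy computation.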
format	Theses
author	Widianto, Adi
title	DOCUMENT SIMILARITY USING DOCUMENT LEVEL ABSTRACT MEANING REPRESENTATION GRAPH
url	https://digilib.itb.ac.id/gdl/view/80131
_version_	1822996678563594240

Similar Items