DOCUMENT SIMILARITY USING DOCUMENT LEVEL ABSTRACT MEANING REPRESENTATION GRAPH


Bibliographic Details
Main Author: Widianto, Adi
Format: Theses
Language: Indonesia
Online Access: https://digilib.itb.ac.id/gdl/view/80131
Institution: Institut Teknologi Bandung
id id-itb.:80131
spelling id-itb.:80131 2024-01-18T15:47:10Z DOCUMENT SIMILARITY USING DOCUMENT LEVEL ABSTRACT MEANING REPRESENTATION GRAPH Widianto, Adi Indonesia Theses Document Similarity, Abstract Meaning Representation, DocAMR, Transition-based Neural Parser, DocSmatch INSTITUT TEKNOLOGI BANDUNG https://digilib.itb.ac.id/gdl/view/80131
institution Institut Teknologi Bandung
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
language Indonesia
description Abstract Meaning Representation (AMR) is a semantic representation of a single sentence. Document Abstract Meaning Representation (DocAMR) extends AMR so that it can represent multiple sentences or a whole document. A DocAMR graph is obtained by combining the AMR graphs of individual sentences with coreference annotations between sentences. DocAMR generators have been developed with various machine learning methods. However, DocAMR and other document-level AMR graphs have not yet been applied to real-world natural language processing tasks such as document similarity. Document similarity (or the distance between documents) is a natural language processing task, used especially in information retrieval, that measures how similar a document is to other documents. Its applications include dataset exploration and document recommendation. Document representation techniques can be word-based (lexical) or semantic-based. Document similarity has been studied with word-based document representations such as bag of words, Latent Dirichlet Allocation (LDA), and paragraph vectors, but no research has examined how a semantic representation based on document-level AMR graphs affects document similarity. This research designs a document similarity model that uses document-level AMR graphs as the document representation. The test data are the hand-built Wikipedia triplets sub-dataset of the Document Similarity Triplets dataset v1.0, prepared through text downloading, text cutting, text cleaning, and sentence segmentation. The AMR graph of each sentence is generated with a pretrained Transition-based Neural Parser model. The sentence-level AMR graphs are combined into a document-level AMR graph using three methods: sentence conjunction, concept merging, and DocAMR. The similarity between documents is calculated as the Smatch score produced by the DocSmatch process.
Testing the document similarity model with the DocAMR graph representation gave an accuracy of 65.6976%, higher than the 65.1162% of the baseline, which combines the per-sentence AMR graphs by sentence conjunction. However, the concept merging method outperformed DocAMR with an accuracy of 77.9069%. Prediction errors occur because the pieces of text taken from the articles do not contain enough concepts or information to represent the topic and proximity of the document. The significant difference in prediction accuracy between tests indicates that the Smatch values of the document pairs being compared differ only slightly, so further research is recommended to use another graph-based document similarity approach, such as Graph Edit Distance or Jaccard similarity for graphs.
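The evaluation idea described above (score document pairs by graph overlap, then check each triplet) might be sketched as follows. This is only an illustration under stated assumptions, not the thesis implementation: each document graph is assumed to be a set of (source concept, relation, target concept) triples, each triplet is assumed to be ordered as (query, similar document, dissimilar document), and `triple_f1` is a hypothetical stand-in that computes a plain F1 overlap without the variable-alignment search that Smatch and DocSmatch perform.

```python
# Minimal sketch: triplet accuracy with a simplified, Smatch-like overlap score.
# Assumptions (not from the source): graphs are sets of
# (source, relation, target) triples; triplets are (query, near, far).

def triple_f1(graph_a, graph_b):
    """F1 overlap of two triple sets. Unlike Smatch, no node alignment is searched."""
    if not graph_a or not graph_b:
        return 0.0
    matched = len(graph_a & graph_b)
    if matched == 0:
        return 0.0
    precision = matched / len(graph_a)
    recall = matched / len(graph_b)
    return 2 * precision * recall / (precision + recall)


def triplet_accuracy(triplets, similarity):
    """Fraction of triplets where the query is scored closer to `near` than to `far`."""
    correct = sum(
        1 for query, near, far in triplets
        if similarity(query, near) > similarity(query, far)
    )
    return correct / len(triplets)


# Toy graphs, invented for illustration only.
doc_query = {("chase-01", ":ARG0", "cat"), ("chase-01", ":ARG1", "mouse")}
doc_near = {("chase-01", ":ARG0", "cat"), ("chase-01", ":ARG1", "bird")}
doc_far = {("lend-01", ":ARG0", "bank"), ("lend-01", ":ARG1", "money")}

pair_score = triple_f1(doc_query, doc_near)
accuracy = triplet_accuracy([(doc_query, doc_near, doc_far)], triple_f1)
print(pair_score, accuracy)
```

A triplet counts as a prediction error whenever the two scores are tied or reversed, which is why small differences between pairwise scores can translate directly into lower accuracy.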
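As a rough illustration of the Jaccard similarity for graphs recommended above, the sketch below compares two graphs by the overlap of their edge sets. It assumes, hypothetically, that each document-level graph has already been reduced to a set of (source concept, relation, target concept) triples; the function name and the toy data are not from the thesis, and treating two empty graphs as identical is a convention chosen here.

```python
# Minimal sketch: Jaccard similarity over graph edge (triple) sets.
# Assumption (not from the source): graphs are sets of
# (source, relation, target) triples keyed by concept labels.

def graph_jaccard(graph_a, graph_b):
    """Size of the shared triple set divided by the size of the combined triple set."""
    union = graph_a | graph_b
    if not union:
        return 1.0  # convention chosen here: two empty graphs count as identical
    return len(graph_a & graph_b) / len(union)


# Toy graphs, invented for illustration only.
graph_one = {("chase-01", ":ARG0", "cat"), ("chase-01", ":ARG1", "mouse")}
graph_two = {("chase-01", ":ARG0", "cat"), ("chase-01", ":ARG1", "bird")}

score = graph_jaccard(graph_one, graph_two)  # one shared triple out of three distinct
print(score)
```

Unlike an F1-style overlap, this score penalizes every unmatched triple in either graph through the union, so it may spread scores differently across document pairs; whether that helps on this dataset is an open question.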
format Theses
author Widianto, Adi
title DOCUMENT SIMILARITY USING DOCUMENT LEVEL ABSTRACT MEANING REPRESENTATION GRAPH
url https://digilib.itb.ac.id/gdl/view/80131