DOCUMENT SIMILARITY USING DOCUMENT LEVEL ABSTRACT MEANING REPRESENTATION GRAPH
Main Author: | |
---|---|
Format: | Theses |
Language: | Indonesian |
Online Access: | https://digilib.itb.ac.id/gdl/view/80131 |
Institution: | Institut Teknologi Bandung |
Summary: | Abstract Meaning Representation (AMR) is a semantic representation of a single
sentence. Document Abstract Meaning Representation (DocAMR) extends AMR so
that it can represent multiple sentences or a whole document. A DocAMR graph is
obtained by combining the AMR graphs of individual sentences with coreference
annotations between sentences. DocAMR generators have been developed with
various machine learning methods. However, DocAMR and other document-level
AMR graphs have not yet been applied to real-world natural language processing
tasks such as document similarity.
Document similarity (or the distance between documents) is a natural language
processing task that is especially important in information retrieval. The task
measures how similar a document is to other documents. Its applications include
dataset exploration and document recommendation. Document representation
techniques can be word-based (lexical) or semantic-based. Document similarity
research has been carried out with word-based document representations such as
bag of words, Latent Dirichlet Allocation (LDA), and paragraph vectors. However,
no research has yet explained how a semantic representation based on
document-level AMR graphs affects document similarity.
This research designs a document similarity model that uses document-level AMR
graphs as the document representation. The test data is the hand-built Wikipedia
triplet sub-dataset of the Document Similarity Triplets Dataset v1.0, prepared
through preprocessing that consists of text downloading, text cutting, text
cleaning, and sentence segmentation. The AMR graph of each sentence is generated
with a pretrained Transition-based Neural Parser model. The sentence-level AMRs
are combined into a document-level AMR using three methods: sentence
conjunction, concept merging, and DocAMR. The similarity between documents is
calculated as the Smatch score produced by the DocSmatch process.
Testing the document similarity model with the DocAMR graph representation
gave an accuracy of 65.6976%, higher than the baseline, which combines the
per-sentence AMRs by sentence conjunction and reached 65.1162%. However, the
concept merging method outperformed DocAMR with an accuracy of 77.9069%.
Prediction errors occur because pieces of text taken from
articles do not contain enough concepts or information to represent the topic of
the document and its proximity to other documents. The significant difference in
prediction accuracy between tests indicates that the Smatch scores of the
document pairs being compared differ only slightly, so further research is
recommended to use another graph-based document similarity approach, such as
Graph Edit Distance or Jaccard similarity for graphs. |
---|
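
The triplet evaluation described in the summary can be illustrated with a minimal sketch. It is not the thesis implementation, and it rests on the following assumptions. Each document graph is reduced to a set of (source concept, relation, target concept) triples. Concept merging is approximated by taking the union of the sentence triples, so identical concept labels collapse into one node. Similarity is Jaccard similarity over triples, the alternative the summary suggests for further research, rather than the Smatch score used in the thesis. Each test triplet is assumed to have the form (anchor, similar document, dissimilar document). All function and variable names are hypothetical.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

# Assumption: an edge is (source concept, relation, target concept).
Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]


def merge_concepts(sentence_graphs: Sequence[Sequence[Triple]]) -> Graph:
    """Approximate concept merging: union all sentence triples.

    Nodes are identified by their concept label, so equal labels
    from different sentences become a single node.
    """
    return frozenset(t for graph in sentence_graphs for t in graph)


def jaccard(a: Graph, b: Graph) -> float:
    """Jaccard similarity of two triple sets (stand-in for Smatch)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def triplet_accuracy(
    triplets: List[Tuple[Graph, Graph, Graph]],
    sim: Callable[[Graph, Graph], float] = jaccard,
) -> float:
    """Share of triplets where the anchor is closer to the similar
    document than to the dissimilar one."""
    correct = sum(1 for anchor, pos, neg in triplets
                  if sim(anchor, pos) > sim(anchor, neg))
    return correct / len(triplets)


# Toy documents (invented for illustration only).
doc_a = merge_concepts([
    [("want-01", ":ARG0", "boy"), ("want-01", ":ARG1", "go-02")],
    [("go-02", ":ARG0", "boy")],
])
doc_b = merge_concepts([
    [("want-01", ":ARG0", "boy")],
    [("go-02", ":ARG0", "boy")],
])
doc_c = merge_concepts([[("rain-01", ":time", "today")]])

print(jaccard(doc_a, doc_b))
print(triplet_accuracy([(doc_a, doc_b, doc_c), (doc_c, doc_a, doc_b)]))
```

In the second toy triplet both similarities are equal, so it counts as a prediction error. This mirrors the observation in the summary that small score differences between document pairs make predictions fragile.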