WORD EMBEDDING IN MULTI-DOCUMENT NEWS SUMMARIZATION USING SENTENCE FUSION

The volume of publicly available information, and the demand for it, is now very large. Many online news websites regularly post similar information about the same topics. This produces a great deal of duplicated information, thus mo...

Bibliographic Details
Main Author: CHRISTIE, FELICIA (NIM: 23516083)
Format: Theses
Language: Indonesia
Online Access: https://digilib.itb.ac.id/gdl/view/27271
Institution: Institut Teknologi Bandung
id id-itb.:27271
spelling id-itb.:27271 2018-09-28T09:08:21Z
institution Institut Teknologi Bandung
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
language Indonesia
description The volume of publicly available information, and the demand for it, is now very large. Many online news websites regularly post similar information about the same topics. This produces a great deal of duplicated information, so more time is needed to process all the relevant data on a topic. Summarization systems are therefore needed as a way to reduce the processing time.

This thesis presents a summarization system with minimal dependence on natural language processing resources: it uses only an Indonesian POS tagger, word embedding models trained by unsupervised learning, and a list of Indonesian stopwords. The method creates a summary in seven main steps: tokenization, POS tagging, term weighting with TF-IDF and word embeddings, clustering, sentence fusion with word graphs, extraction of the fused sentences, and finally sentence selection with integer linear programming. Evaluation is conducted with ROUGE 2, focusing mainly on ROUGE-1 and ROUGE-2.

Several datasets are used for tuning to obtain the optimal configuration, which is then applied to 5 test sets. In the experiments, the best score is obtained with an Indonesian Word2Vec model for term weighting in clustering. The final average ROUGE-2 scores are 0.231 for 100-word documents and 0.319 for 200-word documents.
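
To make the reported metric concrete, the following is a minimal sketch of ROUGE-N recall (clipped n-gram overlap divided by the number of n-grams in the reference summary). It is an illustration only, not the evaluation setup used in the thesis: the function names `ngrams` and `rouge_n_recall`, the lowercase whitespace tokenization, the single-reference setting, and the two example sentences are all assumptions introduced here. ROUGE toolkits typically offer further options, such as stemming, stopword removal, and multiple references, that this sketch leaves out.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n):
    """ROUGE-N recall against one reference summary.

    Assumption: tokens are lowercase, whitespace-separated words.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clip each n-gram's match count at its count in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


# Hypothetical example sentences, not taken from the thesis data.
reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
r1 = rouge_n_recall(candidate, reference, 1)  # unigram recall (ROUGE-1)
r2 = rouge_n_recall(candidate, reference, 2)  # bigram recall (ROUGE-2)
print(f"ROUGE-1 recall: {r1:.3f}, ROUGE-2 recall: {r2:.3f}")
```

In this example a single substituted word removes one of six reference unigrams but two of five reference bigrams, which is why ROUGE-2 is usually the stricter of the two scores.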
format Theses
author CHRISTIE, FELICIA (NIM: 23516083)
title WORD EMBEDDING IN MULTI-DOCUMENT NEWS SUMMARIZATION USING SENTENCE FUSION
url https://digilib.itb.ac.id/gdl/view/27271
_version_ 1821934329790464000