IMAGE CAPTIONING WITH EMOTION USING ENCODER-DECODER FRAMEWORK LSTM AND FACTORED LSTM

Bibliographic Details
Main Author: Rahman Ahaddienata, Dery
Format: Final Project
Language: Indonesia
Online Access:https://digilib.itb.ac.id/gdl/view/39912
Institution: Institut Teknologi Bandung
Description
Summary: Image captioning with emotion is the process of generating a meaningful word sequence that describes an image while adding a specific style to the sentence. Although there are several studies on generating image captions with emotions, no such research exists yet for the Indonesian language. In this final project, an encoder-decoder framework is used with ResNet152 as the encoder and a Long Short-Term Memory (LSTM) network as the decoder, as in the Neural Image Captioning (NIC) study. Another LSTM variant, the factored LSTM introduced in the StyleNet research, is also used to generate image captions with emotions. An attention mechanism is added to both architectures to improve their evaluation scores. The learning method combines transfer learning and multi-task learning. Two kinds of evaluation are used: automatic evaluation with BLEU metrics, and manual evaluation through surveys that assess how attractive the emotional sentences are compared with the factual ones. Two types of datasets are used: a dataset of factual sentences with about 8000 entries, and three datasets of emotional sentences for happy, sad, and angry emotions. The emotion datasets were created by an annotator, with roughly 1000 sentences collected for each emotion. Experiments were conducted to obtain the highest BLEU score on each of the factual and emotion datasets, and the best models were then evaluated through surveys. All models are trained end-to-end. For factual sentences, the best result is achieved by the NIC architecture with the attention mechanism, with a BLEU-4 score of 0.22. For emotional sentences, the best results are achieved by the StyleNet architecture with the attention mechanism; the BLEU-4 scores for happy, sad, and angry emotions are 0.08, 0.09, and 0.10 respectively. In addition, the survey evaluation shows a high level of attractiveness for the models that produce emotional sentences: 1.875% for factual sentences, and 83.75%, 92.5%, and 87.5% for happy, sad, and angry sentences.
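
The abstract describes an NIC-style encoder-decoder: a pretrained ResNet152 that encodes the image into a feature vector, and an LSTM that decodes it into a word sequence. The following is a minimal PyTorch sketch of that pairing; the class and parameter names (ResNetEncoder, CaptionDecoder, embed_dim, hidden_dim) are illustrative assumptions rather than the thesis code, and the attention mechanism mentioned in the abstract is omitted for brevity.

```python
import torch
import torch.nn as nn
from torchvision import models

class ResNetEncoder(nn.Module):
    """Encodes an image into a single feature vector with a pretrained ResNet152."""
    def __init__(self, embed_dim):
        super().__init__()
        resnet = models.resnet152(weights=models.ResNet152_Weights.DEFAULT)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the classifier head
        self.project = nn.Linear(resnet.fc.in_features, embed_dim)

    def forward(self, images):
        with torch.no_grad():                      # transfer learning: keep the CNN frozen
            feats = self.backbone(images).flatten(1)
        return self.project(feats)

class CaptionDecoder(nn.Module):
    """LSTM decoder: the image feature is fed as the first time step, then the
    embedded caption tokens follow (teacher forcing during training)."""
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feat, captions):
        inputs = torch.cat([image_feat.unsqueeze(1), self.embed(captions)], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                    # per-step vocabulary logits
```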
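
The factored LSTM used for styled captions (as in StyleNet) replaces the input-to-hidden weight matrix of a standard LSTM with a product U S_k V, where U and V are shared across styles and S_k is specific to one style (factual, happy, sad, or angry), so the small emotion datasets only have to train their own S_k. The cell below is a hedged sketch of that idea; the shapes, initialisation, and style indexing are assumptions made for illustration, not taken from the thesis.

```python
import torch
import torch.nn as nn

class FactoredLSTMCell(nn.Module):
    """One step of a factored LSTM: the input-to-hidden weight is U @ S[style] @ V,
    with U and V shared across styles and S[style] specific to each style."""
    def __init__(self, embed_dim, hidden_dim, factor_dim, num_styles):
        super().__init__()
        self.U = nn.Parameter(0.01 * torch.randn(4 * hidden_dim, factor_dim))
        self.V = nn.Parameter(0.01 * torch.randn(factor_dim, embed_dim))
        self.S = nn.Parameter(0.01 * torch.randn(num_styles, factor_dim, factor_dim))
        self.W_h = nn.Linear(hidden_dim, 4 * hidden_dim)   # shared hidden-to-hidden weights

    def forward(self, x, h, c, style):
        W_x = self.U @ self.S[style] @ self.V              # style-dependent input weights
        gates = x @ W_x.t() + self.W_h(h)
        i, f, o, g = gates.chunk(4, dim=1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c_next = f * c + i * g                             # cell state update
        h_next = o * torch.tanh(c_next)
        return h_next, c_next

# Example step: batch of 2, embedding 256, hidden 512, factor 256, 4 styles.
cell = FactoredLSTMCell(256, 512, 256, 4)
x, h, c = torch.randn(2, 256), torch.zeros(2, 512), torch.zeros(2, 512)
h, c = cell(x, h, c, style=1)   # style index 1 could stand for "happy", for instance
```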
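
Automatic evaluation in the abstract is reported as BLEU-4. One common way to compute such a score is NLTK's corpus-level BLEU over tokenised hypotheses and references, as in the sketch below; the Indonesian sentences are made-up examples, not data from the thesis.

```python
from nltk.translate.bleu_score import corpus_bleu

# One hypothesis with one reference, both tokenised; the sentences are illustrative only.
references = [[["seorang", "anak", "bermain", "bola", "di", "lapangan"]]]
hypotheses = [["anak", "bermain", "bola", "di", "lapangan"]]

bleu4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25))
print(f"BLEU-4: {bleu4:.2f}")
```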