EMOTION-BASED IMAGE CAPTIONING WITH INJECTION METHOD

Bibliographic Details
Main Author: Wibisono Haryadi, Husnulzaki
Format: Final Project
Language: Indonesian
Online Access:https://digilib.itb.ac.id/gdl/view/43725
Institution: Institut Teknologi Bandung
Description
Summary: While the use of deep learning for automated image caption generation has become common practice, captions generated by machines are still considered less attractive than those written by humans. The main reason is that generated captions often lack non-factual aspects, such as emotional or sentimental nuance, that are strongly embedded in humans' day-to-day communication. A similar situation occurs in Indonesia, where the absence of a proper image caption dataset in the Indonesian language is the major obstacle to research and development on related topics.

To address this situation, we propose a deep learning architecture that is able to generate attractive captions in Indonesian by imbuing the generated sentences with an emotional aspect. To achieve this, we rewrote the labels of Flickr 8K and Flickr 10K (Gan et al. 2017) in Indonesian and added new captions imbued with happy, sad, and angry emotions. We also adopt the encoder-decoder framework (Vinyals et al. 2015), which has proven to be the state of the art in the image captioning domain, and adapt the concept of the sentiment cell (You et al. 2018), which has been shown to be successful in processing sentiment in image captions.

The model we built accepts images as input and returns captions for the respective images as output. ResNet-152 (He et al. 2015), acting as the encoder, extracts visual vectors from the images before passing them to the decoder. The decoder is implemented by modifying the number of sentiment states in the LSTM module of the sentiment cell. The experiment results show that models which handle multiple emotions tend to produce low-quality text, while a model that maintains only a single emotion can produce consistent results even on an imbalanced dataset.
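
To make the described pipeline concrete, the sketch below shows one plausible way the injection method could be realized in PyTorch: a ResNet-152 encoder produces a visual vector, and the decoder receives that vector plus an emotion embedding as the first input step of an LSTM. This is only an illustrative assumption based on the abstract; the class names, the emotion-embedding injection point, and all sizes are hypothetical and do not reproduce the author's actual implementation or the sentiment-cell modification described in the thesis.

```python
# Minimal sketch (assumption): an encoder-decoder captioner where an emotion
# embedding is injected into the decoder together with the visual vector.
# Names and hyperparameters are illustrative, not the thesis code.
import torch
import torch.nn as nn
import torchvision.models as models

class Encoder(nn.Module):
    """ResNet-152 backbone that outputs a fixed-size visual vector."""
    def __init__(self, embed_size=512):
        super().__init__()
        resnet = models.resnet152(weights=models.ResNet152_Weights.DEFAULT)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop final fc
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):                     # images: (B, 3, H, W)
        with torch.no_grad():                      # keep the pretrained CNN frozen
            feats = self.backbone(images).flatten(1)
        return self.fc(feats)                      # (B, embed_size)

class EmotionDecoder(nn.Module):
    """LSTM decoder; the emotion label is injected alongside the visual vector."""
    def __init__(self, vocab_size, embed_size=512, hidden_size=512, num_emotions=4):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, embed_size)
        self.emotion_emb = nn.Embedding(num_emotions, embed_size)  # e.g. neutral/happy/sad/angry
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, visual, emotion, captions):
        # Injection method: the visual vector plus the emotion embedding is fed
        # to the LSTM as its first "word", before the caption tokens.
        inject = (visual + self.emotion_emb(emotion)).unsqueeze(1)   # (B, 1, E)
        words = self.word_emb(captions)                              # (B, T, E)
        inputs = torch.cat([inject, words], dim=1)                   # (B, T+1, E)
        hidden, _ = self.lstm(inputs)
        return self.fc(hidden)                                       # (B, T+1, vocab)
```

In this reading of the abstract, the image (and here the emotion) information enters the decoder as its initial input rather than being merged after language modelling, which is consistent with the encoder-decoder framework of Vinyals et al. (2015); the thesis itself modifies the sentiment states inside the LSTM cell, a detail not reproduced in this sketch.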