Image captioning via semantic element embedding

Image captioning approaches that use global Convolutional Neural Network (CNN) features cannot represent and describe all the important elements of complex scenes. In this paper, we propose semantic element embedding to enrich the semantic representation of images and update the language model. For semantic element discovery, an object detection module predicts regions of the image, and a captioning model based on Long Short-Term Memory (LSTM) generates local descriptions for these regions. The predicted descriptions and categories are used to build a semantic feature that not only contains detailed information but also shares a word space with the descriptions, thus bridging the modality gap between visual images and semantic captions. We further integrate the CNN feature with the semantic feature in the proposed Element Embedding LSTM (EE-LSTM) model to predict a language description. Experiments on the MS COCO dataset demonstrate that the proposed approach outperforms conventional captioning methods and can be flexibly combined with baseline models to achieve superior performance. (C) 2019 Published by Elsevier B.V.
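
This record carries no implementation details beyond the abstract, so the following is only a minimal PyTorch sketch of the fusion idea it describes: word embeddings of the region descriptions and detected category names are pooled into a semantic feature that lives in the caption word space, concatenated with the global CNN feature, and fed to an LSTM decoder. The class name, dimensions, mean-pooling, and single-step fusion scheme are all assumptions for illustration, not the paper's actual EE-LSTM.

    # Illustrative sketch only; every design choice below is an assumption,
    # not the EE-LSTM architecture from the paper.
    import torch
    import torch.nn as nn

    class EELSTMSketch(nn.Module):
        """Caption decoder conditioned on the concatenation of a global
        CNN feature and a pooled semantic element feature."""

        def __init__(self, vocab_size, embed_dim=512, cnn_dim=2048,
                     sem_dim=300, hidden_dim=512):
            super().__init__()
            self.word_embed = nn.Embedding(vocab_size, embed_dim)
            # Fuse the two modalities into the decoder's first input step.
            self.fuse = nn.Linear(cnn_dim + sem_dim, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def semantic_feature(self, word_vectors):
            # word_vectors: (B, N, sem_dim) embeddings of the words in the
            # region descriptions and category names; mean-pooling keeps the
            # result in the same word space as the captions.
            return word_vectors.mean(dim=1)

        def forward(self, cnn_feat, word_vectors, captions):
            # cnn_feat: (B, cnn_dim) global image feature from a CNN backbone.
            # captions: (B, T) token ids for teacher-forced training.
            sem_feat = self.semantic_feature(word_vectors)          # (B, sem_dim)
            fused = self.fuse(torch.cat([cnn_feat, sem_feat], 1))   # (B, embed_dim)
            words = self.word_embed(captions)                       # (B, T, embed_dim)
            inputs = torch.cat([fused.unsqueeze(1), words], dim=1)  # image step + words
            hidden, _ = self.lstm(inputs)
            return self.out(hidden)                                 # (B, T+1, vocab)

    # Toy usage with random tensors:
    model = EELSTMSketch(vocab_size=10000)
    logits = model(torch.randn(2, 2048), torch.randn(2, 7, 300),
                   torch.randint(0, 10000, (2, 12)))
    print(logits.shape)  # torch.Size([2, 13, 10000])

Feeding the fused feature as the decoder's first time step mirrors the common "show and tell" convention; the paper may instead inject the semantic feature at every step.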

Bibliographic Details
Main Authors: ZHANG, Xiaodan; HE, Shengfeng; SONG, Xinhang; LAU, Rynson W.H.; JIAO, Jianbin; YE, Qixiang
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University, 2020
Subjects: Image captioning; Element embedding; CNN; LSTM; Information Security
Online Access: https://ink.library.smu.edu.sg/sis_research/7863
DOI: 10.1016/j.neucom.2018.02.112
Collection: Research Collection School Of Computing and Information Systems
Institution: Singapore Management University