Context-aware visual policy network for fine-grained image captioning
With the maturity of visual detection techniques, we are more ambitious in describing visual content with open-vocabulary, fine-grained and free-form language, i.e., the task of image captioning. In particular, we are interested in generating longer, richer and more fine-grained sentences and paragraphs as image descriptions.
Main Authors: Zha, Zheng-Jun; Liu, Daqing; Zhang, Hanwang; Zhang, Yongdong; Wu, Feng
Other Authors: School of Computer Science and Engineering
Format: Article
Language: English
Published: 2022
Subjects: Engineering::Computer science and engineering; Image Captioning; Reinforcement Learning
Online Access: https://hdl.handle.net/10356/162628
Institution: Nanyang Technological University
id: sg-ntu-dr.10356-162628
record_format: dspace
spelling:
sg-ntu-dr.10356-162628 2022-11-01T06:41:53Z
Context-aware visual policy network for fine-grained image captioning
Zha, Zheng-Jun; Liu, Daqing; Zhang, Hanwang; Zhang, Yongdong; Wu, Feng
School of Computer Science and Engineering
Engineering::Computer science and engineering; Image Captioning; Reinforcement Learning
With the maturity of visual detection techniques, we are more ambitious in describing visual content with open-vocabulary, fine-grained and free-form language, i.e., the task of image captioning. In particular, we are interested in generating longer, richer and more fine-grained sentences and paragraphs as image descriptions. Image captioning can be translated into the task of sequential language prediction given visual content, where the output sequence forms a natural language description with plausible grammar. However, existing image captioning methods focus only on the language policy and not the visual policy, and thus fail to capture the visual context that is crucial for compositional reasoning such as object relationships (e.g., "man riding horse") and visual comparisons (e.g., "small(er) cat"). This issue is especially severe when generating longer sequences such as a paragraph. To fill the gap, we propose a Context-Aware Visual Policy network (CAVP) for fine-grained image-to-language generation: image sentence captioning and image paragraph captioning. During captioning, CAVP explicitly considers the previous visual attentions as context and decides whether the context should be used for the current word/sentence generation, given the current visual attention. Compared with the traditional visual attention mechanism, which fixes only a single visual region at each step, CAVP can attend to complex visual compositions over time. The whole image captioning model, i.e., CAVP and its subsequent language policy network, can be efficiently optimized end-to-end using an actor-critic policy gradient method. We have demonstrated the effectiveness of CAVP through state-of-the-art performance on the MS-COCO and Stanford captioning datasets, using various metrics and sensible visualizations of qualitative visual context.
This work was supported by the National Key R&D Program of China under Grant 2017YFB1300201, the National Natural Science Foundation of China (NSFC) under Grants 61622211, 61620106009 and 61525206, as well as the Fundamental Research Funds for the Central Universities under Grant WK2100100030.
2022-11-01T06:41:53Z 2022-11-01T06:41:53Z 2019 Journal Article
Zha, Z., Liu, D., Zhang, H., Zhang, Y. & Wu, F. (2019). Context-aware visual policy network for fine-grained image captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(2), 710-722. https://dx.doi.org/10.1109/TPAMI.2019.2909864
ISSN: 0162-8828; Handle: https://hdl.handle.net/10356/162628; DOI: 10.1109/TPAMI.2019.2909864; PMID: 30969916; Scopus: 2-s2.0-85122829367
Volume 44, Issue 2, Pages 710-722; Language: en
IEEE Transactions on Pattern Analysis and Machine Intelligence
© 2019 IEEE. All rights reserved.
institution: Nanyang Technological University
building: NTU Library
continent: Asia
country: Singapore
content_provider: NTU Library
collection: DR-NTU
language: English
topic: Engineering::Computer science and engineering; Image Captioning; Reinforcement Learning
spellingShingle: Engineering::Computer science and engineering; Image Captioning; Reinforcement Learning; Zha, Zheng-Jun; Liu, Daqing; Zhang, Hanwang; Zhang, Yongdong; Wu, Feng; Context-aware visual policy network for fine-grained image captioning
description:
With the maturity of visual detection techniques, we are more ambitious in describing visual content with open-vocabulary, fine-grained and free-form language, i.e., the task of image captioning. In particular, we are interested in generating longer, richer and more fine-grained sentences and paragraphs as image descriptions. Image captioning can be translated into the task of sequential language prediction given visual content, where the output sequence forms a natural language description with plausible grammar. However, existing image captioning methods focus only on the language policy and not the visual policy, and thus fail to capture the visual context that is crucial for compositional reasoning such as object relationships (e.g., "man riding horse") and visual comparisons (e.g., "small(er) cat"). This issue is especially severe when generating longer sequences such as a paragraph. To fill the gap, we propose a Context-Aware Visual Policy network (CAVP) for fine-grained image-to-language generation: image sentence captioning and image paragraph captioning. During captioning, CAVP explicitly considers the previous visual attentions as context and decides whether the context should be used for the current word/sentence generation, given the current visual attention. Compared with the traditional visual attention mechanism, which fixes only a single visual region at each step, CAVP can attend to complex visual compositions over time. The whole image captioning model, i.e., CAVP and its subsequent language policy network, can be efficiently optimized end-to-end using an actor-critic policy gradient method. We have demonstrated the effectiveness of CAVP through state-of-the-art performance on the MS-COCO and Stanford captioning datasets, using various metrics and sensible visualizations of qualitative visual context.
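The abstract describes two technical ingredients: a visual policy that treats previously attended visual regions as context and gates how much of that context feeds the current word/sentence decision, and end-to-end optimization with an actor-critic policy gradient. Below is a minimal sketch of one such context-gated decoding step, assuming PyTorch; the module names, dimensions, gating form, and the running-average context update are illustrative assumptions for this record, not the authors' released implementation.

```python
# Minimal sketch of one CAVP-style decoding step (illustrative, not the
# authors' code): attend over region features, then gate accumulated
# visual context from previous steps into the current visual input.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextAwareVisualPolicy(nn.Module):
    def __init__(self, feat_dim=2048, hid_dim=512, vocab_size=10000):
        super().__init__()
        self.att_proj = nn.Linear(feat_dim, hid_dim)
        self.hid_proj = nn.Linear(hid_dim, hid_dim)
        self.att_score = nn.Linear(hid_dim, 1)
        # Gate deciding context-vs-current mix (an assumed, simple form).
        self.gate = nn.Linear(hid_dim + 2 * feat_dim, 1)
        self.lstm = nn.LSTMCell(feat_dim, hid_dim)  # simplified language policy
        self.word_head = nn.Linear(hid_dim, vocab_size)

    def forward(self, regions, h, c, context):
        # regions: (B, R, feat_dim) detected region features
        # h, c:    (B, hid_dim) language-policy LSTM state
        # context: (B, feat_dim) running summary of past visual attentions
        scores = self.att_score(torch.tanh(
            self.att_proj(regions) + self.hid_proj(h).unsqueeze(1)))  # (B, R, 1)
        alpha = F.softmax(scores, dim=1)
        current = (alpha * regions).sum(dim=1)                        # (B, feat_dim)

        # Visual policy: decide whether past context matters for this step.
        g = torch.sigmoid(self.gate(torch.cat([h, current, context], dim=-1)))
        visual = g * context + (1.0 - g) * current

        h, c = self.lstm(visual, (h, c))
        logits = self.word_head(h)                                    # next-word scores

        # Fold this step's attention into the context for future steps.
        new_context = 0.5 * (context + current)
        return logits, h, c, new_context

# Usage sketch with random features.
B, R = 2, 36
model = ContextAwareVisualPolicy()
regions = torch.randn(B, R, 2048)
h = torch.zeros(B, 512)
c = torch.zeros(B, 512)
context = torch.zeros(B, 2048)
logits, h, c, context = model(regions, h, c, context)
word = logits.argmax(dim=-1)  # greedy next-word choice
```

Note that, per the abstract, the paper optimizes the whole model at the sequence level with an actor-critic policy gradient (reward computed on the completed caption) rather than the per-word greedy/cross-entropy decoding this forward-step sketch alone would suggest.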
author2: School of Computer Science and Engineering
author_facet: School of Computer Science and Engineering; Zha, Zheng-Jun; Liu, Daqing; Zhang, Hanwang; Zhang, Yongdong; Wu, Feng
format: Article
author: Zha, Zheng-Jun; Liu, Daqing; Zhang, Hanwang; Zhang, Yongdong; Wu, Feng
author_sort: Zha, Zheng-Jun
title: Context-aware visual policy network for fine-grained image captioning
title_short: Context-aware visual policy network for fine-grained image captioning
title_full: Context-aware visual policy network for fine-grained image captioning
title_fullStr: Context-aware visual policy network for fine-grained image captioning
title_full_unstemmed: Context-aware visual policy network for fine-grained image captioning
title_sort: context-aware visual policy network for fine-grained image captioning
publishDate: 2022
url: https://hdl.handle.net/10356/162628
_version_: 1749179158442803200