SegEQA : video segmentation based visual attention for embodied question answering

Embodied Question Answering (EQA) is a newly defined research area in which an agent must answer a user's questions by exploring a real-world environment. It has attracted increasing research interest owing to its broad applications in autonomous driving systems, in-home robots, and personal assistants. Most existing methods perform poorly in terms of answering and navigation accuracy because they lack local details and are vulnerable to the ambiguity caused by complicated vision conditions. To tackle these problems, we propose a segmentation-based visual attention mechanism for Embodied Question Answering. First, we extract local semantic features by introducing a novel high-speed video segmentation framework. Then, guided by the extracted semantic features, a bottom-up visual attention mechanism is proposed for the Visual Question Answering (VQA) sub-task. Further, a feature fusion strategy is proposed to guide the training of the navigator without much additional computational cost. Ablation experiments show that our method boosts the performance of the VQA module by 4.2% (68.99% vs. 64.73%) and yields a 3.6% (48.59% vs. 44.98%) overall improvement in EQA accuracy.
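The abstract describes a pipeline in which a fast segmentation branch supplies local semantic features that steer bottom-up attention for the VQA sub-task. As a rough illustration only, here is a minimal PyTorch sketch of one way segmentation features could guide attention over frame features; the module names, dimensions, and dot-product scoring are assumptions for illustration, not details taken from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SegGuidedAttention(nn.Module):
    """Hypothetical sketch: semantic features act as attention keys
    that a question embedding queries to pool visual features."""

    def __init__(self, vis_dim=512, seg_dim=64, q_dim=256):
        super().__init__()
        # Project segmentation features into the visual feature space
        # so they can serve as attention keys.
        self.seg_proj = nn.Conv2d(seg_dim, vis_dim, kernel_size=1)
        self.query_proj = nn.Linear(q_dim, vis_dim)

    def forward(self, vis_feat, seg_feat, q_emb):
        # vis_feat: (B, C, H, W) frame features from a CNN backbone
        # seg_feat: (B, S, H, W) per-pixel semantic features from the
        #           segmentation branch
        # q_emb:    (B, Q) encoded question
        keys = self.seg_proj(seg_feat)              # (B, C, H, W)
        query = self.query_proj(q_emb)              # (B, C)
        # Dot-product score between the question and each location's
        # semantic key, then softmax over all spatial positions.
        scores = torch.einsum('bchw,bc->bhw', keys, query)
        attn = F.softmax(scores.flatten(1), dim=1).view_as(scores)
        # Attention-weighted pooling of the visual features.
        attended = torch.einsum('bchw,bhw->bc', vis_feat, attn)
        return attended                              # (B, C)

if __name__ == "__main__":
    module = SegGuidedAttention()
    v = torch.randn(2, 512, 7, 7)
    s = torch.randn(2, 64, 7, 7)
    q = torch.randn(2, 256)
    print(module(v, s, q).shape)  # torch.Size([2, 512])

In such a design, the attended feature would then be fused with the question embedding and fed to an answer classifier; the actual fusion strategy and navigator guidance in the paper are not reproduced here.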

Bibliographic Details
Main Authors: Luo, Haonan, Lin, Guosheng, Liu, Zichuan, Liu, Fayao, Tang, Zhenmin, Yao, Yazhou
Other Authors: School of Computer Science and Engineering
Format: Conference or Workshop Item
Language: English
Published: 2020
Subjects: Engineering::Computer science and engineering; Computer Vision; Image Fusion
Online Access:https://hdl.handle.net/10356/144345
Institution: Nanyang Technological University
Conference: International Conference on Computer Vision (ICCV) 2019
Citation: Luo, H., Lin, G., Liu, Z., Liu, F., Tang, Z., & Yao, Y. (2019). SegEQA : video segmentation based visual attention for embodied question answering. Proceedings of the International Conference on Computer Vision (ICCV) 2019. doi:10.1109/ICCV.2019.00976
DOI: 10.1109/ICCV.2019.00976
Version: Accepted version
File Format: application/pdf
Funding Agencies: AI Singapore; Ministry of Education (MOE); National Research Foundation (NRF)
Grants: AISG-RP-2018-003; RG126/17 (S)
Acknowledgements: The authors would like to thank the China Scholarship Council (No. 201806840059) for financial support. This work is partly supported by the National Research Foundation Singapore under its AI Singapore Programme [AISG-RP-2018-003] and the MOE Tier-1 research grant [RG126/17 (S)]. We would like to thank NVIDIA for the GPU donation. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation, Singapore.
Rights: © 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The published version is available at: https://doi.org/10.1109/ICCV.2019.00976