Contextual object detection with multimodal large language models
Recent Multimodal Large Language Models (MLLMs) are remarkable in vision-language tasks, such as image captioning and question answering, but lack the essential perception ability, i.e., object detection. In this work, we address this limitation by introducing a novel research problem of contextual object detection—understanding visible objects within different human-AI interactive contexts. Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering. Moreover, we present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts, so as to locate, identify, and associate visual objects with language inputs for human-AI interaction. Our ContextDET involves three key submodels: (i) a visual encoder for extracting visual representations, (ii) a pre-trained LLM for multimodal context decoding, and (iii) a visual decoder for predicting bounding boxes given contextual object words. The new generate-then-detect framework enables us to detect object words within human vocabulary. Extensive experiments show the advantages of ContextDET on our proposed CODE benchmark, open-vocabulary detection, and referring image segmentation.
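The abstract describes a three-submodel architecture arranged as a generate-then-detect pipeline: object words are generated by the LLM first, then grounded with bounding boxes by a visual decoder. The sketch below illustrates only that data flow; all module choices, names, and dimensions are illustrative assumptions, with toy layers standing in for the pre-trained components, not the authors' released implementation.

```python
# Hypothetical sketch of the generate-then-detect flow described in the
# abstract. Every module and dimension here is an illustrative assumption.
import torch
import torch.nn as nn

class ContextDETSketch(nn.Module):
    def __init__(self, vocab_size=32000, dim=256):
        super().__init__()
        # (i) visual encoder: a patchify conv standing in for a pre-trained backbone
        self.visual_encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        # (ii) "LLM": one decoder layer standing in for a pre-trained language model
        self.context_decoder = nn.TransformerDecoderLayer(
            d_model=dim, nhead=8, batch_first=True)
        self.lm_head = nn.Linear(dim, vocab_size)  # generate: object-word logits
        # (iii) visual decoder: boxes conditioned on contextual object-word tokens
        self.box_head = nn.Linear(dim, 4)          # detect: (cx, cy, w, h)

    def forward(self, image, text_embeddings):
        # Visual tokens: (B, dim, H/16, W/16) -> (B, N, dim)
        vis = self.visual_encoder(image).flatten(2).transpose(1, 2)
        # Multimodal context decoding: language tokens attend to visual tokens
        ctx = self.context_decoder(text_embeddings, vis)
        word_logits = self.lm_head(ctx)        # first generate object words...
        boxes = self.box_head(ctx).sigmoid()   # ...then predict a box per word
        return word_logits, boxes

model = ContextDETSketch()
img = torch.randn(1, 3, 224, 224)
txt = torch.randn(1, 8, 256)  # embeddings of a cloze prompt / partial caption
logits, boxes = model(img, txt)
print(logits.shape, boxes.shape)  # (1, 8, 32000), (1, 8, 4)
```

Generating the word before the box is what lets such a pipeline name objects from the full human vocabulary, rather than classifying against a fixed detector label set.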
Main Authors: Zang, Yuhang; Li, Wei; Han, Jun; Zhou, Kaiyang; Loy, Chen Change
Other Authors: College of Computing and Data Science
Format: Article
Language: English
Published: 2024
Subjects: Computer and Information Science; Image segmentation; Object detection
Online Access: https://hdl.handle.net/10356/181063
Institution: Nanyang Technological University
Record ID: sg-ntu-dr.10356-181063
Journal: International Journal of Computer Vision
ISSN: 0920-5691
DOI: 10.1007/s11263-024-02214-4
Scopus ID: 2-s2.0-85201827041
Citation: Zang, Y., Li, W., Han, J., Zhou, K. & Loy, C. C. (2024). Contextual object detection with multimodal large language models. International Journal of Computer Vision. https://dx.doi.org/10.1007/s11263-024-02214-4
Date Available: 2024-11-13
Funding: Agency for Science, Technology and Research (A*STAR); Ministry of Education (MOE); Nanyang Technological University. This study is supported under the RIE2020 Industry Alignment Fund - Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s). It is also partly supported by the NTU NAP grant and Singapore MOE AcRF Tier 2 (MOE-T2EP20120-0001).
Grant IDs: IAF-ICP; NTU NAP; MOE-T2EP20120-0001
Rights: © 2024 The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
Content Provider: NTU Library, Singapore
Collection: DR-NTU