Simple image-level classification improves open-vocabulary object detection

Open-Vocabulary Object Detection (OVOD) aims to detect novel objects beyond a given set of base categories on which the detection model is trained. Recent OVOD methods focus on adapting the image-level pre-trained vision-language models (VLMs), such as CLIP, to a region-level object detection task v...

Bibliographic Details
Main Authors: FANG, Ruohuan, PANG, Guansong, BAI, Xiao
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2024
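Subjects: Open-Vocabulary Object Detection (OVOD); Detection model; Novel objects; Databases and Information Systems; Graphics and Human Computer Interfaces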
Online Access: https://ink.library.smu.edu.sg/sis_research/8744
https://ink.library.smu.edu.sg/context/sis_research/article/9747/viewcontent/27939_Article_Text_31993_1_2_20240324.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-9747
record_format dspace
spelling sg-smu-ink.sis_research-9747
2024-05-03T07:49:19Z
2024-02-01T08:00:00Z
text
application/pdf
info:doi/10.1609/aaai.v38i2.27939
http://creativecommons.org/licenses/by-nc-nd/4.0/
Research Collection School Of Computing and Information Systems
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Open-Vocabulary Object Detection (OVOD)
Detection model
Novel objects
Databases and Information Systems
Graphics and Human Computer Interfaces
description Open-Vocabulary Object Detection (OVOD) aims to detect novel objects beyond a given set of base categories on which the detection model is trained. Recent OVOD methods focus on adapting the image-level pre-trained vision-language models (VLMs), such as CLIP, to a region-level object detection task via, e.g., region-level knowledge distillation, regional prompt learning, or region-text pre-training, to expand the detection vocabulary. These methods have demonstrated remarkable performance in recognizing regional visual concepts, but they are weak in exploiting the VLMs' powerful global scene understanding ability learned from the billion-scale image-level text descriptions. This limits their capability in detecting hard objects of small, blurred, or occluded appearance from novel/base categories, whose detection heavily relies on contextual information. To address this, we propose a novel approach, namely Simple Image-level Classification for Context-Aware Detection Scoring (SIC-CADS), to leverage the superior global knowledge yielded from CLIP for complementing the current OVOD models from a global perspective. The core of SIC-CADS is a multi-modal multi-label recognition (MLR) module that learns the object co-occurrence-based contextual information from CLIP to recognize all possible object categories in the scene. These image-level MLR scores can then be utilized to refine the instance-level detection scores of the current OVOD models in detecting those hard objects. This is verified by extensive empirical results on two popular benchmarks, OV-LVIS and OV-COCO, which show that SIC-CADS achieves significant and consistent improvement when combined with different types of OVOD models. Further, SIC-CADS also improves the cross-dataset generalization ability on Objects365 and OpenImages.
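
The score-refinement idea in the description can be illustrated with a minimal sketch. The description does not state the fusion rule, so the weighted geometric mean used here is an assumption, and the function name `refine_detection_scores` and the weight `gamma` are hypothetical names introduced only for illustration. The sketch takes per-box class scores from a detector and one image-level multi-label score per class, and returns per-box scores that are pulled toward classes the image-level classifier believes are present in the scene.

```python
def refine_detection_scores(det_scores, mlr_scores, gamma=0.5):
    """Fuse instance-level detection scores with image-level MLR scores.

    det_scores: list of boxes, each a list of per-class scores in [0, 1].
    mlr_scores: list with one image-level score per class, in [0, 1].
    gamma: hypothetical weight on the image-level evidence (assumption);
           0 keeps the detector scores, 1 uses only the image-level scores.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    refined = []
    for box_scores in det_scores:
        if len(box_scores) != len(mlr_scores):
            raise ValueError("each box needs one score per class")
        # Weighted geometric mean of the two scores for every class
        # (an assumed fusion rule, not taken from the description).
        refined.append([
            d ** (1.0 - gamma) * m ** gamma
            for d, m in zip(box_scores, mlr_scores)
        ])
    return refined


if __name__ == "__main__":
    # One ambiguous box over two classes; the image-level scores favor class 0.
    boxes = [[0.4, 0.5]]
    image_level = [0.9, 0.2]
    print(refine_detection_scores(boxes, image_level))
```

In this toy case the detector alone slightly prefers the second class, while the fused scores prefer the first, which is the kind of context-driven correction the description attributes to the image-level scores for small, blurred, or occluded objects.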
format text
author FANG, Ruohuan
PANG, Guansong
BAI, Xiao
title Simple image-level classification improves open-vocabulary object detection
publisher Institutional Knowledge at Singapore Management University
publishDate 2024
url https://ink.library.smu.edu.sg/sis_research/8744
https://ink.library.smu.edu.sg/context/sis_research/article/9747/viewcontent/27939_Article_Text_31993_1_2_20240324.pdf
_version_ 1814047499775639552