Real-world object detection
Main Author: | Zang, Yuhang |
---|---|
Other Authors: | Chen Change Loy |
Format: | Thesis-Doctor of Philosophy |
Language: | English |
Published: | Nanyang Technological University, 2023 |
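Subjects: | Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence; Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision |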
Online Access: | https://hdl.handle.net/10356/171489 |
Institution: | Nanyang Technological University |
id |
sg-ntu-dr.10356-171489 |
---|---|
record_format |
dspace |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence; Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision |
description |
Object detection is a fundamental computer vision task that estimates object classification labels and location coordinates in images. Previous studies have consistently boosted the performance of object detectors. However, real-world scenarios introduce significant obstacles that hinder their overall effectiveness. In this thesis, we concentrate on two such challenges. The first challenge is the long-tailed data distribution: real-world data often exhibits a significant imbalance in the number of images per category. Directly training an object detector on long-tailed data can introduce a bias toward head class objects, causing the omission of tail class objects. The second challenge is generalizing to test samples from unseen classes that are not included in the training set. Detectors frequently make inaccurate classification predictions for objects from these unseen classes, including misclassifying them as background or as known categories.
This thesis explores solutions to these two challenges. For the long-tailed problem, we concentrate on two approaches: data augmentation (FASA) and semi-supervised learning (CascadeMatch). To enhance the detector's generalization ability, we investigate leveraging prior knowledge from Vision and Language Models (OV-DETR, UPT) or Multimodal Large Language Models (ContextDET).
We first propose a simple yet effective method, Feature Augmentation and Sampling Adaptation (FASA), that addresses the long-tailed issue by augmenting the feature space, especially for rare classes. FASA does not require any elaborate loss design and removes the need for inter-class transfer learning, which often involves large costs and manually defined head/tail class groups. We show that FASA is a fast, generic method that can be easily plugged into standard or long-tailed segmentation frameworks, with consistent performance gains and little added cost.
Second, we propose CascadeMatch, a novel pseudo-labeling-based object detector that uses semi-supervised learning to tackle the long-tailed problem. CascadeMatch features a cascade network architecture that consists of multi-stage detection heads with incremental confidence thresholds. To avoid confirmation bias, each detection head is trained with the ensemble pseudo-labels of all detection heads. Class imbalance in real-world data causes neural networks to give higher confidence to many-shot classes and lower confidence to few-shot classes. To account for this, we propose class-specific self-adaptive confidence thresholds, which are automatically tuned from labeled data with minimal human intervention.
Third, to achieve generalization to unseen classes during testing, we propose a novel open-vocabulary detector called OV-DETR. Once trained, OV-DETR can detect any object given its class name or an exemplar image. For training, we condition the Transformer decoder on the input embeddings obtained from a pre-trained vision-language model such as CLIP, in order to enable matching for both text and image queries. With extensive experiments on the LVIS and COCO datasets, we demonstrate that OV-DETR achieves non-trivial improvements over the baseline methods.
Fourth, we present a systematic study of unimodal prompt tuning methods, which serve as popular transfer learning paradigms for vision-language models such as CLIP. A major finding is that none of the unimodal prompt tuning methods performs consistently well: text prompt tuning fails on data with high intra-class visual variances, while visual prompt tuning cannot handle low inter-class variances. To combine the best of both worlds, we propose a conceptually simple approach called Unified Prompt Tuning (UPT), which learns a tiny neural network to jointly optimize prompts across different modalities. Extensive experiments on over 11 vision datasets show that UPT achieves a better trade-off than the unimodal counterparts on existing benchmarks.
Finally, we introduce a novel research problem, contextual object detection: understanding visible objects within different human-AI interactive contexts. Three representative scenarios are investigated: the language cloze test, visual captioning, and question answering. Moreover, we present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts, so as to locate, identify, and associate visual objects with language inputs for human-AI interaction. Extensive experiments show the advantages of ContextDET on a series of tasks, including our proposed contextual object detection, open-vocabulary detection, and referring image segmentation. |
author2 |
Chen Change Loy |
author_facet |
Chen Change Loy; Zang, Yuhang |
format |
Thesis-Doctor of Philosophy |
author |
Zang, Yuhang |
author_sort |
Zang, Yuhang |
title |
Real-world object detection |
publisher |
Nanyang Technological University |
publishDate |
2023 |
url |
https://hdl.handle.net/10356/171489 |
_version_ |
1781793793039663104 |
spelling |
sg-ntu-dr.10356-171489; 2023-11-02T02:20:48Z; Real-world object detection; Zang, Yuhang; Chen Change Loy; School of Computer Science and Engineering; ccloy@ntu.edu.sg; Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence; Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision; Doctor of Philosophy; 2023-10-27T06:26:32Z; 2023-10-27T06:26:32Z; 2023; Thesis-Doctor of Philosophy; Zang, Y. (2023). Real-world object detection. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/171489; 10.32657/10356/171489; en; NTU NAP; RIE2020 Industry Alignment Fund Industry Collaboration Projects (IAF-ICP); Singapore MOE AcRF Tier 2 (MOE-T2EP20120-0001); This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).; application/pdf; Nanyang Technological University |
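
The description says that CascadeMatch uses class-specific confidence thresholds that are tuned from labeled data and that increase from one detection head to the next. The record does not state the tuning rule, so the following is only a minimal sketch under stated assumptions: the function names, the mean-plus-scaled-deviation rule, and the `step`, `floor`, and `ceil` parameters are hypothetical and are not taken from the thesis.

```python
from collections import defaultdict
from statistics import mean, pstdev


def class_thresholds(scores, labels, num_stages=3, step=0.5, floor=0.05, ceil=0.95):
    """Derive one confidence threshold per class and per cascade stage.

    Assumed rule (illustrative only): threshold = mean + k * step * std of the
    confidences observed for that class on labeled data, where k is the stage
    index, clamped to [floor, ceil].
    """
    per_class = defaultdict(list)
    for score, label in zip(scores, labels):
        per_class[label].append(score)

    thresholds = {}
    for label, values in per_class.items():
        mu = mean(values)
        sigma = pstdev(values)  # population std; 0.0 when all scores are equal
        thresholds[label] = [
            min(ceil, max(floor, mu + k * step * sigma)) for k in range(num_stages)
        ]
    return thresholds


def filter_pseudo_labels(predictions, thresholds, stage):
    """Keep (label, score) predictions that pass their class threshold at a stage."""
    return [
        (label, score)
        for label, score in predictions
        if label in thresholds and score >= thresholds[label][stage]
    ]


# Toy labeled-data confidences: a frequent class scores high, a rare class low.
scores = [0.9, 0.8, 0.3, 0.2]
labels = ["dog", "dog", "lemur", "lemur"]
th = class_thresholds(scores, labels)

# The same raw score passes for the rare class but not for the frequent one.
kept = filter_pseudo_labels([("dog", 0.5), ("lemur", 0.5)], th, stage=0)
print(th)
print(kept)
```

With a single global threshold, the low-confidence rare class in this toy data would lose most of its pseudo-labels; per-class thresholds derived from each class's own confidence statistics avoid that, which is the motivation the description gives.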