Real-world object detection

Object detection is a fundamental computer vision task that estimates object classification labels and location coordinates in images. Previous studies have steadily improved the performance of object detectors, but real-world scenarios introduce significant obstacles that hinder their overall effectiveness.


Bibliographic Details
Main Author: Zang, Yuhang
Other Authors: Chen Change Loy
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2023
Subjects: Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence; Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision
Online Access:https://hdl.handle.net/10356/171489
Institution: Nanyang Technological University
id sg-ntu-dr.10356-171489
record_format dspace
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision
description Object detection is a fundamental computer vision task that estimates object classification labels and location coordinates in images. Previous studies have steadily improved the performance of object detectors, but real-world scenarios introduce significant obstacles that hinder their overall effectiveness. This thesis concentrates on two such challenges. The first is the long-tailed data distribution: real-world data often exhibits a significant imbalance in the number of images per category, and directly training an object detector on long-tailed data can introduce a bias toward head-class objects, causing tail-class objects to be omitted. The second is generalizing to test samples from unseen classes that are not included in the training set; detectors frequently misclassify objects from these unseen classes, either as background or as known categories. This thesis explores solutions to both challenges. For the long-tailed problem, we concentrate on two approaches: data augmentation (FASA) and semi-supervised learning (CascadeMatch). To enhance the detector's generalization ability, we investigate leveraging prior knowledge from vision-language models (OV-DETR, UPT) or multimodal large language models (ContextDET). We first propose a simple yet effective method, Feature Augmentation and Sampling Adaptation (FASA), which addresses the long-tailed issue by augmenting the feature space, especially for rare classes. FASA requires no elaborate loss design and removes the need for inter-class transfer learning, which often involves large costs and manually defined head/tail class groups. We show that FASA is a fast, generic method that can be easily plugged into standard or long-tailed segmentation frameworks, with consistent performance gains and little added cost.
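The feature-space augmentation idea can be sketched in a few lines (a minimal illustration only, not the thesis implementation: the function names, the diagonal-Gaussian sampling, and the inverse-frequency heuristic are assumptions; FASA itself adapts the per-class sampling probabilities online from a loss signal):

```python
import random

def class_sampling_probs(class_counts, power=0.5):
    # Inverse-frequency heuristic: rarer classes receive a higher
    # probability of generating virtual features.
    inv = {c: 1.0 / (n ** power) for c, n in class_counts.items()}
    total = sum(inv.values())
    return {c: v / total for c, v in inv.items()}

def augment_features(mean, std, n_virtual, rng=random):
    # Sample virtual features around a class's running feature
    # statistics (independent Gaussian per feature dimension).
    return [[rng.gauss(m, s) for m, s in zip(mean, std)]
            for _ in range(n_virtual)]
```

For example, with counts {'cat': 1000, 'lynx': 10}, the rare class 'lynx' receives a ten-times-larger sampling probability than 'cat' under this square-root heuristic, so most virtual features are generated for the tail class.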
Second, we propose CascadeMatch, a novel pseudo-labeling-based object detector that uses semi-supervised learning to effectively tackle the long-tailed problem. CascadeMatch features a cascade network architecture consisting of multi-stage detection heads with progressively increasing confidence thresholds. To avoid confirmation bias, each detection head is trained on the ensemble pseudo-labels of all detection heads. To account for the class imbalance in real-world data, which causes neural networks to assign higher confidence to many-shot classes and lower confidence to few-shot classes, we propose class-specific self-adaptive confidence thresholds that are automatically tuned from labeled data with minimal human intervention. Third, to generalize to unseen classes at test time, we propose a novel open-vocabulary detector called OV-DETR. Once trained, OV-DETR can detect any object given its class name or an exemplar image. For training, we condition the Transformer decoder on input embeddings obtained from a pre-trained vision-language model such as CLIP, enabling matching for both text and image queries. Through extensive experiments on the LVIS and COCO datasets, we demonstrate that OV-DETR achieves non-trivial improvements over baseline methods. Fourth, we present a systematic study of unimodal prompt tuning methods, which serve as popular transfer learning paradigms for vision-language models such as CLIP. A major finding is that none of the unimodal prompt tuning methods performs consistently well: text prompt tuning fails on data with high intra-class visual variance, while visual prompt tuning cannot handle low inter-class variance. To combine the best of both worlds, we propose a conceptually simple approach called Unified Prompt Tuning (UPT), which learns a tiny neural network to jointly optimize prompts across different modalities.
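The class-specific self-adaptive thresholds can be illustrated with a toy sketch (the names and the tuning rule are hypothetical: here each class's threshold is simply the mean confidence the model assigns to its labeled examples, clipped to a floor, whereas CascadeMatch tunes its thresholds per cascade stage):

```python
def class_adaptive_thresholds(labeled_confidences, floor=0.5):
    # One threshold per class, estimated from the confidences the
    # model assigns to labeled examples of that class.
    return {c: max(floor, sum(v) / len(v))
            for c, v in labeled_confidences.items()}

def filter_pseudo_labels(predictions, thresholds):
    # Keep (class, confidence) pseudo-labels that clear their
    # class-specific bar; classes without a threshold are rejected.
    return [(c, p) for c, p in predictions if p >= thresholds.get(c, 1.1)]
```

A frequent class that the model already scores confidently ends up with a stricter bar than a rare class, so rare-class pseudo-labels are not filtered away wholesale by a single global threshold.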
Extensive experiments on 11 vision datasets show that UPT achieves a better trade-off than its unimodal counterparts on existing benchmarks. Finally, we introduce a novel research problem, contextual object detection: understanding visible objects within different human-AI interactive contexts. Three representative scenarios are investigated: the language cloze test, visual captioning, and question answering. Moreover, we present ContextDET, a unified multimodal model capable of end-to-end differentiable modeling of visual-language contexts, so as to locate, identify, and associate visual objects with language inputs for human-AI interaction. Extensive experiments show the advantages of ContextDET on a series of tasks, including our proposed contextual object detection, open-vocabulary detection, and referring image segmentation.
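At its core, the query-conditioned matching behind the open-vocabulary setting (OV-DETR's class-name text queries or exemplar-image queries) reduces to similarity between embeddings in a shared space. A simplified sketch with assumed names and a fixed similarity threshold (the actual model learns this matching inside a DETR decoder rather than thresholding cosine scores):

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    du = math.sqrt(sum(x * x for x in u))
    dv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (du * dv)

def match_queries(object_embeddings, query_embedding, threshold=0.5):
    # Indices of detected objects whose embedding aligns with the
    # conditioning query (a text embedding for a class name, or an
    # image embedding for an exemplar crop).
    return [i for i, e in enumerate(object_embeddings)
            if cosine(e, query_embedding) >= threshold]
```

Because text and image queries live in the same embedding space of a model like CLIP, the same matching routine serves both, which is what lets a trained detector accept either a class name or an exemplar image at test time.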
author2 Chen Change Loy
format Thesis-Doctor of Philosophy
author Zang, Yuhang
title Real-world object detection
publisher Nanyang Technological University
publishDate 2023
url https://hdl.handle.net/10356/171489
spelling sg-ntu-dr.10356-171489 2023-11-02T02:20:48Z Real-world object detection Zang, Yuhang Chen Change Loy School of Computer Science and Engineering ccloy@ntu.edu.sg Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision Doctor of Philosophy 2023-10-27T06:26:32Z 2023-10-27T06:26:32Z 2023 Thesis-Doctor of Philosophy Zang, Y. (2023). Real-world object detection. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/171489 https://hdl.handle.net/10356/171489 10.32657/10356/171489 en NTU NAP RIE2020 Industry Alignment Fund Industry Collaboration Projects (IAF-ICP) Singapore MOE AcRF Tier 2 (MOE-T2EP20120-0001) This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University