Real-world object detection

Object detection is a fundamental computer vision task that estimates object classification labels and location coordinates in images. Previous studies have steadily improved the performance of object detectors, but real-world scenarios introduce significant obstacles that hinder their overall effectiveness.


Bibliographic Details
Main Author: Zang, Yuhang
Other Authors: Chen Change Loy
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2023
Subjects: Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence; Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision
Online Access:https://hdl.handle.net/10356/171489
Institution: Nanyang Technological University
id sg-ntu-dr.10356-171489
record_format dspace
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision
description Object detection is a fundamental computer vision task that estimates object classification labels and location coordinates in images. Previous studies have steadily improved the performance of object detectors, but real-world scenarios introduce significant obstacles that hinder their overall effectiveness. This thesis concentrates on two such challenges. The first is the long-tailed data distribution: real-world data often exhibits a significant imbalance in the number of images per category, and directly training an object detector on long-tailed data can introduce a bias toward head-class objects, causing tail-class objects to be omitted. The second is generalizing to test samples from unseen classes that are not included in the training set; detectors frequently misclassify objects from these unseen classes, either as background or as known categories. This thesis explores solutions to both challenges. For the long-tailed problem, we concentrate on two approaches: data augmentation (FASA) and semi-supervised learning (CascadeMatch). To enhance the detector's generalization ability, we investigate leveraging prior knowledge from vision-language models (OV-DETR, UPT) or multimodal large language models (ContextDET). We first propose a simple yet effective method, Feature Augmentation and Sampling Adaptation (FASA), which addresses the long-tailed issue by augmenting the feature space, especially for rare classes. FASA requires no elaborate loss design and removes the need for inter-class transfer learning, which often involves large costs and manually defined head/tail class groups. We show that FASA is a fast, generic method that can be easily plugged into standard or long-tailed segmentation frameworks, with consistent performance gains and little added cost.
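The feature-space augmentation idea can be sketched in a few lines (a minimal illustration only, not the thesis implementation: the function names, the diagonal-Gaussian sampling, and the inverse-frequency heuristic are assumptions; FASA itself adapts the per-class sampling probabilities online from a loss signal):

```python
import random

def class_sampling_probs(class_counts, power=0.5):
    # Inverse-frequency heuristic: rarer classes receive a higher
    # probability of generating virtual features.
    inv = {c: 1.0 / (n ** power) for c, n in class_counts.items()}
    total = sum(inv.values())
    return {c: v / total for c, v in inv.items()}

def augment_features(mean, std, n_virtual, rng=random):
    # Sample virtual features around a class's running feature
    # statistics (independent Gaussian per feature dimension).
    return [[rng.gauss(m, s) for m, s in zip(mean, std)]
            for _ in range(n_virtual)]
```

For example, with counts {'cat': 1000, 'lynx': 10}, the rare class 'lynx' receives a ten-times-larger sampling probability than 'cat' under this square-root heuristic, so most virtual features are generated for the tail class.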
Second, we propose CascadeMatch, a novel pseudo-labeling-based object detector that uses semi-supervised learning to effectively tackle the long-tailed problem. CascadeMatch features a cascade network architecture consisting of multi-stage detection heads with progressively increasing confidence thresholds. To avoid confirmation bias, each detection head is trained on the ensemble pseudo-labels of all detection heads. To account for the class imbalance in real-world data, which causes neural networks to assign higher confidence to many-shot classes and lower confidence to few-shot classes, we propose class-specific self-adaptive confidence thresholds that are automatically tuned from labeled data with minimal human intervention. Third, to generalize to unseen classes at test time, we propose a novel open-vocabulary detector called OV-DETR. Once trained, OV-DETR can detect any object given its class name or an exemplar image. For training, we condition the Transformer decoder on input embeddings obtained from a pre-trained vision-language model such as CLIP, enabling matching for both text and image queries. Through extensive experiments on the LVIS and COCO datasets, we demonstrate that OV-DETR achieves non-trivial improvements over baseline methods. Fourth, we present a systematic study of unimodal prompt tuning methods, which serve as popular transfer learning paradigms for vision-language models such as CLIP. A major finding is that none of the unimodal prompt tuning methods performs consistently well: text prompt tuning fails on data with high intra-class visual variance, while visual prompt tuning cannot handle low inter-class variance. To combine the best of both worlds, we propose a conceptually simple approach called Unified Prompt Tuning (UPT), which learns a tiny neural network to jointly optimize prompts across different modalities.
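The class-specific self-adaptive thresholds can be illustrated with a toy sketch (the names and the tuning rule are hypothetical: here each class's threshold is simply the mean confidence the model assigns to its labeled examples, clipped to a floor, whereas CascadeMatch tunes its thresholds per cascade stage):

```python
def class_adaptive_thresholds(labeled_confidences, floor=0.5):
    # One threshold per class, estimated from the confidences the
    # model assigns to labeled examples of that class.
    return {c: max(floor, sum(v) / len(v))
            for c, v in labeled_confidences.items()}

def filter_pseudo_labels(predictions, thresholds):
    # Keep (class, confidence) pseudo-labels that clear their
    # class-specific bar; classes without a threshold are rejected.
    return [(c, p) for c, p in predictions if p >= thresholds.get(c, 1.1)]
```

A frequent class that the model already scores confidently ends up with a stricter bar than a rare class, so rare-class pseudo-labels are not filtered away wholesale by a single global threshold.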
Extensive experiments on 11 vision datasets show that UPT achieves a better trade-off than its unimodal counterparts on existing benchmarks. Finally, we introduce a novel research problem, contextual object detection: understanding visible objects within different human-AI interactive contexts. Three representative scenarios are investigated: the language cloze test, visual captioning, and question answering. Moreover, we present ContextDET, a unified multimodal model capable of end-to-end differentiable modeling of visual-language contexts, so as to locate, identify, and associate visual objects with language inputs for human-AI interaction. Extensive experiments show the advantages of ContextDET on a series of tasks, including our proposed contextual object detection, open-vocabulary detection, and referring image segmentation.
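At its core, the query-conditioned matching behind the open-vocabulary setting (OV-DETR's class-name text queries or exemplar-image queries) reduces to similarity between embeddings in a shared space. A simplified sketch with assumed names and a fixed similarity threshold (the actual model learns this matching inside a DETR decoder rather than thresholding cosine scores):

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    du = math.sqrt(sum(x * x for x in u))
    dv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (du * dv)

def match_queries(object_embeddings, query_embedding, threshold=0.5):
    # Indices of detected objects whose embedding aligns with the
    # conditioning query (a text embedding for a class name, or an
    # image embedding for an exemplar crop).
    return [i for i, e in enumerate(object_embeddings)
            if cosine(e, query_embedding) >= threshold]
```

Because text and image queries live in the same embedding space of a model like CLIP, the same matching routine serves both, which is what lets a trained detector accept either a class name or an exemplar image at test time.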
author2 Chen Change Loy
format Thesis-Doctor of Philosophy
author Zang, Yuhang
title Real-world object detection
publisher Nanyang Technological University
publishDate 2023
url https://hdl.handle.net/10356/171489
spelling sg-ntu-dr.10356-171489 2023-11-02T02:20:48Z Real-world object detection Zang, Yuhang Chen Change Loy School of Computer Science and Engineering ccloy@ntu.edu.sg Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision Doctor of Philosophy 2023-10-27T06:26:32Z 2023-10-27T06:26:32Z 2023 Thesis-Doctor of Philosophy Zang, Y. (2023). Real-world object detection. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/171489 https://hdl.handle.net/10356/171489 10.32657/10356/171489 en NTU NAP RIE2020 Industry Alignment Fund Industry Collaboration Projects (IAF-ICP) Singapore MOE AcRF Tier 2 (MOE-T2EP20120-0001) This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University