2D and 3D visual understanding with limited supervision

Bibliographic Details
Main Author: Wu, Zhonghua
Other Authors: Lin, Guosheng
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2023
Online Access:https://hdl.handle.net/10356/164693
Institution: Nanyang Technological University
Description
Summary: Existing fully supervised deep learning methods usually require large numbers of training samples with abundant annotations, which are extremely expensive and labor-intensive to obtain. To alleviate these labeling costs, it is highly desirable to develop weakly supervised learning methods, in which the labels of the training data may be inexact, incomplete, or inaccurate. Typically, there are three types of weakly supervised learning scenarios: inexact supervision, incomplete supervision, and inaccurate supervision. In this thesis, we study all three scenarios through three fundamental 2D and 3D recognition tasks: weakly supervised object detection (WSOD), few-shot image segmentation (FSS), and weakly supervised point cloud segmentation (WSPCS). Specifically, for WSOD we only have image-level annotations for novel-class images in the web domain, corresponding to inexact supervision. For FSS we only have a few pixel-level labeled images (e.g., one or five) for the novel classes, corresponding to incomplete supervision. For WSPCS we consider partially labeled samples as weak annotations, i.e., only a few sparse points within each scene are labeled while all other points are unlabeled, again corresponding to incomplete supervision. Moreover, we observe a major limitation of existing consistency-based WSPCS methods: unreliable pseudo labels produced by conventional confidence-based selection, which in turn leads to inaccurate supervision.

For weakly supervised object detection, in Chapter 3 we propose a novel webly supervised object detection (WebSOD) method for novel-class detection, which requires only web images retrieved from the internet using class names as keywords; during training we have only image-level annotations for these web images. Our method combines bottom-up and top-down cues: a bottom-up mechanism uses a well-trained fully supervised object detector (e.g., Faster R-CNN) as an object region estimator for web images, exploiting the common objectness shared between base and novel classes, and top-down attention cues then guide the classification of the estimated regions. Furthermore, we propose a residual feature refinement (RFR) block to tackle the domain mismatch between the web domain and the target domain.

For few-shot image segmentation, the current state-of-the-art methods treat the task as a conditional foreground-background segmentation problem and assume each class is independent. Differently, in Chapter 4 we introduce the concept of the meta-class: meta-information (e.g., certain mid-level features) shareable among all classes. To explicitly learn meta-class representations, we propose a novel Meta-class Memory-based few-shot segmentation method (MM-Net), which introduces a set of learnable memory embeddings to memorize meta-class information during base-class training and transfer it to novel classes at inference.

For weakly supervised point cloud segmentation, we only have a few sparse labeled points together with a large number of unlabeled points.
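As a rough, hypothetical illustration of this incomplete-supervision setting (a minimal sketch, not code from the thesis): the supervised loss can only be computed on the handful of annotated points, selected here by an assumed boolean `label_mask`; everything else must be learned from the unlabeled points via the strategies described next.

```python
import torch
import torch.nn.functional as F

def sparse_point_loss(logits, labels, label_mask):
    """Cross-entropy restricted to the few labeled points.

    logits:     (N, C) per-point class scores for one point cloud
    labels:     (N,)   ground-truth ids (arbitrary where unlabeled)
    label_mask: (N,)   bool, True only for the sparse annotated points
    """
    if label_mask.sum() == 0:
        return logits.new_zeros(())  # scene with no annotated points
    return F.cross_entropy(logits[label_mask], labels[label_mask])

# Example: 2048 points, 13 classes, only 20 points annotated.
logits = torch.randn(2048, 13)
labels = torch.randint(0, 13, (2048,))
mask = torch.zeros(2048, dtype=torch.bool)
mask[torch.randperm(2048)[:20]] = True
loss = sparse_point_loss(logits, labels, mask)
```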
To exploit the unlabeled data, we design two methods, one based on adversarial training and one on consistency training. First, in Chapter 5, considering that smoothness-based methods have achieved promising progress, we advocate applying consistency constraints under various perturbations to effectively regularize unlabeled 3D points. In particular, we propose a novel Dual Adaptive Transformations (DAT) model for weakly supervised point cloud segmentation, in which the dual adaptive transformations are generated via an adversarial strategy at both the point level and the region level, enforcing local and structural smoothness constraints on the 3D point clouds. Second, in Chapter 6, we observe that naively applying consistency constraints to weakly supervised point cloud segmentation has two major limitations: unreliable pseudo labels produced by conventional confidence-based selection, and insufficient consistency constraints caused by discarding unreliable pseudo labels. We therefore propose a novel Reliability-Adaptive Consistency Network (RAC-Net) that uses both prediction confidence and model uncertainty to measure pseudo-label reliability, and applies consistency training to all unlabeled points, with different consistency constraints imposed on different points according to the reliability of their pseudo labels.
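As a hedged sketch of the reliability idea (an illustration under assumptions, not the RAC-Net implementation): confidence can be taken as the mean softmax probability of the predicted class, and uncertainty estimated, for instance, with Monte Carlo dropout; the `model`, `points`, and `n_passes` names below are hypothetical placeholders for a point-wise segmentation network and its input.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pseudo_labels_with_reliability(model, points, n_passes=4):
    """Assign each unlabeled point a pseudo label and a reliability weight.

    Confidence = mean softmax probability of the predicted class;
    uncertainty = variance over MC-dropout passes (an assumed estimator
    here, one common choice, not necessarily the one used in the thesis).
    """
    model.train()  # keep dropout active for Monte Carlo sampling
    probs = torch.stack(
        [F.softmax(model(points), dim=-1) for _ in range(n_passes)]
    )                                   # (T, N, C)
    mean_p = probs.mean(dim=0)          # (N, C)
    conf, pseudo = mean_p.max(dim=-1)   # per-point confidence and label
    uncert = probs.var(dim=0).sum(-1)   # (N,) total predictive variance
    return pseudo, conf * torch.exp(-uncert)

def weighted_consistency_loss(student_logits, pseudo, reliability):
    # Consistency is applied to *all* unlabeled points; unreliable ones
    # are down-weighted rather than discarded.
    per_point = F.cross_entropy(student_logits, pseudo, reduction="none")
    return (reliability * per_point).mean()
```

The design choice this sketch highlights is that low-reliability points are down-weighted rather than dropped, so every unlabeled point still contributes some consistency signal.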