Learning visual representations without human supervision
Main Author: | |
---|---|
Other Authors: | |
Format: | Thesis-Doctor of Philosophy |
Language: | English |
Published: | Nanyang Technological University, 2023 |
Subjects: | |
Online Access: | https://hdl.handle.net/10356/171772 |
Institution: | Nanyang Technological University |
Summary: | Supervised learning with deep neural networks has achieved great success in many visual recognition tasks, including classification, detection, and segmentation. However, the need for expensive human annotations makes it increasingly prohibitive to exploit the massive amount of data available in the real world, which has been a major bottleneck for supervised learning. Thus, learning from unlabeled data is of substantial interest for advancing general-purpose visual representations. This thesis aims to learn visual representations from the data itself without manual annotations, also known as self-supervised learning (SSL). Specifically, the thesis focuses on two broad categories of representation learning: discriminative and generative. For discriminative representation learning, we first focus on clustering-based representation learning. We propose Online Deep Clustering (ODC) to tackle a long-standing challenge in traditional joint clustering and feature learning methods: a training schedule that alternates between deep feature clustering and network parameter updates leads to unstable learning of visual representations. Specifically, we maintain two dynamic memory modules, i.e., a samples memory that stores samples’ labels and features and a centroids memory for centroid evolution, and update both memory modules alongside the network update iterations. In this way, labels and the network evolve shoulder to shoulder rather than in alternation, leading to more stable representation learning. Second, we delve into another popular stream in discriminative representation learning, i.e., contrastive learning. We first devote our effort to understanding the importance of inter-image invariance for contrastive learning. To this end, we perform a rigorous and comprehensive empirical study of inter-image invariance learning along its three main constituent components: pseudo-label maintenance, sampling strategy, and decision boundary design.
Through carefully designed comparisons and analysis, we propose InterCLR, a unified and generic unsupervised intra- and inter-image invariance learning framework that improves on conventional contrastive learning, which relies only on intra-image statistics. After integrating inter-image invariance into contrastive learning, we turn to contrastive learning on non-iconic scene images. We tackle the discrepancy in image-level contrastive learning between object-centric and scene-centric images by proposing a multi-stage self-supervised learning pipeline, namely ORL, that realizes object-level representation learning from scene images. Specifically, we leverage image-level self-supervised pre-training as the prior for object-level semantic correspondence discovery, and use the obtained correspondence to construct positive object-instance pairs for object-level contrastive learning. Third, we shift our attention to the counterpart of discriminative learning, i.e., generative representation learning. We first focus on masked image modeling, a representative masked-prediction-based representation learning paradigm. Instead of randomly inserting mask tokens into the input embeddings in the spatial domain, we propose Masked Frequency Modeling (MFM), which corrupts images in the frequency domain. Specifically, we first mask out a portion of the frequency components of the input image and then predict the missing frequencies on the frequency spectrum. MFM demonstrates that, for both ViT and CNN, a simple non-Siamese framework can learn meaningful representations using none of the following: (i) extra data, (ii) an extra model, (iii) mask tokens. We further comprehensively investigate the effectiveness of classical image restoration tasks for representation learning from a unified frequency perspective and reveal their intriguing relations with the proposed MFM approach.
Finally, apart from learning implicit visual features, we explore another form of visual representation, i.e., generating explicit visual data. In particular, we leverage the explicit data generation ability of generative models to address the data-hungry and label-expensive nature of certain downstream tasks such as instance segmentation, especially for rare and novel categories. To this end, we propose a diffusion-based data augmentation approach, namely MosaicFusion, to generate a significant amount of object instances and mask annotations without further model training or label supervision. Specifically, we first divide an image canvas into several regions and run a single round of the diffusion process to generate multiple instances simultaneously, conditioning on different text prompts. We then obtain the corresponding instance masks by aggregating the cross-attention maps associated with object prompts across layers and diffusion time steps, followed by simple thresholding and edge-aware refinement. Without bells and whistles, MosaicFusion can produce a significant amount of synthetic data for both rare and novel categories, which can be regarded as a form of unsupervised visual representation distilled from the generative models. This thesis presents the aforementioned discriminative and generative solutions for learning effective and efficient visual representations without human supervision. Extensive experiments on various downstream tasks demonstrate that all proposed methods significantly improve the performance of their respective baselines. |
---|
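
The summary describes Masked Frequency Modeling as masking out a portion of an image's frequency components and then predicting the missing frequencies on the spectrum. The following is a minimal sketch of that corruption step only, not the thesis implementation. It assumes a tiny grayscale image, a naive DFT written with the Python standard library, and a circular low-pass mask; the record does not say which mask shape or loss the thesis uses, and the names `dft2`, `idft2`, `low_pass_mask`, `corrupt_in_frequency_domain`, and `masked_frequency_loss` are hypothetical.

```python
import cmath
import math


def dft2(image):
    """Naive 2D DFT of a small grayscale image given as a list of rows."""
    h, w = len(image), len(image[0])
    return [
        [
            sum(
                image[y][x] * cmath.exp(-2j * math.pi * (u * y / h + v * x / w))
                for y in range(h)
                for x in range(w)
            )
            for v in range(w)
        ]
        for u in range(h)
    ]


def idft2(spectrum):
    """Naive inverse 2D DFT; returns complex pixel values."""
    h, w = len(spectrum), len(spectrum[0])
    return [
        [
            sum(
                spectrum[u][v] * cmath.exp(2j * math.pi * (u * y / h + v * x / w))
                for u in range(h)
                for v in range(w)
            ) / (h * w)
            for x in range(w)
        ]
        for y in range(h)
    ]


def low_pass_mask(h, w, radius):
    """1 keeps a frequency, 0 masks it out (assumed circular low-pass shape)."""
    mask = []
    for u in range(h):
        fu = min(u, h - u)  # wrap-around distance to the zero frequency
        row = []
        for v in range(w):
            fv = min(v, w - v)
            row.append(1 if math.hypot(fu, fv) <= radius else 0)
        mask.append(row)
    return mask


def corrupt_in_frequency_domain(image, mask):
    """Zero the masked frequencies; return (corrupted image, full spectrum)."""
    h, w = len(image), len(image[0])
    spectrum = dft2(image)
    kept = [[spectrum[u][v] * mask[u][v] for v in range(w)] for u in range(h)]
    # The mask is symmetric, so the inverse transform is real up to rounding.
    corrupted = [[value.real for value in row] for row in idft2(kept)]
    return corrupted, spectrum


def masked_frequency_loss(predicted, target, mask):
    """Mean squared error over masked-out frequencies only (an assumed loss)."""
    errors = [
        abs(predicted[u][v] - target[u][v]) ** 2
        for u in range(len(mask))
        for v in range(len(mask[0]))
        if mask[u][v] == 0
    ]
    return sum(errors) / len(errors) if errors else 0.0
```

In this sketch a model would receive the corrupted image as input and be trained so that the spectrum of its output matches the full spectrum at the masked positions. With `radius=0` only the zero-frequency term survives, so every pixel of the corrupted image equals the image mean, which makes the effect of the mask easy to check by hand.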