Learning visual representations without human supervision

Supervised learning with deep neural networks has achieved great success in many visual recognition tasks including classification, detection, and segmentation. However, the need for expensive human annotations makes it increasingly prohibitive to embrace the massive amount of data available in the...

Bibliographic Details
Main Author:	Xie, Jiahao
Other Authors:	Chen Change Loy
Format:	Thesis-Doctor of Philosophy
Language:	English
Published:	Nanyang Technological University 2023
Subjects:	Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence; Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision
Online Access:	https://hdl.handle.net/10356/171772
Institution:	Nanyang Technological University

Description
Summary:	Supervised learning with deep neural networks has achieved great success in many visual recognition tasks, including classification, detection, and segmentation. However, the need for expensive human annotations makes it increasingly prohibitive to exploit the massive amount of data available in the real world, which has been a major bottleneck for supervised learning. Thus, learning with unlabeled data is of substantial interest for advancing general-purpose visual representations. This thesis aims to learn visual representations from the data itself without manual annotations, also known as self-supervised learning (SSL). Specifically, the thesis focuses on two broad representation learning categories: discriminative and generative. For discriminative representation learning, we first focus on clustering-based representation learning. We propose Online Deep Clustering (ODC) to tackle a long-standing challenge in traditional joint clustering and feature learning methods: a training schedule that alternates between deep feature clustering and network parameter updates leads to unstable learning of visual representations. Specifically, we maintain two dynamic memory modules, i.e., a samples memory to store samples' labels and features, and a centroids memory for centroid evolution, and then update the two memory modules alongside the network update iterations. In this way, labels and the network evolve together rather than in alternation, leading to more stable representation learning. Second, we delve into another popular stream in discriminative representation learning, i.e., contrastive learning. We first devote our effort to understanding the importance of inter-image invariance for contrastive learning. To this end, we perform a rigorous and comprehensive empirical study on inter-image invariance learning from three main constituent components: pseudo-label maintenance, sampling strategy, and decision boundary design.
Through carefully designed comparisons and analysis, we propose InterCLR, a unified and generic unsupervised intra- and inter-image invariance learning framework that improves on conventional contrastive learning, which relies only on intra-image statistics. After integrating inter-image invariance into contrastive learning, we turn to contrastive learning on non-iconic scene images. We tackle the discrepancy of image-level contrastive learning between object-centric and scene-centric images by proposing a multi-stage self-supervised learning pipeline, namely ORL, which realizes object-level representation learning from scene images. Specifically, we leverage image-level self-supervised pre-training as the prior for object-level semantic correspondence discovery, and use the obtained correspondence to construct positive object-instance pairs for object-level contrastive learning. Third, we shift our attention to the counterpart of discriminative learning, i.e., generative representation learning. We first focus on masked image modeling, which is a representative masked-prediction-based representation learning paradigm. Instead of randomly inserting mask tokens into the input embeddings in the spatial domain, we propose Masked Frequency Modeling (MFM) to perform image corruptions in the frequency domain. Specifically, we first mask out a portion of the frequency components of the input image and then predict the missing frequencies on the frequency spectrum. MFM demonstrates that, for both ViT and CNN, a simple non-Siamese framework can learn meaningful representations even when using none of the following: (i) extra data, (ii) extra model, (iii) mask token. We further comprehensively investigate the effectiveness of classical image restoration tasks for representation learning from a unified frequency perspective and reveal their intriguing relations with the proposed MFM approach.
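The frequency-domain corruption described for MFM can be illustrated with a minimal NumPy sketch. This is not the thesis implementation: the circular low-pass mask, its radius, the squared-error loss, and all function names (`low_pass_mask`, `mask_frequencies`, `masked_frequency_loss`) are assumptions chosen for illustration, and no network is trained here.

```python
import numpy as np

def low_pass_mask(height, width, radius):
    """Boolean mask keeping frequencies within `radius` of the centred spectrum.
    A circular low-pass mask is an assumption; other mask shapes are possible."""
    ys, xs = np.ogrid[:height, :width]
    cy, cx = height // 2, width // 2
    return (ys - cy) ** 2 + (xs - cx) ** 2 <= radius ** 2

def mask_frequencies(image, keep):
    """Zero the frequency components where `keep` is False.
    Returns the corrupted image and the full (uncorrupted) centred spectrum."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    # The mask is symmetric about the centre, so the inverse transform is
    # real up to floating-point error; the tiny imaginary part is dropped.
    corrupted = np.fft.ifft2(np.fft.ifftshift(spectrum * keep)).real
    return corrupted, spectrum

def masked_frequency_loss(predicted_image, target_spectrum, keep):
    """Mean squared spectral error, computed on the masked-out frequencies only."""
    predicted_spectrum = np.fft.fftshift(np.fft.fft2(predicted_image))
    squared_error = np.abs(predicted_spectrum - target_spectrum) ** 2
    return squared_error[~keep].mean()

rng = np.random.default_rng(0)
image = rng.random((32, 32))                  # stand-in for a grayscale image
keep = low_pass_mask(32, 32, radius=8)
corrupted, spectrum = mask_frequencies(image, keep)

# Two reference points for the loss: a predictor that returns its corrupted
# input unchanged, and one that recovers the original image exactly.
loss_identity = masked_frequency_loss(corrupted, spectrum, keep)
loss_perfect = masked_frequency_loss(image, spectrum, keep)
```

In a training setup, a model would take `corrupted` as input and its output would replace `predicted_image`; restricting the loss to the masked frequencies mirrors the "predict the missing frequencies" step in the summary.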
Finally, apart from learning implicit visual features, we explore another form of visual representation, i.e., generating explicit visual data. In particular, we leverage the explicit data generation ability of generative models to address the data-hungry and label-expensive nature of certain downstream tasks such as instance segmentation, especially for rare and novel categories. To this end, we propose a diffusion-based data augmentation approach, namely MosaicFusion, to generate a large number of object instances and mask annotations without further model training or label supervision. Specifically, we first divide an image canvas into several regions and perform a single round of the diffusion process to generate multiple instances simultaneously, conditioning on different text prompts. We then obtain the corresponding instance masks by aggregating cross-attention maps associated with object prompts across layers and diffusion time steps, followed by simple thresholding and edge-aware refinement processing. Without bells and whistles, MosaicFusion can produce a large amount of synthetic data for both rare and novel categories, which can be regarded as a form of unsupervised visual representation distilled from the generative models. This thesis presents the aforementioned discriminative and generative solutions for learning effective and efficient visual representations without human supervision. Extensive experiments on various downstream tasks demonstrate that all proposed methods significantly improve the performance of their respective baselines.
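The mask-extraction step described for MosaicFusion (aggregate cross-attention maps over layers and time steps, then threshold) can be sketched as follows. This is a toy illustration under stated assumptions, not the thesis code: the maps are synthetic rather than taken from a diffusion model, they are assumed to be already resized to a common resolution, averaging and min-max normalisation are assumed choices, the threshold of 0.5 is arbitrary, and the edge-aware refinement step is omitted. The function names are hypothetical.

```python
import numpy as np

def aggregate_attention(maps):
    """Average cross-attention maps over diffusion time steps and layers.
    `maps` has shape (steps, layers, H, W); a common resolution is assumed."""
    return maps.mean(axis=(0, 1))

def attention_to_mask(maps, threshold=0.5):
    """Min-max normalise the aggregated map and threshold it into a binary mask.
    Both the normalisation and the threshold value are illustrative assumptions."""
    aggregated = aggregate_attention(maps)
    low, high = aggregated.min(), aggregated.max()
    normalised = (aggregated - low) / (high - low + 1e-8)
    return normalised >= threshold

# Synthetic maps: 4 time steps, 3 layers, 8x8 resolution, with a
# high-attention square standing in for one object prompt's region.
maps = np.full((4, 3, 8, 8), 0.1)
maps[:, :, 2:6, 2:6] = 0.9
mask = attention_to_mask(maps)
```

In the setting the summary describes, one such mask would be computed per object prompt, each restricted to its own region of the canvas.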
