Learning visual representations without human supervision

Bibliographic Details
Main Author: Xie, Jiahao
Other Authors: Chen Change Loy
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2023
Subjects: Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence; Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision
Online Access: https://hdl.handle.net/10356/171772
Institution: Nanyang Technological University
Language: English
Full description:
Supervised learning with deep neural networks has achieved great success in many visual recognition tasks, including classification, detection, and segmentation. However, the need for expensive human annotations makes it increasingly prohibitive to embrace the massive amount of data available in the real world, and this has become a major bottleneck for supervised learning. Learning from unlabeled data is therefore of substantial interest for furthering general-purpose visual representations. This thesis aims to learn visual representations from the data itself without manual annotations, also known as self-supervised learning (SSL). Specifically, the thesis focuses on two broad categories of representation learning: discriminative and generative.

For discriminative representation learning, we first focus on clustering-based methods. We propose Online Deep Clustering (ODC) to tackle a long-standing challenge of traditional joint clustering and feature learning methods: the training schedule that alternates between deep feature clustering and network parameter updates leads to unstable learning of visual representations. Specifically, we maintain two dynamic memory modules, i.e., a samples memory that stores sample labels and features, and a centroids memory for centroids evolution, and we update both memories alongside the network update iterations. In this way, the labels and the network evolve shoulder-to-shoulder rather than alternately, leading to more stable representation learning.
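To make the two-memory design concrete, below is a minimal sketch of what one ODC-style training iteration might look like in PyTorch. It is an illustrative reading of the description above rather than the thesis implementation; the momentum value, the feature normalisation, and the function and variable names are assumptions.

import torch
import torch.nn.functional as F

def odc_step(model, classifier, optimizer, images, indices,
             feature_memory, label_memory, centroid_memory, momentum=0.5):
    # 1) Update the network with the current pseudo-labels.
    feats = F.normalize(model(images), dim=1)                    # (B, D) batch features
    loss = F.cross_entropy(classifier(feats), label_memory[indices])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    with torch.no_grad():
        # 2) Samples memory: exponential moving average of stored features.
        feature_memory[indices] = F.normalize(
            momentum * feature_memory[indices] + (1 - momentum) * feats.detach(), dim=1)
        # 3) Reassign the batch's pseudo-labels to the nearest centroid.
        sim = feature_memory[indices] @ centroid_memory.t()      # (B, K) cosine similarities
        label_memory[indices] = sim.argmax(dim=1)
        # 4) Centroids memory: refresh each affected centroid from its current members.
        for k in label_memory[indices].unique():
            members = feature_memory[label_memory == k]
            if len(members) > 0:
                centroid_memory[k] = F.normalize(members.mean(dim=0), dim=0)
    return loss.item()

Because the memories are refreshed inside the same iteration as the network update, labels and parameters evolve together instead of in alternating phases, which is the stability argument made above.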
Second, we delve into another popular stream of discriminative representation learning, i.e., contrastive learning. We first devote our effort to understanding the importance of inter-image invariance for contrastive learning. To this end, we perform a rigorous and comprehensive empirical study of inter-image invariance learning along its three main constituting components: pseudo-label maintenance, sampling strategy, and decision boundary design. Through carefully designed comparisons and analysis, we propose InterCLR, a unified and generic unsupervised intra- and inter-image invariance learning framework that improves conventional contrastive learning, which relies only on intra-image statistics.
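As a rough illustration of combining intra- and inter-image invariance in a single contrastive objective, the sketch below treats the other view of the same image as an intra-image positive and all samples sharing a pseudo-label as inter-image positives, using a supervised-contrastive-style loss. The loss form, the temperature, and the pseudo-label source are assumptions, not the InterCLR formulation.

import torch
import torch.nn.functional as F

def intra_inter_contrastive_loss(z1, z2, pseudo_labels, temperature=0.2):
    # z1, z2: (N, D) embeddings of two augmented views; pseudo_labels: (N,) cluster ids.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                       # (N, N) view-1 vs view-2 similarities
    # Positive set: the same instance (diagonal) plus samples with the same pseudo-label.
    pos = pseudo_labels[:, None] == pseudo_labels[None, :]
    pos.fill_diagonal_(True)
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # Average the log-likelihood over each sample's positive set.
    return -(log_prob * pos).sum(dim=1).div(pos.sum(dim=1)).mean()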
After integrating inter-image invariance into contrastive learning, we further turn to contrastive learning on non-iconic scene images. We tackle the discrepancy of image-level contrastive learning between object-centric and scene-centric images by proposing a multi-stage self-supervised learning pipeline, namely ORL, which realizes object-level representation learning from scene images. Specifically, we leverage image-level self-supervised pre-training as the prior for object-level semantic correspondence discovery, and use the obtained correspondences to construct positive object-instance pairs for object-level contrastive learning.
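The correspondence-to-positive-pair step could look roughly like the sketch below, which matches object proposals across two scene images with a pre-trained image-level encoder and pairs nearest neighbours as positives. The box format, crop size, encoder interface, and one-to-one nearest-neighbour matching are illustrative assumptions rather than the ORL pipeline itself.

import torch
import torch.nn.functional as F

def object_positive_pairs(encoder, image_a, image_b, boxes_a, boxes_b, crop_size=96):
    # image_*: (C, H, W) tensors; boxes_*: (N, 4) [x1, y1, x2, y2] object proposals.
    def embed(image, boxes):
        crops = [F.interpolate(image[:, int(y1):int(y2), int(x1):int(x2)][None],
                               size=(crop_size, crop_size), mode="bilinear",
                               align_corners=False)
                 for x1, y1, x2, y2 in boxes]
        return F.normalize(encoder(torch.cat(crops)), dim=1)     # (N, D) object embeddings

    feats_a, feats_b = embed(image_a, boxes_a), embed(image_b, boxes_b)
    sim = feats_a @ feats_b.t()                                   # cosine similarities
    match = sim.argmax(dim=1)                                     # nearest neighbour in image_b
    # Each (proposal in A, matched proposal in B) pair becomes an object-level positive pair.
    return [(boxes_a[i], boxes_b[j]) for i, j in enumerate(match)]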
Third, we shift our attention to the counterpart of discriminative learning, i.e., generative representation learning. We first focus on masked image modeling, a representative masked-prediction-based representation learning paradigm. Instead of randomly inserting mask tokens into the input embeddings in the spatial domain, we propose Masked Frequency Modeling (MFM), which performs image corruption in the frequency domain. Specifically, we first mask out a portion of the frequency components of the input image and then predict the missing frequencies on the frequency spectrum. MFM demonstrates that, for both ViT and CNN architectures, a simple non-Siamese framework can learn meaningful representations while using none of the following: (i) extra data, (ii) an extra model, (iii) mask tokens. We further comprehensively investigate the effectiveness of classical image restoration tasks for representation learning from a unified frequency perspective and reveal their intriguing relations with the proposed MFM approach.
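A minimal sketch of this kind of frequency-domain corruption is given below: mask part of the 2D spectrum, reconstruct a corrupted image from the remaining frequencies, and supervise the model only on the missing frequency components. The circular low-/high-pass mask, the loss definition, and all shapes are illustrative assumptions rather than the actual MFM recipe.

import torch

def frequency_mask(size=224, radius_ratio=0.3, keep_low=True, device="cpu"):
    # Binary mask over the (fft-shifted) 2D spectrum keeping either low or high frequencies.
    ys, xs = torch.meshgrid(torch.arange(size), torch.arange(size), indexing="ij")
    dist = ((ys - size // 2) ** 2 + (xs - size // 2) ** 2).float().sqrt()
    low = dist <= radius_ratio * size / 2
    return (low if keep_low else ~low).to(device)

def mfm_corrupt(images, mask):
    # Zero out the masked frequencies and return the corrupted image plus the target spectrum.
    spec = torch.fft.fftshift(torch.fft.fft2(images), dim=(-2, -1))        # (B, C, H, W), complex
    corrupted = torch.fft.ifft2(torch.fft.ifftshift(spec * mask, dim=(-2, -1))).real
    return corrupted, spec

def mfm_loss(model, images, mask):
    # Predict the image from its corrupted version; penalise errors on the masked frequencies only.
    corrupted, target_spec = mfm_corrupt(images, mask)
    pred_spec = torch.fft.fftshift(torch.fft.fft2(model(corrupted)), dim=(-2, -1))
    return ((pred_spec - target_spec).abs() * (~mask)).mean()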
Finally, apart from learning implicit visual features, we explore another form of visual representation, i.e., generating explicit visual data. In particular, we leverage the explicit data generation ability of generative models to address the data-hungry and label-expensive issues of certain downstream tasks, such as instance segmentation, especially for rare and novel categories. To this end, we propose a diffusion-based data augmentation approach, namely MosaicFusion, to generate a significant amount of object instances and mask annotations without further model training or label supervision. Specifically, we first divide an image canvas into several regions and perform a single round of the diffusion process to generate multiple instances simultaneously, conditioned on different text prompts. We then obtain the corresponding instance masks by aggregating the cross-attention maps associated with the object prompts across layers and diffusion time steps, followed by simple thresholding and edge-aware refinement. Without bells and whistles, MosaicFusion produces a significant amount of synthetic data for both rare and novel categories, which can be regarded as a form of unsupervised visual representation distilled from generative models.
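Purely to make the mask-extraction step concrete, here is a small sketch that averages per-prompt cross-attention maps over layers and time steps and thresholds them into binary instance masks. The tensor shapes, the threshold, and the use of random tensors in place of a real diffusion model are all assumptions; a real pipeline would additionally apply edge-aware refinement.

import torch
import torch.nn.functional as F

def masks_from_cross_attention(attn_maps, out_size=(512, 512), threshold=0.5):
    # attn_maps: dict mapping object prompt -> list of (h, w) attention maps collected
    # across UNet layers and diffusion time steps. Returns prompt -> boolean mask.
    masks = {}
    for prompt, maps in attn_maps.items():
        resized = [F.interpolate(m[None, None].float(), size=out_size,
                                 mode="bilinear", align_corners=False)[0, 0]
                   for m in maps]
        agg = torch.stack(resized).mean(dim=0)                       # average over layers/steps
        agg = (agg - agg.min()) / (agg.max() - agg.min() + 1e-8)     # normalise to [0, 1]
        masks[prompt] = agg > threshold                              # simple thresholding
    return masks

# Illustrative usage with random maps standing in for real cross-attention:
fake_attn = {"a photo of a zebra": [torch.rand(16, 16) for _ in range(4)]}
print(masks_from_cross_attention(fake_attn)["a photo of a zebra"].shape)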
This thesis presents the aforementioned discriminative and generative solutions for learning effective and efficient visual representations without human supervision. Extensive experiments on various downstream tasks demonstrate that all of the proposed methods significantly boost the performance of their respective baselines.
Additional Details
School: School of Computer Science and Engineering
Contributors: Chen Change Loy (ccloy@ntu.edu.sg), Ong Yew Soon (ASYSOng@ntu.edu.sg)
Degree: Doctor of Philosophy
Citation: Xie, J. (2023). Learning visual representations without human supervision. Doctoral thesis, Nanyang Technological University, Singapore.
DOI: 10.32657/10356/171772
License: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
File Format: application/pdf
Deposited: 2023-11-07