Learning visual representations without human supervision

Bibliographic Details
Main Author: Xie, Jiahao
Other Authors: Chen Change Loy
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2023
Subjects: Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence; Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision
Online Access: https://hdl.handle.net/10356/171772
Institution: Nanyang Technological University
Language: English
Full description:
Supervised learning with deep neural networks has achieved great success in many visual recognition tasks, including classification, detection, and segmentation. However, the need for expensive human annotations makes it increasingly prohibitive to embrace the massive amount of data available in the real world, and this has become a major bottleneck for supervised learning. Learning from unlabeled data is therefore of substantial interest for furthering general-purpose visual representations. This thesis aims to learn visual representations from the data itself without manual annotations, also known as self-supervised learning (SSL). Specifically, the thesis focuses on two broad categories of representation learning: discriminative and generative.

For discriminative representation learning, we first focus on clustering-based methods. We propose Online Deep Clustering (ODC) to tackle a long-standing challenge of traditional joint clustering and feature learning methods: the training schedule that alternates between deep feature clustering and network parameter updates leads to unstable learning of visual representations. Specifically, we maintain two dynamic memory modules, i.e., a samples memory that stores sample labels and features, and a centroids memory for centroids evolution, and we update both memories alongside the network update iterations. In this way, the labels and the network evolve shoulder-to-shoulder rather than alternately, leading to more stable representation learning.
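To make the two-memory design concrete, below is a minimal sketch of what one ODC-style training iteration might look like in PyTorch. It is an illustrative reading of the description above rather than the thesis implementation; the momentum value, the feature normalisation, and the function and variable names are assumptions.

import torch
import torch.nn.functional as F

def odc_step(model, classifier, optimizer, images, indices,
             feature_memory, label_memory, centroid_memory, momentum=0.5):
    # 1) Update the network with the current pseudo-labels.
    feats = F.normalize(model(images), dim=1)                    # (B, D) batch features
    loss = F.cross_entropy(classifier(feats), label_memory[indices])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    with torch.no_grad():
        # 2) Samples memory: exponential moving average of stored features.
        feature_memory[indices] = F.normalize(
            momentum * feature_memory[indices] + (1 - momentum) * feats.detach(), dim=1)
        # 3) Reassign the batch's pseudo-labels to the nearest centroid.
        sim = feature_memory[indices] @ centroid_memory.t()      # (B, K) cosine similarities
        label_memory[indices] = sim.argmax(dim=1)
        # 4) Centroids memory: refresh each affected centroid from its current members.
        for k in label_memory[indices].unique():
            members = feature_memory[label_memory == k]
            if len(members) > 0:
                centroid_memory[k] = F.normalize(members.mean(dim=0), dim=0)
    return loss.item()

Because the memories are refreshed inside the same iteration as the network update, labels and parameters evolve together instead of in alternating phases, which is the stability argument made above.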
Second, we delve into another popular stream of discriminative representation learning, i.e., contrastive learning. We first devote our effort to understanding the importance of inter-image invariance for contrastive learning. To this end, we perform a rigorous and comprehensive empirical study of inter-image invariance learning along its three main constituting components: pseudo-label maintenance, sampling strategy, and decision boundary design. Through carefully designed comparisons and analysis, we propose InterCLR, a unified and generic unsupervised intra- and inter-image invariance learning framework that improves conventional contrastive learning, which relies only on intra-image statistics.
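As a rough illustration of combining intra- and inter-image invariance in a single contrastive objective, the sketch below treats the other view of the same image as an intra-image positive and all samples sharing a pseudo-label as inter-image positives, using a supervised-contrastive-style loss. The loss form, the temperature, and the pseudo-label source are assumptions, not the InterCLR formulation.

import torch
import torch.nn.functional as F

def intra_inter_contrastive_loss(z1, z2, pseudo_labels, temperature=0.2):
    # z1, z2: (N, D) embeddings of two augmented views; pseudo_labels: (N,) cluster ids.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                       # (N, N) view-1 vs view-2 similarities
    # Positive set: the same instance (diagonal) plus samples with the same pseudo-label.
    pos = pseudo_labels[:, None] == pseudo_labels[None, :]
    pos.fill_diagonal_(True)
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # Average the log-likelihood over each sample's positive set.
    return -(log_prob * pos).sum(dim=1).div(pos.sum(dim=1)).mean()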
After integrating inter-image invariance into contrastive learning, we further turn to contrastive learning on non-iconic scene images. We tackle the discrepancy of image-level contrastive learning between object-centric and scene-centric images by proposing a multi-stage self-supervised learning pipeline, namely ORL, which realizes object-level representation learning from scene images. Specifically, we leverage image-level self-supervised pre-training as the prior for object-level semantic correspondence discovery, and use the obtained correspondences to construct positive object-instance pairs for object-level contrastive learning.
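The correspondence-to-positive-pair step could look roughly like the sketch below, which matches object proposals across two scene images with a pre-trained image-level encoder and pairs nearest neighbours as positives. The box format, crop size, encoder interface, and one-to-one nearest-neighbour matching are illustrative assumptions rather than the ORL pipeline itself.

import torch
import torch.nn.functional as F

def object_positive_pairs(encoder, image_a, image_b, boxes_a, boxes_b, crop_size=96):
    # image_*: (C, H, W) tensors; boxes_*: (N, 4) [x1, y1, x2, y2] object proposals.
    def embed(image, boxes):
        crops = [F.interpolate(image[:, int(y1):int(y2), int(x1):int(x2)][None],
                               size=(crop_size, crop_size), mode="bilinear",
                               align_corners=False)
                 for x1, y1, x2, y2 in boxes]
        return F.normalize(encoder(torch.cat(crops)), dim=1)     # (N, D) object embeddings

    feats_a, feats_b = embed(image_a, boxes_a), embed(image_b, boxes_b)
    sim = feats_a @ feats_b.t()                                   # cosine similarities
    match = sim.argmax(dim=1)                                     # nearest neighbour in image_b
    # Each (proposal in A, matched proposal in B) pair becomes an object-level positive pair.
    return [(boxes_a[i], boxes_b[j]) for i, j in enumerate(match)]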
Third, we shift our attention to the counterpart of discriminative learning, i.e., generative representation learning. We first focus on masked image modeling, a representative masked-prediction-based representation learning paradigm. Instead of randomly inserting mask tokens into the input embeddings in the spatial domain, we propose Masked Frequency Modeling (MFM), which performs image corruption in the frequency domain. Specifically, we first mask out a portion of the frequency components of the input image and then predict the missing frequencies on the frequency spectrum. MFM demonstrates that, for both ViT and CNN architectures, a simple non-Siamese framework can learn meaningful representations while using none of the following: (i) extra data, (ii) an extra model, (iii) mask tokens. We further comprehensively investigate the effectiveness of classical image restoration tasks for representation learning from a unified frequency perspective and reveal their intriguing relations with the proposed MFM approach.
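A minimal sketch of this kind of frequency-domain corruption is given below: mask part of the 2D spectrum, reconstruct a corrupted image from the remaining frequencies, and supervise the model only on the missing frequency components. The circular low-/high-pass mask, the loss definition, and all shapes are illustrative assumptions rather than the actual MFM recipe.

import torch

def frequency_mask(size=224, radius_ratio=0.3, keep_low=True, device="cpu"):
    # Binary mask over the (fft-shifted) 2D spectrum keeping either low or high frequencies.
    ys, xs = torch.meshgrid(torch.arange(size), torch.arange(size), indexing="ij")
    dist = ((ys - size // 2) ** 2 + (xs - size // 2) ** 2).float().sqrt()
    low = dist <= radius_ratio * size / 2
    return (low if keep_low else ~low).to(device)

def mfm_corrupt(images, mask):
    # Zero out the masked frequencies and return the corrupted image plus the target spectrum.
    spec = torch.fft.fftshift(torch.fft.fft2(images), dim=(-2, -1))        # (B, C, H, W), complex
    corrupted = torch.fft.ifft2(torch.fft.ifftshift(spec * mask, dim=(-2, -1))).real
    return corrupted, spec

def mfm_loss(model, images, mask):
    # Predict the image from its corrupted version; penalise errors on the masked frequencies only.
    corrupted, target_spec = mfm_corrupt(images, mask)
    pred_spec = torch.fft.fftshift(torch.fft.fft2(model(corrupted)), dim=(-2, -1))
    return ((pred_spec - target_spec).abs() * (~mask)).mean()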
Finally, apart from learning implicit visual features, we explore another form of visual representation, i.e., generating explicit visual data. In particular, we leverage the explicit data generation ability of generative models to address the data-hungry and label-expensive issues of certain downstream tasks, such as instance segmentation, especially for rare and novel categories. To this end, we propose a diffusion-based data augmentation approach, namely MosaicFusion, to generate a significant amount of object instances and mask annotations without further model training or label supervision. Specifically, we first divide an image canvas into several regions and perform a single round of the diffusion process to generate multiple instances simultaneously, conditioned on different text prompts. We then obtain the corresponding instance masks by aggregating the cross-attention maps associated with the object prompts across layers and diffusion time steps, followed by simple thresholding and edge-aware refinement. Without bells and whistles, MosaicFusion produces a significant amount of synthetic data for both rare and novel categories, which can be regarded as a form of unsupervised visual representation distilled from generative models.
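Purely to make the mask-extraction step concrete, here is a small sketch that averages per-prompt cross-attention maps over layers and time steps and thresholds them into binary instance masks. The tensor shapes, the threshold, and the use of random tensors in place of a real diffusion model are all assumptions; a real pipeline would additionally apply edge-aware refinement.

import torch
import torch.nn.functional as F

def masks_from_cross_attention(attn_maps, out_size=(512, 512), threshold=0.5):
    # attn_maps: dict mapping object prompt -> list of (h, w) attention maps collected
    # across UNet layers and diffusion time steps. Returns prompt -> boolean mask.
    masks = {}
    for prompt, maps in attn_maps.items():
        resized = [F.interpolate(m[None, None].float(), size=out_size,
                                 mode="bilinear", align_corners=False)[0, 0]
                   for m in maps]
        agg = torch.stack(resized).mean(dim=0)                       # average over layers/steps
        agg = (agg - agg.min()) / (agg.max() - agg.min() + 1e-8)     # normalise to [0, 1]
        masks[prompt] = agg > threshold                              # simple thresholding
    return masks

# Illustrative usage with random maps standing in for real cross-attention:
fake_attn = {"a photo of a zebra": [torch.rand(16, 16) for _ in range(4)]}
print(masks_from_cross_attention(fake_attn)["a photo of a zebra"].shape)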
This thesis presents the aforementioned discriminative and generative solutions for learning effective and efficient visual representations without human supervision. Extensive experiments on various downstream tasks demonstrate that all of the proposed methods significantly boost the performance of their respective baselines.
Additional Details
School: School of Computer Science and Engineering
Contributors: Chen Change Loy (ccloy@ntu.edu.sg), Ong Yew Soon (ASYSOng@ntu.edu.sg)
Degree: Doctor of Philosophy
Citation: Xie, J. (2023). Learning visual representations without human supervision. Doctoral thesis, Nanyang Technological University, Singapore.
DOI: 10.32657/10356/171772
License: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
File Format: application/pdf
Deposited: 2023-11-07