Synthesizing photorealistic images with deep generative learning

Bibliographic Details
Main Author: Zheng, Chuanxia
Other Authors: Cham, Tat Jen (School of Computer Science and Engineering)
Format: Thesis (Doctor of Philosophy)
Language: English
Published: Nanyang Technological University, 2021
Subjects: Engineering::Computer science and engineering
Online Access: https://hdl.handle.net/10356/153008
DOI: 10.32657/10356/153008
Citation: Zheng, C. (2021). Synthesizing photorealistic images with deep generative learning. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/153008
License: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
Full description

The goal of this thesis is to present my research contributions towards solving various visual synthesis and generation tasks, comprising image translation, image completion, and completed scene decomposition. The thesis consists of five pieces of work, each presenting a new learning-based approach for synthesizing images with plausible content and visually realistic appearance. Each work demonstrates the effectiveness of the proposed approach for image synthesis, and some further contribute to related tasks such as depth estimation.

Part 1 describes methods for changing visual appearance. In Chapter 2, a synthetic-to-realistic translation system is presented to address real-world single-image depth estimation, where only synthetic image-depth pairs and unpaired real images are used for training. This model provides a new perspective on a real-world estimation task by exploiting low-cost yet highly reusable synthetic data. Chapter 3 turns to general image-to-image (I2I) translation, rather than only synthetic-to-realistic translation. A novel spatially-correlative loss is proposed that is simple and efficient, yet effective at preserving scene structure while supporting large appearance changes. Spatial patterns of self-similarity are exploited to define scene structure, so the loss captures only the spatial relationships within an image rather than domain appearance. Extensive experimental results demonstrate significant improvements from this content loss on several I2I tasks, including single-modal, multi-modal, and even single-image translation. Furthermore, the loss can easily be integrated into existing network architectures, allowing wide applicability.

Part 2 presents approaches that generate semantically reasonable content for masked regions. Instead of purely modifying local appearance as in Part 1, two approaches are presented that create new content, as well as realistic appearance, for a given image. Chapter 4 introduces a new task, pluralistic image completion: generating multiple, diverse, plausible results, as opposed to previous works that attempt only a single "guess" for this highly subjective problem. A novel probabilistically principled framework is proposed, which achieved state-of-the-art results for this new task and has become the benchmark for later work. A subsequent observation, however, is that convolutional neural network (CNN) architectures model long-range dependencies only through many stacked layers, so holes are progressively influenced by neighboring pixels, which introduces artifacts. To mitigate this, Chapter 5 treats image completion as a directionless sequence-to-sequence prediction task and deploys a transformer to directly capture long-range dependencies in the encoder in a first phase. Crucially, a restrictive CNN with small, non-overlapping receptive fields (RFs) is employed for token representation, which allows the transformer to model long-range context relations explicitly and with equal importance in all layers, without implicitly confounding neighboring tokens as happens when larger RFs are used. Extensive experiments demonstrate superior performance over previous CNN-based methods on several datasets.
Part 3 combines recognition learning and recent generative modeling into a holistic scene decomposition and completion framework, in which a network is trained to decompose a scene into individual objects, infer their underlying occlusion relationships, and moreover imagine what the originally occluded objects may look like, using only a single image as input. In Chapter 6, the aim is to derive a higher-level structural decomposition of a scene, automatically recognizing objects and generating intact shapes as well as photorealistic appearances for occluded regions, without the manual masking required in Part 2. To achieve this, a new pipeline is presented that interleaves instance segmentation and scene completion over multiple iterations, solving for objects in a layer-by-layer manner. The proposed system shows significant improvement over state-of-the-art methods and enables interesting applications such as scene editing and recomposition.

In summary, this thesis introduces a series of works that synthesize photorealistic images by changing appearance, imagining semantic content, and automatically inferring invisible shape and appearance.
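To make the spatially-correlative idea of Chapter 3 concrete, the following is a minimal illustrative sketch, not the thesis implementation: a local self-similarity map is computed for some feature representation of the source and of the translated image, and the loss compares the two maps, so structure is constrained while appearance is free to change. The function names, the patch size, and the use of raw tensors as stand-in features are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def self_similarity_map(feat, patch=7):
    """Cosine similarity of each spatial location to the features in its
    patch x patch neighbourhood -- a simple notion of local self-similarity."""
    b, c, h, w = feat.shape
    feat = F.normalize(feat, dim=1)
    # neighbourhoods as columns: (b, c * patch * patch, h * w)
    neigh = F.unfold(feat, kernel_size=patch, padding=patch // 2)
    neigh = neigh.view(b, c, patch * patch, h * w)
    query = feat.view(b, c, 1, h * w)
    return (query * neigh).sum(dim=1)          # (b, patch * patch, h * w)

def spatially_correlative_loss(feat_src, feat_trans, patch=7):
    """Penalise differences between the self-similarity patterns of the
    source and the translated image, rather than their appearances."""
    return F.l1_loss(self_similarity_map(feat_trans, patch),
                     self_similarity_map(feat_src, patch))

if __name__ == "__main__":
    # Random tensors stand in for features from any fixed extractor.
    src, trans = torch.rand(1, 64, 32, 32), torch.rand(1, 64, 32, 32)
    print(spatially_correlative_loss(src, trans).item())
```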
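The restrictive token representation described in Chapter 5 can likewise be sketched as a toy model, assuming arbitrary layer sizes and names rather than the thesis architecture: tokens come from a convolution whose kernel size equals its stride, so each token's receptive field is a single non-overlapping patch; masked patches are replaced by a learned mask token, and only a standard transformer encoder mixes information across positions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenCompletionSketch(nn.Module):
    """Toy completion encoder: a 'restrictive' convolution tokenises
    non-overlapping patches, and only the transformer models long-range
    context, explicitly and in every layer."""
    def __init__(self, patch=16, dim=256, layers=6, heads=8, max_tokens=1024):
        super().__init__()
        # kernel_size == stride: each token sees exactly one patch
        self.to_tokens = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, max_tokens, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)
        self.to_pixels = nn.ConvTranspose2d(dim, 3, kernel_size=patch, stride=patch)
        self.patch = patch

    def forward(self, img, mask):
        # img: (b, 3, H, W); mask: (b, 1, H, W), 1 marks missing pixels
        tok = self.to_tokens(img * (1 - mask))             # (b, dim, h, w)
        b, d, h, w = tok.shape
        tok = tok.flatten(2).transpose(1, 2)               # (b, h*w, dim)
        hole = F.max_pool2d(mask, self.patch).flatten(2).transpose(1, 2)
        tok = tok * (1 - hole) + self.mask_token * hole    # swap in mask tokens
        tok = self.encoder(tok + self.pos[:, : h * w])
        return self.to_pixels(tok.transpose(1, 2).view(b, d, h, w))

if __name__ == "__main__":
    net = TokenCompletionSketch()
    img, mask = torch.rand(1, 3, 256, 256), torch.zeros(1, 1, 256, 256)
    mask[..., 96:160, 96:160] = 1                          # a square hole
    print(net(img, mask).shape)                            # (1, 3, 256, 256)
```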
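Finally, the layer-by-layer decomposition of Chapter 6 can be written, at a purely structural level, as the loop below. The two callables are placeholders for the actual segmentation and completion networks; their names and signatures are assumptions, not the thesis API.

```python
def decompose_scene(image, segment_unoccluded, complete_background, max_layers=5):
    """Interleave instance segmentation and scene completion: at each layer,
    fully visible objects are segmented and removed, the revealed regions are
    completed, and the process repeats on the de-occluded scene."""
    layers, canvas = [], image
    for depth in range(max_layers):
        instances = segment_unoccluded(canvas)       # e.g. list of (mask, rgba) pairs
        if not instances:
            break                                     # only background remains
        layers.append({"occlusion_level": depth, "instances": instances})
        revealed = sum(mask for mask, _ in instances)
        canvas = complete_background(canvas, revealed)  # inpaint behind removed objects
    return layers, canvas                             # per-layer objects + background
```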