Synthesizing photorealistic images with deep generative learning

Bibliographic Details
Main Author: Zheng, Chuanxia
Other Authors: Cham, Tat Jen (School of Computer Science and Engineering)
Format: Thesis (Doctor of Philosophy)
Language: English
Published: Nanyang Technological University, 2021
Subjects: Engineering::Computer science and engineering
Online Access: https://hdl.handle.net/10356/153008
DOI: 10.32657/10356/153008
Citation: Zheng, C. (2021). Synthesizing photorealistic images with deep generative learning. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/153008
License: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
Full description

The goal of this thesis is to present my research contributions towards solving various visual synthesis and generation tasks, comprising image translation, image completion, and completed scene decomposition. The thesis consists of five pieces of work, each presenting a new learning-based approach for synthesizing images with plausible content and visually realistic appearance. Each work demonstrates the effectiveness of the proposed approach for image synthesis, and some further contribute to related tasks such as depth estimation.

Part 1 describes methods for changing visual appearance. In Chapter 2, a synthetic-to-realistic translation system is presented to address real-world single-image depth estimation, where only synthetic image-depth pairs and unpaired real images are used for training. This model provides a new perspective on a real-world estimation task by exploiting low-cost yet highly reusable synthetic data. Chapter 3 turns to general image-to-image (I2I) translation, rather than only synthetic-to-realistic translation. A novel spatially-correlative loss is proposed that is simple and efficient, yet effective at preserving scene structure while supporting large appearance changes. Spatial patterns of self-similarity are exploited to define scene structure, so the loss captures only the spatial relationships within an image rather than domain appearance. Extensive experimental results demonstrate significant improvements from this content loss on several I2I tasks, including single-modal, multi-modal, and even single-image translation. Furthermore, the loss can easily be integrated into existing network architectures, allowing wide applicability.

Part 2 presents approaches that generate semantically reasonable content for masked regions. Instead of purely modifying local appearance as in Part 1, two approaches are presented that create new content, as well as realistic appearance, for a given image. Chapter 4 introduces a new task, pluralistic image completion: generating multiple, diverse, plausible results, as opposed to previous works that attempt only a single "guess" for this highly subjective problem. A novel probabilistically principled framework is proposed, which achieved state-of-the-art results for this new task and has become the benchmark for later work. A subsequent observation, however, is that convolutional neural network (CNN) architectures model long-range dependencies only through many stacked layers, so holes are progressively influenced by neighboring pixels, which introduces artifacts. To mitigate this, Chapter 5 treats image completion as a directionless sequence-to-sequence prediction task and deploys a transformer to directly capture long-range dependencies in the encoder in a first phase. Crucially, a restrictive CNN with small, non-overlapping receptive fields (RFs) is employed for token representation, which allows the transformer to model long-range context relations explicitly and with equal importance in all layers, without implicitly confounding neighboring tokens as happens when larger RFs are used. Extensive experiments demonstrate superior performance over previous CNN-based methods on several datasets.
Part 3 combines recognition learning and recent generative modeling into a holistic scene decomposition and completion framework, in which a network is trained to decompose a scene into individual objects, infer their underlying occlusion relationships, and moreover imagine what the originally occluded objects may look like, using only a single image as input. In Chapter 6, the aim is to derive a higher-level structural decomposition of a scene, automatically recognizing objects and generating intact shapes as well as photorealistic appearances for occluded regions, without the manual masking required in Part 2. To achieve this, a new pipeline is presented that interleaves instance segmentation and scene completion over multiple iterations, solving for objects in a layer-by-layer manner. The proposed system shows significant improvement over state-of-the-art methods and enables interesting applications such as scene editing and recomposition.

In summary, this thesis introduces a series of works that synthesize photorealistic images by changing appearance, imagining semantic content, and automatically inferring invisible shape and appearance.
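To make the spatially-correlative idea of Chapter 3 concrete, the following is a minimal illustrative sketch, not the thesis implementation: a local self-similarity map is computed for some feature representation of the source and of the translated image, and the loss compares the two maps, so structure is constrained while appearance is free to change. The function names, the patch size, and the use of raw tensors as stand-in features are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def self_similarity_map(feat, patch=7):
    """Cosine similarity of each spatial location to the features in its
    patch x patch neighbourhood -- a simple notion of local self-similarity."""
    b, c, h, w = feat.shape
    feat = F.normalize(feat, dim=1)
    # neighbourhoods as columns: (b, c * patch * patch, h * w)
    neigh = F.unfold(feat, kernel_size=patch, padding=patch // 2)
    neigh = neigh.view(b, c, patch * patch, h * w)
    query = feat.view(b, c, 1, h * w)
    return (query * neigh).sum(dim=1)          # (b, patch * patch, h * w)

def spatially_correlative_loss(feat_src, feat_trans, patch=7):
    """Penalise differences between the self-similarity patterns of the
    source and the translated image, rather than their appearances."""
    return F.l1_loss(self_similarity_map(feat_trans, patch),
                     self_similarity_map(feat_src, patch))

if __name__ == "__main__":
    # Random tensors stand in for features from any fixed extractor.
    src, trans = torch.rand(1, 64, 32, 32), torch.rand(1, 64, 32, 32)
    print(spatially_correlative_loss(src, trans).item())
```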
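The restrictive token representation described in Chapter 5 can likewise be sketched as a toy model, assuming arbitrary layer sizes and names rather than the thesis architecture: tokens come from a convolution whose kernel size equals its stride, so each token's receptive field is a single non-overlapping patch; masked patches are replaced by a learned mask token, and only a standard transformer encoder mixes information across positions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenCompletionSketch(nn.Module):
    """Toy completion encoder: a 'restrictive' convolution tokenises
    non-overlapping patches, and only the transformer models long-range
    context, explicitly and in every layer."""
    def __init__(self, patch=16, dim=256, layers=6, heads=8, max_tokens=1024):
        super().__init__()
        # kernel_size == stride: each token sees exactly one patch
        self.to_tokens = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, max_tokens, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)
        self.to_pixels = nn.ConvTranspose2d(dim, 3, kernel_size=patch, stride=patch)
        self.patch = patch

    def forward(self, img, mask):
        # img: (b, 3, H, W); mask: (b, 1, H, W), 1 marks missing pixels
        tok = self.to_tokens(img * (1 - mask))             # (b, dim, h, w)
        b, d, h, w = tok.shape
        tok = tok.flatten(2).transpose(1, 2)               # (b, h*w, dim)
        hole = F.max_pool2d(mask, self.patch).flatten(2).transpose(1, 2)
        tok = tok * (1 - hole) + self.mask_token * hole    # swap in mask tokens
        tok = self.encoder(tok + self.pos[:, : h * w])
        return self.to_pixels(tok.transpose(1, 2).view(b, d, h, w))

if __name__ == "__main__":
    net = TokenCompletionSketch()
    img, mask = torch.rand(1, 3, 256, 256), torch.zeros(1, 1, 256, 256)
    mask[..., 96:160, 96:160] = 1                          # a square hole
    print(net(img, mask).shape)                            # (1, 3, 256, 256)
```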
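Finally, the layer-by-layer decomposition of Chapter 6 can be written, at a purely structural level, as the loop below. The two callables are placeholders for the actual segmentation and completion networks; their names and signatures are assumptions, not the thesis API.

```python
def decompose_scene(image, segment_unoccluded, complete_background, max_layers=5):
    """Interleave instance segmentation and scene completion: at each layer,
    fully visible objects are segmented and removed, the revealed regions are
    completed, and the process repeats on the de-occluded scene."""
    layers, canvas = [], image
    for depth in range(max_layers):
        instances = segment_unoccluded(canvas)       # e.g. list of (mask, rgba) pairs
        if not instances:
            break                                     # only background remains
        layers.append({"occlusion_level": depth, "instances": instances})
        revealed = sum(mask for mask, _ in instances)
        canvas = complete_background(canvas, revealed)  # inpaint behind removed objects
    return layers, canvas                             # per-layer objects + background
```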