Deep generative modeling for image synthesis and manipulation

Bibliographic Details
Main Author: Yu, Yingchen
Other Authors: Lu, Shijian
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University, 2024
Online Access: https://hdl.handle.net/10356/172957
Institution: Nanyang Technological University
Description
Summary: Generating and editing visual content have long attracted great attention in both research and applications. With the rise of deep neural networks (DNNs), deep generative models have defined a new state of the art in various image generation and manipulation tasks. Despite this significant progress, three major challenges remain. First, generating results that are both high-quality and diverse for complex scenes or objects remains difficult. Second, precise and flexible control over the generated content is hard to achieve, which hinders practical image manipulation. Third, generated content often lacks 3D awareness, which is essential for producing realistic, contextually accurate content that adheres to the rules of the real world.

As a subtask of conditional generation, image inpainting treats corrupted or masked input images as conditions. With recent advances in generative adversarial networks (GANs), GAN-based methods have been widely explored for deterministic image inpainting. However, existing methods often suffer from two issues. First, they tend to adopt a hybrid objective of reconstruction and perceptual quality, which often leads to inter-frequency conflicts and compromised inpainting. Second, their inpainting results are usually deterministic, which does not reflect the one-to-many nature of the inpainting task. We also explore another subtask, exemplar-based image translation, which translates an input image into a target domain using one or more reference images (exemplars) from the target domain as guidance. The user can interact with the conditional input (e.g., edges or semantic maps) to achieve image manipulation; the main challenge is to achieve accurate style guidance and preserve details from the exemplars. We further explore text-guided image manipulation, which allows users to manipulate images via natural language descriptions. Leveraging StyleGAN's expressivity and its disentangled latent codes, existing methods can achieve realistic editing of various visual attributes, but they are often limited to in-domain editing and lack 3D awareness.

In this thesis, we propose several novel techniques for image synthesis and manipulation that aim to generate more realistic and diverse images and to achieve more accurate and flexible image manipulation. First, we design a wavelet-based inpainting network that decomposes images into multiple frequency bands and fills the missing regions in each band separately and explicitly; it effectively mitigates inter-frequency conflicts while completing images in the spatial domain, thus improving inpainting quality. Second, we introduce an image inpainting framework built on a novel bidirectional autoregressive transformer (BAT) for diverse image inpainting. Third, we design a bi-level feature alignment framework that efficiently and accurately builds dense correspondences between conditional inputs and exemplars, leading to more accurate style guidance and better preservation of details. Fourth, to achieve counterfactual editing without the need for additional training data, we design a framework that comprehensively exploits the rich semantic knowledge of the large-scale pre-trained Contrastive Language-Image Pre-training (CLIP) model. Fifth, we propose a learnable morphing network that morphs the 3D geometry of images toward target descriptions via a generative neural radiance field (NeRF), achieving accurate and controllable 3D-aware image editing.

Extensive experiments demonstrate that the proposed techniques effectively address or mitigate the respective issues and extend the boundary of realistic image synthesis and manipulation. Although they focus on specific subtasks, the underlying ideas could be insightful for other generation tasks.
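As a small illustration of the frequency-band idea behind the wavelet-based inpainting network described in the summary, the following is a minimal sketch, not the thesis model: it decomposes an image into wavelet sub-bands with PyWavelets, fills the masked coefficients in each band with a placeholder per-band completion step, and recomposes the result in the spatial domain. The `complete_band` helper is hypothetical; a real system would replace it with a learned completion network per band.

```python
# Minimal sketch: decompose -> complete each sub-band -> recompose.
import numpy as np
import pywt


def complete_band(band: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Placeholder per-band completion: fill masked coefficients with the
    mean of the known ones (a learned network would be used in practice)."""
    known = band[mask < 0.5]
    filled = band.copy()
    filled[mask >= 0.5] = known.mean() if known.size else 0.0
    return filled


def wavelet_inpaint(image: np.ndarray, mask: np.ndarray, wavelet: str = "haar") -> np.ndarray:
    """image, mask: 2D arrays of the same shape; mask == 1 marks missing pixels."""
    # Single-level 2D DWT: low-frequency approximation + three detail bands.
    cA, (cH, cV, cD) = pywt.dwt2(image, wavelet)
    # Downsample the mask to the sub-band resolution.
    mask_lo = mask[::2, ::2][: cA.shape[0], : cA.shape[1]]
    bands = [complete_band(b, mask_lo) for b in (cA, cH, cV, cD)]
    # Recompose in the spatial domain and keep the known pixels as-is.
    recon = pywt.idwt2((bands[0], tuple(bands[1:])), wavelet)
    recon = recon[: image.shape[0], : image.shape[1]]
    return np.where(mask >= 0.5, recon, image)
```

Completing each band separately, as in the sketch, is what keeps the reconstruction and perceptual objectives from pulling different frequency ranges in conflicting directions.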
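Similarly, the CLIP-based editing framework can be loosely illustrated by latent optimization against a text prompt. The sketch below assumes a hypothetical pretrained `generator` that maps a latent code to an RGB image in [-1, 1]; only the CLIP similarity objective and the optimization loop reflect the general technique, not the thesis implementation.

```python
# Minimal sketch of CLIP-guided latent optimization for text-driven editing.
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

# CLIP's standard input normalization statistics.
CLIP_MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(1, 3, 1, 1)
CLIP_STD = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(1, 3, 1, 1)


def clip_guided_edit(generator, latent_init, prompt, steps=100, lr=0.05, device="cpu"):
    """Optimize a latent code so the generated image matches a text prompt."""
    model, _ = clip.load("ViT-B/32", device=device)
    model = model.float().eval()
    for p in model.parameters():
        p.requires_grad_(False)

    with torch.no_grad():
        text_feat = model.encode_text(clip.tokenize([prompt]).to(device))
        text_feat = F.normalize(text_feat, dim=-1)

    latent = latent_init.clone().to(device).requires_grad_(True)
    opt = torch.optim.Adam([latent], lr=lr)
    mean, std = CLIP_MEAN.to(device), CLIP_STD.to(device)

    for _ in range(steps):
        img = generator(latent)              # assumed shape (1, 3, H, W) in [-1, 1]
        img = (img + 1) / 2                  # rescale to [0, 1]
        img = F.interpolate(img, size=224, mode="bicubic", align_corners=False)
        img_feat = F.normalize(model.encode_image((img - mean) / std), dim=-1)
        loss = 1.0 - (img_feat * text_feat).sum()   # 1 - cosine similarity
        opt.zero_grad()
        loss.backward()
        opt.step()
    return latent.detach()
```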