Deep generative modeling for image synthesis and manipulation
Main Author: | Yu, Yingchen |
---|---|
Other Authors: | Lu Shijian |
Format: | Thesis-Doctor of Philosophy |
Language: | English |
Published: | Nanyang Technological University, 2024 |
Subjects: | Engineering::Computer science and engineering |
Online Access: | https://hdl.handle.net/10356/172957 |
Institution: | Nanyang Technological University |
Description:
Generating and editing visual content has long attracted great attention in both research and applications. With the rise of deep neural networks (DNNs), deep generative models have set a new state of the art across a variety of image generation and manipulation tasks. Despite this significant progress, three major challenges remain. First, attaining both high-quality and diverse generations for complex scenes or objects is still difficult. Second, precise and flexible control over the generated content is hard to achieve, which hinders practical image manipulation. Third, generated content often lacks 3D awareness, which is essential for producing realistic, contextually accurate results that adhere to the rules of the real world.
As a subtask of conditional generation, image inpainting treats corrupted or masked input images as conditions. With recent advances in GANs, GAN-based methods have been widely explored for deterministic image inpainting. However, existing methods often suffer from two issues. First, they tend to adopt a hybrid objective that balances reconstruction and perceptual quality, which often leads to inter-frequency conflicts and compromised inpainting. Second, the results of such GAN-based methods are usually deterministic, which does not reflect the one-to-many nature of the inpainting task. We also explore another subtask, exemplar-based image translation, which translates an input image into a target domain using one or more reference images (exemplars) from that domain as guidance. The user can interact with the input image (e.g., edges or semantic maps) to achieve image manipulation; the main challenge is to obtain accurate style guidance and preserve details from the exemplars. Finally, we explore text-guided image manipulation, which allows users to edit images via natural language descriptions. Leveraging StyleGAN's expressivity and disentangled latent codes, existing methods can realistically edit various visual attributes, but they are often limited to in-domain editing and lack 3D awareness.
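As a minimal illustration of the hybrid objective discussed above (not the thesis implementation), the following PyTorch-style sketch combines a masked L1 reconstruction term with a VGG-based perceptual term; the layer choice, loss weight, and function names are assumptions for illustration only.

```python
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

# Frozen VGG feature extractor for the perceptual term (illustrative choice).
_vgg = vgg16(weights=VGG16_Weights.DEFAULT).features[:16].eval()
for p in _vgg.parameters():
    p.requires_grad_(False)

def hybrid_inpainting_loss(pred, target, mask, perc_weight=0.1):
    """pred/target: (N, 3, H, W) images; mask: 1 where pixels are missing."""
    # Reconstruction term: favours low-frequency fidelity and tends to blur.
    rec = F.l1_loss(pred * mask, target * mask)
    # Perceptual term: favours high-frequency texture and realism.
    perc = F.l1_loss(_vgg(pred), _vgg(target))
    # The two terms pull the generator toward different frequency content,
    # which is the inter-frequency conflict noted above.
    return rec + perc_weight * perc
```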
In this thesis, we propose several novel techniques for image synthesis and manipulation that aim to generate more realistic and diverse images and to achieve more accurate and flexible image manipulation. First, we design a wavelet-based inpainting network that decomposes images into multiple frequency bands and fills the missing regions in each frequency band separately and explicitly; it effectively mitigates inter-frequency conflicts while completing images in the spatial domain, thus improving inpainting quality. Second, we introduce an image inpainting framework built around a novel bidirectional autoregressive transformer (BAT) for diverse image inpainting. Third, we design a bi-level feature alignment framework that efficiently and accurately builds dense correspondences between conditional inputs and exemplars, which leads to more accurate style guidance and better preservation of details. Fourth, to achieve counterfactual editing without additional training data, we design a framework that comprehensively exploits the rich semantic knowledge of the large-scale pre-trained Contrastive Language-Image Pre-training (CLIP) model. Fifth, we propose a learnable morphing network that morphs the 3D geometry of images toward target descriptions via a generative neural radiance field (NeRF), achieving accurate and controllable 3D-aware image editing. Extensive experiments demonstrate that the proposed techniques effectively tackle the respective issues.
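To make the per-band idea concrete, here is a minimal sketch (not the thesis network) of splitting an image into wavelet sub-bands with PyWavelets, completing each band separately, and recomposing in the spatial domain; `fill_band` is a hypothetical placeholder for any per-band completion model.

```python
import pywt

def wavelet_band_inpaint(image, mask, fill_band):
    """image: (H, W) float array; mask: 1 where pixels are missing.
    fill_band is a hypothetical per-band completion function."""
    # Decompose the image into one low-frequency and three high-frequency
    # sub-bands with a single-level Haar wavelet transform.
    low, (horiz, vert, diag) = pywt.dwt2(image, "haar")
    # Downsample the mask to the sub-band resolution (Haar halves each axis).
    band_mask = mask[::2, ::2]
    # Fill the missing regions in each frequency band separately and explicitly.
    filled = [fill_band(band, band_mask) for band in (low, horiz, vert, diag)]
    # Recompose the completed sub-bands back into a spatial-domain image.
    return pywt.idwt2((filled[0], tuple(filled[1:])), "haar")
```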
Overall, the proposed methods effectively address or mitigate the aforementioned issues and push the boundary of realistic image synthesis and manipulation. Although they focus on specific subtasks, the underlying ideas may prove useful for other generation tasks.
School: | School of Computer Science and Engineering |
---|---|
Supervisor Contact: | Shijian.Lu@ntu.edu.sg |
Degree: | Doctor of Philosophy |
Issued: | 2023 |
Deposited: | 2024-01-08 |
Citation: | Yu, Y. (2023). Deep generative modeling for image synthesis and manipulation. Doctoral thesis, Nanyang Technological University, Singapore. |
DOI: | 10.32657/10356/172957 |
License: | Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) |
Format: | application/pdf |