Deep generative modeling for image synthesis and manipulation
Main Author: | Yu, Yingchen |
---|---|
Other Authors: | Lu Shijian |
Format: | Thesis-Doctor of Philosophy |
Language: | English |
Published: | Nanyang Technological University, 2024 |
Subjects: | Engineering::Computer science and engineering |
Online Access: | https://hdl.handle.net/10356/172957 |
Institution: | Nanyang Technological University |
Description:
Generating and editing visual content has long attracted great attention in both research and applications. With the rise of deep neural networks (DNNs), deep generative models have set a new state of the art across a variety of image generation and manipulation tasks. Despite this significant progress, three major challenges remain. First, attaining both high-quality and diverse generations for complex scenes or objects is still difficult. Second, precise and flexible control over the generated content is hard to achieve, which hinders practical image manipulation. Third, generated content often lacks 3D awareness, which is essential for producing realistic, contextually accurate results that adhere to the rules of the real world.
As a subtask of conditional generation, image inpainting treats corrupted or masked input images as conditions. With recent advances in GANs, GAN-based methods have been widely explored for deterministic image inpainting. However, existing methods often suffer from two issues. First, they tend to adopt a hybrid objective that balances reconstruction and perceptual quality, which often leads to inter-frequency conflicts and compromised inpainting. Second, the results of such GAN-based methods are usually deterministic, which does not reflect the one-to-many nature of the inpainting task. We also explore another subtask, exemplar-based image translation, which translates an input image into a target domain using one or more reference images (exemplars) from that domain as guidance. The user can interact with the input image (e.g., edges or semantic maps) to achieve image manipulation; the main challenge is to obtain accurate style guidance and preserve details from the exemplars. Finally, we explore text-guided image manipulation, which allows users to edit images via natural language descriptions. Leveraging StyleGAN's expressivity and disentangled latent codes, existing methods can realistically edit various visual attributes, but they are often limited to in-domain editing and lack 3D awareness.
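As a minimal illustration of the hybrid objective discussed above (not the thesis implementation), the following PyTorch-style sketch combines a masked L1 reconstruction term with a VGG-based perceptual term; the layer choice, loss weight, and function names are assumptions for illustration only.

```python
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

# Frozen VGG feature extractor for the perceptual term (illustrative choice).
_vgg = vgg16(weights=VGG16_Weights.DEFAULT).features[:16].eval()
for p in _vgg.parameters():
    p.requires_grad_(False)

def hybrid_inpainting_loss(pred, target, mask, perc_weight=0.1):
    """pred/target: (N, 3, H, W) images; mask: 1 where pixels are missing."""
    # Reconstruction term: favours low-frequency fidelity and tends to blur.
    rec = F.l1_loss(pred * mask, target * mask)
    # Perceptual term: favours high-frequency texture and realism.
    perc = F.l1_loss(_vgg(pred), _vgg(target))
    # The two terms pull the generator toward different frequency content,
    # which is the inter-frequency conflict noted above.
    return rec + perc_weight * perc
```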
In this thesis, we propose several novel techniques for image synthesis and manipulation that aim to generate more realistic and diverse images and to achieve more accurate and flexible image manipulation. First, we design a wavelet-based inpainting network that decomposes images into multiple frequency bands and fills the missing regions in each frequency band separately and explicitly; it effectively mitigates inter-frequency conflicts while completing images in the spatial domain, thus improving inpainting quality. Second, we introduce an image inpainting framework built around a novel bidirectional autoregressive transformer (BAT) for diverse image inpainting. Third, we design a bi-level feature alignment framework that efficiently and accurately builds dense correspondences between conditional inputs and exemplars, which leads to more accurate style guidance and better preservation of details. Fourth, to achieve counterfactual editing without additional training data, we design a framework that comprehensively exploits the rich semantic knowledge of the large-scale pre-trained Contrastive Language-Image Pre-training (CLIP) model. Fifth, we propose a learnable morphing network that morphs the 3D geometry of images toward target descriptions via a generative neural radiance field (NeRF), achieving accurate and controllable 3D-aware image editing. Extensive experiments demonstrate that the proposed techniques effectively tackle the respective issues.
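To make the per-band idea concrete, here is a minimal sketch (not the thesis network) of splitting an image into wavelet sub-bands with PyWavelets, completing each band separately, and recomposing in the spatial domain; `fill_band` is a hypothetical placeholder for any per-band completion model.

```python
import pywt

def wavelet_band_inpaint(image, mask, fill_band):
    """image: (H, W) float array; mask: 1 where pixels are missing.
    fill_band is a hypothetical per-band completion function."""
    # Decompose the image into one low-frequency and three high-frequency
    # sub-bands with a single-level Haar wavelet transform.
    low, (horiz, vert, diag) = pywt.dwt2(image, "haar")
    # Downsample the mask to the sub-band resolution (Haar halves each axis).
    band_mask = mask[::2, ::2]
    # Fill the missing regions in each frequency band separately and explicitly.
    filled = [fill_band(band, band_mask) for band in (low, horiz, vert, diag)]
    # Recompose the completed sub-bands back into a spatial-domain image.
    return pywt.idwt2((filled[0], tuple(filled[1:])), "haar")
```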
Overall, the proposed methods effectively address or mitigate the aforementioned issues and push the boundary of realistic image synthesis and manipulation. Although they focus on specific subtasks, the underlying ideas may prove useful for other generation tasks.
School: | School of Computer Science and Engineering |
---|---|
Supervisor Contact: | Shijian.Lu@ntu.edu.sg |
Degree: | Doctor of Philosophy |
Issued: | 2023 |
Deposited: | 2024-01-08 |
Citation: | Yu, Y. (2023). Deep generative modeling for image synthesis and manipulation. Doctoral thesis, Nanyang Technological University, Singapore. |
DOI: | 10.32657/10356/172957 |
License: | Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) |
Format: | application/pdf |