Learning to control visual data translation

Bibliographic Details
Main Author: Koksal, Ali
Other Authors: Deepu Rajan
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2023
Subjects:
Online Access:https://hdl.handle.net/10356/165566
Institution: Nanyang Technological University
Description
Summary: With advances in deep learning models such as Convolutional Neural Networks (CNNs) and Generative Adversarial Networks (GANs), the generation of high-dimensional data such as images and videos has achieved photo-realistic results, driven in particular by progress in image-to-image translation. The ability to control the translation is an important aspect of modifying existing content and synthesizing novel content. In this thesis, we address controllable high-dimensional data translation in three variants: (i) unpaired image-to-image translation, (ii) motion-controllable video generation, and (iii) motion-aware mask-to-frame translation.

In unpaired image-to-image translation, images from a source domain are translated to a target domain whose images often differ in characteristics such as color and style, in the absence of paired images in the training set. Learning to map between domains is challenging without the supervision that paired images would provide. In addition, state-of-the-art GANs for unpaired image-to-image translation are often burdened by large model sizes. To tackle these challenges, we introduce a reconfigurable generator, inspired by the observation that the mappings between two domains are often approximately invertible, and a multi-domain discriminator that jointly discriminates original and translated samples from different domains. We propose two compact models that employ the reconfigurable generator and the multi-domain discriminator. The first, Reconfigurable Generative Adversarial Network (RF-GAN), consistently achieves high-fidelity translation with a model that is up to 88% more compact than state-of-the-art GANs. The second, Transformer-based Reconfigurable Generative Adversarial Network (TRF-GAN), replaces certain convolutions in RF-GAN's generator with transformers and further improves translation performance with an even more compact model that has approximately 25% fewer parameters than RF-GAN.

Motion-controllable video generation is a variant of high-dimensional data translation in which an initial frame is translated into subsequent frames while controlling the motion of the object of interest. We address this variant through text-based control over the action performed in the generated video. Building a semantic association between instructions and motion is challenging because text descriptions are often ambiguous for video generation. To overcome these challenges, we introduce a novel framework, named Controllable Video Generation with text-based Instructions (CVGI), that allows text-based control over the action performed in a video. By incorporating a motion estimation layer, the proposed framework divides the task into two subtasks: (i) control signal estimation and (ii) action generation. In control signal estimation, an encoder models actions as a set of simple motions by estimating low-level control signals from text-based instructions and given initial frames. In action generation, we employ a GAN to generate realistic videos conditioned on the estimated low-level signals. Evaluations on several datasets show the effectiveness of CVGI in generating realistic videos and in controlling actions. Although CVGI can generate realistic videos that correspond well with instructions and can control motion according to them, it is limited in generating egocentric videos.
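As a rough illustration of the reconfigurable-generator and multi-domain-discriminator ideas described above for unpaired translation, the following minimal PyTorch-style sketch reuses one set of generator weights for both translation directions and attaches one discriminator head per domain. All module names, shapes, and layer choices here are hypothetical assumptions for illustration and are not the actual RF-GAN or TRF-GAN architecture.

    # Hypothetical sketch: one generator whose shared blocks are traversed in
    # forward or reverse order depending on the translation direction, and a
    # single discriminator with one output head per domain so samples from
    # both domains are judged jointly.
    import torch
    import torch.nn as nn

    class ReconfigurableGenerator(nn.Module):
        def __init__(self, channels=64, n_blocks=4):
            super().__init__()
            self.stem = nn.Conv2d(3, channels, 3, padding=1)
            self.blocks = nn.ModuleList(
                nn.Sequential(
                    nn.Conv2d(channels, channels, 3, padding=1),
                    nn.InstanceNorm2d(channels),
                    nn.ReLU(inplace=True),
                )
                for _ in range(n_blocks)
            )
            self.head = nn.Conv2d(channels, 3, 3, padding=1)

        def forward(self, x, direction="a2b"):
            # Reconfiguration: the same blocks serve both mappings, traversed in
            # reverse order for the (approximately) inverse direction.
            h = self.stem(x)
            blocks = self.blocks if direction == "a2b" else reversed(self.blocks)
            for blk in blocks:
                h = h + blk(h)  # residual update keeps the mapping close to invertible
            return torch.tanh(self.head(h))

    class MultiDomainDiscriminator(nn.Module):
        def __init__(self, channels=64, n_domains=2):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv2d(3, channels, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(channels, channels, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            )
            # One patch-level real/fake head per domain.
            self.heads = nn.ModuleList(nn.Conv2d(channels, 1, 3, padding=1)
                                       for _ in range(n_domains))

        def forward(self, x, domain):
            return self.heads[domain](self.backbone(x))

    G = ReconfigurableGenerator()
    D = MultiDomainDiscriminator()
    a = torch.randn(1, 3, 64, 64)          # image from domain A
    fake_b = G(a, direction="a2b")         # A -> B translation
    recon_a = G(fake_b, direction="b2a")   # B -> A with the same weights
    score = D(fake_b, domain=1)            # judged by the domain-B head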
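The two-subtask split described for CVGI can be sketched in a similarly simplified way: a hypothetical encoder maps an initial frame and an embedded instruction to a low-level control signal, and a conditional generator synthesizes the next frame from the frame and that signal. The layer choices, dimensions, and names below are illustrative assumptions, not the thesis's actual networks.

    # Hypothetical sketch of the two-stage split:
    # (i) control signal estimation, (ii) action generation.
    import torch
    import torch.nn as nn

    class ControlSignalEncoder(nn.Module):
        """Stage (i): instruction embedding + initial frame -> low-level control signal."""
        def __init__(self, text_dim=128, signal_dim=16):
            super().__init__()
            self.frame_enc = nn.Sequential(
                nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
            self.fuse = nn.Sequential(
                nn.Linear(32 + text_dim, 128), nn.ReLU(),
                nn.Linear(128, signal_dim),  # e.g. keypoint shifts or flow parameters
            )

        def forward(self, frame, text_emb):
            return self.fuse(torch.cat([self.frame_enc(frame), text_emb], dim=1))

    class ActionGenerator(nn.Module):
        """Stage (ii): initial frame + control signal -> next frame."""
        def __init__(self, signal_dim=16):
            super().__init__()
            self.to_map = nn.Linear(signal_dim, 64 * 64)  # broadcast signal spatially
            self.net = nn.Sequential(
                nn.Conv2d(3 + 1, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, 3, 3, padding=1), nn.Tanh(),
            )

        def forward(self, frame, signal):
            signal_map = self.to_map(signal).view(frame.size(0), 1, 64, 64)
            return self.net(torch.cat([frame, signal_map], dim=1))

    encoder, generator = ControlSignalEncoder(), ActionGenerator()
    frame = torch.randn(1, 3, 64, 64)      # initial frame
    text_emb = torch.randn(1, 128)         # embedded instruction, e.g. "move the arm left"
    signal = encoder(frame, text_emb)      # (i) control signal estimation
    next_frame = generator(frame, signal)  # (ii) action generation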
Egocentric videos are typically shot with a head-mounted camera, which introduces substantial camera movement and dynamic scenes. We therefore introduce motion-aware mask-to-frame translation, in which the mask of the object of interest in the next frame is translated to synthesize that frame so that it remains consistent with the initial frame. To address CVGI's limitation in egocentric video generation, we extend it with motion-aware mask-to-frame translation, where the next frame is translated from its mask using the initial frame and the mask of the object of interest in the initial frame as additional supervision. The proposed GAN feeds this additional supervision into the generator and incorporates three discriminators trained to distinguish real from generated whole frames, objects of interest, and backgrounds. To generate motion-controllable egocentric videos, masks of the object of interest are first generated to correspond with the text-based instructions, and these masks are then translated into frames by the motion-aware mask-to-frame translation GAN. Evaluations on a publicly available egocentric dataset show that the proposed GAN is capable of hallucinating pixels at the location of the object of interest in the initial frame, indicated by the initial mask, and of creating the object of interest at the new location indicated by the next mask.

To sum up, we design innovative models for three variations of controllable high-dimensional data translation, in which a mapping function is trained to translate input high-dimensional data into novel high-dimensional data under controlling conditions. Within the scope of this thesis, images, frames, and masks of the object of interest serve as the high-dimensional input data. We evaluate our models on benchmark datasets to show the effectiveness of the proposed frameworks. Simulating different motions can be used to train robotic systems such as robotic arms without requiring the collection of data for every possible motion. By predicting possible future outcomes under different motions, controllable video generation can also provide useful insight for intelligent decision-making systems such as driver assistance systems and autonomous drones.
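The motion-aware mask-to-frame setup summarized above can likewise be sketched with a generator that receives the next mask together with the initial frame and initial mask as additional supervision, and three discriminators that judge the whole frame, the object region, and the background separately. The networks and shapes below are illustrative assumptions rather than the proposed model.

    # Hypothetical sketch: mask-to-frame generator with additional supervision
    # and three discriminators (whole frame, object of interest, background).
    import torch
    import torch.nn as nn

    def patch_discriminator():
        # Shared architecture reused for the frame, object, and background critics.
        return nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, 3, padding=1),
        )

    class MaskToFrameGenerator(nn.Module):
        def __init__(self):
            super().__init__()
            # Input channels: next mask (1) + initial frame (3) + initial mask (1).
            self.net = nn.Sequential(
                nn.Conv2d(1 + 3 + 1, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, 3, 3, padding=1), nn.Tanh(),
            )

        def forward(self, next_mask, init_frame, init_mask):
            return self.net(torch.cat([next_mask, init_frame, init_mask], dim=1))

    G = MaskToFrameGenerator()
    D_frame, D_object, D_background = (patch_discriminator() for _ in range(3))

    init_frame = torch.randn(1, 3, 64, 64)
    init_mask = torch.rand(1, 1, 64, 64).round()  # binary object mask, initial frame
    next_mask = torch.rand(1, 1, 64, 64).round()  # binary object mask, next frame

    fake = G(next_mask, init_frame, init_mask)
    # Each discriminator sees a different view of the generated frame.
    scores = (D_frame(fake),
              D_object(fake * next_mask),            # object-of-interest region only
              D_background(fake * (1 - next_mask)))  # background only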