Image and video super-resolution in the wild
Main Author: | |
---|---|
Other Authors: | |
Format: | Thesis-Doctor of Philosophy |
Language: | English |
Published: | Nanyang Technological University, 2022 |
Subjects: | |
Online Access: | https://hdl.handle.net/10356/160140 |
Institution: | Nanyang Technological University |
Summary: | With the growing demand for high-resolution content, super-resolution techniques are needed to improve the resolution of images and videos captured with non-professional imaging devices. Researchers have made sustained efforts to improve image and video resolution, both to enhance user experience and to boost performance in downstream tasks. However, most existing approaches focus on designing an image-to-image mapping and fail to exploit auxiliary information that is readily available in practice. As a result, such methods are often suboptimal in both effectiveness and efficiency, owing to inadequate information aggregation and high network complexity. In addition, it remains nontrivial to generalize to uncontrolled scenes, whose degradations can be complex, diverse, and unknown. This thesis proposes solutions for effective image and video super-resolution, and for generalization to real-world degradations, by exploiting generative priors and temporal information.
The thesis first demonstrates that pre-trained Generative Adversarial Networks (GANs), e.g., StyleGAN, can be used as a latent bank to improve the restoration quality of large-factor image super-resolution (SR). Our method, Generative LatEnt bANk (GLEAN), goes beyond existing practices by directly leveraging the rich and diverse priors encapsulated in a pre-trained GAN. GLEAN can be easily incorporated into a simple encoder-bank-decoder architecture with multi-resolution skip connections. Images upscaled by GLEAN show clear improvements in fidelity and texture faithfulness compared to existing methods.
Second, we study the underlying mechanism of deformable alignment, which shows compelling performance in aligning multiple frames for video super-resolution. Specifically, we show that deformable convolution can be decomposed into a combination of spatial warping and convolution, revealing the commonality of deformable alignment and flow-based alignment in formulation, but with a key difference in their offset diversity. Based on our observations, we propose an offset-fidelity loss that guides the offset learning with optical flow. Experiments show that our loss successfully avoids the overflow of offsets and alleviates the instability problem of deformable alignment.
Third, we reconsider some of the most essential components for video super-resolution, guided by four basic functionalities, i.e., Propagation, Alignment, Aggregation, and Upsampling. By reusing existing components with minimal redesign, we show that a succinct pipeline, BasicVSR, achieves appealing improvements in speed and restoration quality in comparison to many state-of-the-art algorithms. We conduct a systematic analysis to explain how such gains are obtained and discuss the pitfalls. We further show the extensibility of BasicVSR by presenting IconVSR and BasicVSR++. IconVSR contains an information-refill mechanism to alleviate the error accumulation problem, and a coupled propagation scheme to facilitate information flow during propagation. BasicVSR++ further enhances propagation and alignment with second-order grid propagation and flow-guided deformable alignment. Our BasicVSR series significantly outperforms existing works in both efficiency and output quality.
Fourth, we provide solutions to the unique challenges that the diversity and complexity of degradations induce in real-world video super-resolution, in both inference and training. First, we introduce an image pre-cleaning stage to reduce noise and artifacts prior to propagation, substantially improving the output quality. Second, we provide analysis and solutions to the problems resulting from the increased computational burden of the task. In addition, to facilitate fair comparisons, we propose the new VideoLQ dataset, which contains a large variety of real-world low-quality video sequences with rich textures and patterns. Our dataset can serve as a common ground for benchmarking. |
---|
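
The summary states that deformable convolution can be decomposed into spatial warping followed by convolution. The following is a minimal 1D sketch of that idea, not the thesis's implementation: the function names (`sample_linear`, `deformable_conv1d`, `warp`, `warp_then_conv1d`), the linear interpolation, the zero padding, and the toy values are all assumptions chosen for illustration. It computes the same output two ways: directly, by sampling the input at each kernel tap's shifted position, and in decomposed form, by first warping the input once per tap and then taking a weighted sum of the warped copies.

```python
import math


def sample_linear(x, pos):
    """Linearly interpolate the 1D signal x at a fractional position (zero outside)."""
    lo = math.floor(pos)
    frac = pos - lo

    def at(i):
        return x[i] if 0 <= i < len(x) else 0.0

    return (1.0 - frac) * at(lo) + frac * at(lo + 1)


def deformable_conv1d(x, weights, offsets):
    """Direct form: y[p] = sum_k w[k] * x(p + p_k + offsets[k][p])."""
    radius = len(weights) // 2
    out = []
    for p in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            acc += w * sample_linear(x, p + (k - radius) + offsets[k][p])
        out.append(acc)
    return out


def warp(x, displacement):
    """Spatial warping: resample x at p + displacement[p] for every position p."""
    return [sample_linear(x, p + displacement[p]) for p in range(len(x))]


def warp_then_conv1d(x, weights, offsets):
    """Decomposed form: one warp per kernel tap, then a weighted sum across taps."""
    radius = len(weights) // 2
    warped = [
        warp(x, [(k - radius) + d for d in offsets[k]])
        for k in range(len(weights))
    ]
    return [
        sum(w * warped[k][p] for k, w in enumerate(weights))
        for p in range(len(x))
    ]


# Toy input: a 3-tap kernel with one learned-style offset per tap and position.
x = [1.0, 2.0, 4.0, 8.0, 16.0]
weights = [0.25, 0.5, 0.25]
offsets = [
    [0.5, -0.25, 0.0, 0.75, -0.5],
    [0.0, 0.25, -0.5, 0.0, 0.5],
    [-0.75, 0.0, 0.25, -0.25, 0.0],
]

direct = deformable_conv1d(x, weights, offsets)
decomposed = warp_then_conv1d(x, weights, offsets)
```

In this sketch each kernel tap carries its own displacement field, so a 3-tap kernel uses three different warps of the same input. Under the same assumptions, a kernel with a single tap reduces to one warp per position, which is the shape flow-based alignment takes; this is one way to read the summary's remark that the two approaches differ in offset diversity.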