Dense prediction and deep learning in complex visual scenes


Bibliographic Details
Main Author: Wang, Yi
Other Authors: Lap-Pui Chau
Format: Thesis-Doctor of Philosophy
Language:English
Published: Nanyang Technological University 2021
Subjects:
Online Access:https://hdl.handle.net/10356/152009
Institution: Nanyang Technological University
id sg-ntu-dr.10356-152009
record_format dspace
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering
Engineering::Electrical and electronic engineering
description Many computer vision applications, such as video surveillance, autonomous driving, and crowd analysis, suffer from the challenging conditions of complex scenes, including haze, underwater environments, extreme lighting, and crowded and small objects. These scenes can degrade the performance of computer vision algorithms or cause them to fail outright, so it is valuable to develop methods that address such complex visual scenes. In this thesis, we follow a unified line of thinking across a series of dense prediction problems spanning low-level to high-level vision, i.e., restoration, detection, and recognition. In the restoration problem, haze and underwater scenes degrade the contrast and color of images due to light scattering and absorption; research on de-scattering or dehazing aims to restore images captured in such scenes. As the first research direction of this thesis, we propose a novel image restoration approach for underwater imagery based on an adaptive attenuation-curve prior (AACP). The prior captures the observation that the pixel values of a clear image can be partitioned into several hundred distinct clusters in RGB space, and that the pixel values in each cluster, after being attenuated by water, are distributed along a curve with a power-function form. The pixel-wise medium transmission can therefore be predicted from a pixel value's position on such a curve. This method is generalizable and can be extended to hazy images. Moreover, exploiting the fact that the ambient light originates from the infinitely distant region of an outdoor image, we propose a new deep learning-based framework that estimates the ambient light by distant region segmentation (DRS). Qualitative and quantitative results show that the proposed methods achieve superior performance in comparison with state-of-the-art methods. In the detection problem, crowded objects present large scale variation and severe occlusion, posing great challenges to object detectors.
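The attenuation-curve idea above can be illustrated with a toy sketch. This is not the thesis's AACP implementation: simple RGB quantization stands in for the several-hundred-cluster partition, and within each cluster the brightest pixel is assumed to be the least attenuated, so the ratio of a pixel's intensity to that maximum serves as a crude per-pixel transmission estimate.

```python
import numpy as np

def estimate_transmission(img, n_bins=8, eps=1e-6):
    """Toy attenuation-curve-style transmission estimate (illustrative only).

    img: HxWx3 float array.  Returns an HxW map of values in [0, 1].
    """
    h, w, _ = img.shape
    flat = img.reshape(-1, 3).astype(np.float64)
    # Coarse cluster label from quantized RGB chromaticity (stand-in for
    # the RGB-space clustering described in the thesis).
    norm = flat / (flat.sum(axis=1, keepdims=True) + eps)
    labels = (norm[:, :2] * n_bins).astype(int)
    keys = labels[:, 0] * n_bins + labels[:, 1]
    intensity = flat.sum(axis=1)
    t = np.empty(len(flat))
    for k in np.unique(keys):
        m = keys == k
        # Brightest pixel in the cluster is treated as unattenuated.
        t[m] = intensity[m] / (intensity[m].max() + eps)
    return t.reshape(h, w)
```

A real system would fit the power-function attenuation curve per cluster rather than using a raw intensity ratio, but the structure — cluster, then position each pixel along its cluster's curve — is the same.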
In addition, current crowd datasets provide only coarse point-level annotations, i.e., human heads are labeled as points, so state-of-the-art object detectors cannot be trivially applied under such point supervision. In our second research direction, we propose a novel self-training approach, termed Crowd-DCNet, that enables a typical object detector trained only with point-level annotations to densely predict the center points and sizes of crowded objects. Specifically, we propose the locally-uniform distribution assumption (LUDA) for initializing pseudo object sizes from point-level supervisory information, a crowdedness-aware loss for regressing object sizes, and a confidence- and order-aware refinement scheme for continuously refining the pseudo object sizes during training. With this self-training approach, the detector's capability is progressively strengthened. Moreover, bypassing object detection, we introduce a compact convolutional neural network (CNN) for object counting in video surveillance, in which a multi-scale density (MSD) regressor predicts coarse- and fine-scale density maps. Comprehensive experimental results on six challenging benchmark datasets show that our approach significantly outperforms state-of-the-art methods on both detection and counting tasks. In the recognition problem, small objects in unconstrained scenes adversely affect the accuracy of automatic recognition systems. Our third research direction focuses on automatic license plate recognition (ALPR) in unconstrained environments, such as oblique views, uneven illumination, and various weather conditions.
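One plausible reading of the pseudo-size initialization step can be sketched as follows: if nearby heads are roughly uniformly spaced, a point's distance to its nearest annotated neighbours bounds its object size. The function name and the exact rule below are illustrative assumptions, not the thesis's actual LUDA formulation.

```python
import numpy as np

def init_pseudo_sizes(points, k=3, default=16.0):
    """Initialize pseudo object sizes from point annotations (sketch).

    Under a locally-uniform spacing assumption, each point's size is
    taken as the mean distance to its k nearest neighbouring points.
    """
    pts = np.asarray(points, dtype=np.float64)
    n = len(pts)
    if n <= 1:
        # No neighbours to measure against; fall back to a fixed size.
        return np.full(n, default)
    # Pairwise Euclidean distances, with self-distance masked out.
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    kk = min(k, n - 1)
    nearest = np.sort(d, axis=1)[:, :kk]
    return nearest.mean(axis=1)
```

In a self-training pipeline such initial sizes would then be regressed and refined during training, as the abstract describes.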
Our study yields an effective ALPR design guided by four insights: (1) a resampling-based cascaded framework benefits both speed and accuracy; (2) highly efficient license plate recognition should abandon additional character segmentation and recurrent neural networks (RNNs) in favor of a plain CNN; (3) with a CNN, exploiting vertex information on license plates improves recognition performance; and (4) a weight-sharing character classifier addresses the lack of training images in small-scale datasets. Based on these insights, we propose a real-time, high-performing ALPR approach, termed VSNet. The vertex supervisory information is fully exploited to train a detector (VertexNet) that predicts the geometric shapes of license plates, so that the plates can be rectified and their characters densely predicted by a recognizer (SCR-Net). Moreover, we propose a dynamic regularization method to avoid overfitting and improve the generalization ability of the CNN. Experimental results on two challenging benchmark datasets demonstrate the effectiveness of the proposed method.
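The rectification step implied by vertex prediction is a standard perspective correction: given the four detected plate vertices, solve for the 3x3 homography that maps them onto an axis-aligned rectangle. The sketch below uses the textbook direct linear transform (DLT) and is a generic illustration of this step, not VertexNet's actual implementation.

```python
import numpy as np

def homography_from_vertices(src, dst):
    """Solve the homography H mapping four src points to four dst points.

    Each correspondence (x, y) -> (u, v) contributes two rows to the
    DLT linear system A h = 0; the solution is the right-singular
    vector of A with the smallest singular value.
    """
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.asarray(A, dtype=np.float64)
    _, _, vt = np.linalg.svd(A)
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]  # normalize so H[2, 2] == 1
```

Warping the plate image with this H (e.g., via a standard perspective-warp routine) yields a fronto-parallel plate on which a plain CNN can read the characters.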
author2 Lap-Pui Chau
format Thesis-Doctor of Philosophy
author Wang, Yi
author_sort Wang, Yi
title Dense prediction and deep learning in complex visual scenes
publisher Nanyang Technological University
publishDate 2021
url https://hdl.handle.net/10356/152009
_version_ 1772826553906888704
spelling sg-ntu-dr.10356-152009 2023-07-04T17:02:00Z Dense prediction and deep learning in complex visual scenes Wang, Yi Lap-Pui Chau School of Electrical and Electronic Engineering Centre for Information Sciences and Systems elpchau@ntu.edu.sg Engineering::Computer science and engineering Engineering::Electrical and electronic engineering Doctor of Philosophy 2021-07-13T03:21:22Z 2021-07-13T03:21:22Z 2021 Thesis-Doctor of Philosophy Wang, Y. (2021). Dense prediction and deep learning in complex visual scenes. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/152009 10.32657/10356/152009 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University