Deep learning for visual recognition at pixel, object, and image levels

Bibliographic Details
Main Author: Kuen, Jason Wen Yong
Other Authors: Tan Yap Peng
Format: Theses and Dissertations
Language: English
Published: 2019
Subjects:
Online Access:https://hdl.handle.net/10356/107566
http://hdl.handle.net/10220/50321
Institution: Nanyang Technological University
Description
Summary: Deep (machine) learning has in recent years significantly strengthened predictive modeling for applications in computer vision, speech recognition, and natural language processing. Being heavily data-driven, conventional general-purpose deep learning methods rely mainly on training data to learn task-specific data representations. In this thesis, we leverage novel learning algorithms in deep learning to tackle visual recognition challenges at three key levels: pixel, object, and image. Throughout the thesis, we demonstrate how each of the proposed methods addresses a specific visual recognition problem. (Brief illustrative code sketches of the four methods are given after this summary.)

Firstly, we improve the saliency detection performance of deep networks. A coarse saliency map of the input image's salient objects is first predicted by a convolutional-deconvolutional network (equivalently, a fully convolutional network). The coarse map is then refined sequentially by a separate network, the Recurrent Attentional Convolutional-Deconvolutional Network (RACDNN), whose weights are shared across its sequential steps. At every sequential (temporal) step, the same RACDNN is applied to a newly selected/attended image sub-region and refines the saliency map for that sub-region. Despite having many learnable parameters and no access to additional training data, RACDNN trains well and outperforms the "single-step" (coarse saliency detection) baseline as well as state-of-the-art saliency detection methods by large margins on several saliency detection benchmark datasets.

Secondly, we introduce the Auto-Encoder Weight Transfer Network (AE-WTN+), a deep model for scaling up object detection. A conventional weight transfer network (WTN) transfers object class knowledge from the classification weights of a pretrained large-scale image classification network to the classification weights of an object detection network. However, conventional WTNs suffer from under-fitting and cannot preserve the rich class information of novel classes. AE-WTN+ uses normalization techniques to ease network optimization: inputs and intermediate features are normalized to improve trainability and reduce under-fitting. More importantly, a novel auto-encoding (reconstruction) loss encourages the weight predictions of AE-WTN+ to preserve information about all classes, so that the predicted weights maintain the class discriminability and neighbourhood relationships contained in the pretrained large-scale classification weights. Experiments on large-scale detection datasets validate the effectiveness of AE-WTN+ on both seen and novel classes.

Thirdly, we improve the efficiency of cross-layer connections in very deep convolutional networks for image classification. The resulting architecture, Deluge Networks (DelugeNets), passes the output (feature activations) of every preceding composite layer as direct input to all succeeding composite layers. For the incoming feature activations, the succeeding composite layers first apply lightweight cross-layer depthwise convolutional weights for feature aggregation; the heavy lifting is carried out by the last convolutional weight layer in each preceding composite layer, which performs cross-channel convolutions. This enables unimpeded information flow among many layers in very deep DelugeNets with great efficiency. Experiments on small-scale and large-scale datasets show that DelugeNets perform competitively with state-of-the-art deep convolutional networks at lower computational and parameter costs.

Fourthly, we propose Stochastic Downsampling (SDPoint) to overcome a major limitation of conventional deep convolutional networks: their inference costs are fixed and cannot be adjusted. During training, SDPoint downsamples the feature maps at a randomly selected point (layer index) in the network hierarchy. Each downsampling configuration, known as an SDPoint instance, entails a unique downsampling point and downsampling ratio, and the same network weights are shared across all instances. During inference, the computational cost of an SDPoint-trained network can be adjusted instantaneously to fit a given budget by selecting an appropriate instance. In addition, sharing weights across feature-map scales provides significant regularization, making SDPoint-trained networks less sensitive to scale and more accurate. In image classification experiments, SDPoint demonstrates significant cost-accuracy advantages over independently trained (non-weight-shared) baselines with various inference costs.
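To make the refinement idea behind RACDNN more concrete, here is a minimal PyTorch-style sketch. It only illustrates the control flow described in the summary: a coarse conv-deconv network predicts a full-image saliency map, and a single refinement network with shared weights is applied step by step to image sub-regions, writing its refined output back into the map. The attention here simply walks a fixed grid of crops, whereas the thesis learns where to attend; the layer sizes, the crop size, and names such as CoarseSaliencyNet and RefinementNet are illustrative assumptions, not the actual RACDNN architecture.

```python
# Hedged sketch of the coarse-to-fine saliency refinement idea behind RACDNN.
# NOT the thesis implementation: attention is a fixed grid here, not learned.
import torch
import torch.nn as nn


class CoarseSaliencyNet(nn.Module):
    """Tiny conv-deconv (fully convolutional) stand-in for the coarse network."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return torch.sigmoid(self.decoder(self.encoder(x)))


class RefinementNet(nn.Module):
    """Refines one attended sub-region; its weights are shared across steps."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(),  # image crop + coarse map crop
            nn.Conv2d(16, 1, 3, padding=1),
        )

    def forward(self, image_crop, saliency_crop):
        residual = self.net(torch.cat([image_crop, saliency_crop], dim=1))
        return torch.sigmoid(saliency_crop + residual)  # refined sub-region map


def refine_saliency(image, coarse_net, refine_net, crop=64):
    """Apply the *same* refinement network sequentially to sub-regions."""
    saliency = coarse_net(image)
    _, _, H, W = saliency.shape
    for top in range(0, H, crop):          # fixed-grid "attention" for illustration only
        for left in range(0, W, crop):
            img_c = image[:, :, top:top + crop, left:left + crop]
            sal_c = saliency[:, :, top:top + crop, left:left + crop]
            saliency = saliency.clone()
            saliency[:, :, top:top + crop, left:left + crop] = refine_net(img_c, sal_c)
    return saliency


if __name__ == "__main__":
    img = torch.rand(1, 3, 128, 128)
    out = refine_saliency(img, CoarseSaliencyNet(), RefinementNet())
    print(out.shape)  # torch.Size([1, 1, 128, 128])
```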
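Next, a minimal sketch of the weight-transfer idea behind AE-WTN+, not the thesis implementation: a small encoder maps each pretrained classification weight vector to a detection classification weight vector, inputs and intermediate features are normalized (LayerNorm is an assumption here), and a decoder reconstructs the original weights so that an auto-encoding loss can encourage information about all classes, including novel ones, to be preserved. The layer sizes, weight dimensions, and the loss weight lambda_rec are placeholders.

```python
# Hedged sketch of AE-WTN+-style weight transfer with an auto-encoding loss.
import torch
import torch.nn as nn


class AEWTN(nn.Module):
    def __init__(self, cls_dim=2048, det_dim=1024, hidden=512):
        super().__init__()
        self.in_norm = nn.LayerNorm(cls_dim)           # normalize the input weights
        self.encoder = nn.Sequential(
            nn.Linear(cls_dim, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, det_dim),                # predicted detection class weights
        )
        self.decoder = nn.Sequential(                  # reconstructs the original weights
            nn.Linear(det_dim, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, cls_dim),
        )

    def forward(self, cls_weights):
        det_weights = self.encoder(self.in_norm(cls_weights))
        recon = self.decoder(det_weights)
        return det_weights, recon


def total_loss(detection_loss, recon, cls_weights, lambda_rec=1.0):
    """Detection loss (computed elsewhere with the predicted weights) plus a
    reconstruction loss over *all* classes, including novel ones without
    detection annotations."""
    return detection_loss + lambda_rec * nn.functional.mse_loss(recon, cls_weights)


if __name__ == "__main__":
    all_cls_weights = torch.randn(1000, 2048)          # e.g. 1000 pretrained class vectors
    model = AEWTN()
    det_w, recon = model(all_cls_weights)
    dummy_det_loss = torch.tensor(0.0)                 # placeholder for the detector's loss
    loss = total_loss(dummy_det_loss, recon, all_cls_weights)
    print(det_w.shape, recon.shape, float(loss))       # (1000, 1024) (1000, 2048) ...
```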
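The following is a rough sketch of the cross-layer aggregation in DelugeNets, under simplified assumptions: every block's output is kept in a list, and each later block combines all preceding outputs with cheap per-layer, per-channel (depthwise) weights before an ordinary convolution performs the cross-channel mixing. The block composition (a single 3x3 convolution with batch normalization) and the class names are illustrative, not the DelugeNets configuration reported in the thesis.

```python
# Hedged sketch of the cross-layer depthwise aggregation idea behind DelugeNets.
import torch
import torch.nn as nn


class CrossLayerDepthwiseAgg(nn.Module):
    """Per-channel weighted sum over the outputs of `num_inputs` preceding blocks."""
    def __init__(self, num_inputs, channels):
        super().__init__()
        # one scalar weight per (preceding layer, channel): very few parameters
        self.weight = nn.Parameter(torch.full((num_inputs, channels), 1.0 / num_inputs))

    def forward(self, feats):                          # list of [N, C, H, W] tensors
        stacked = torch.stack(feats, dim=0)            # [L, N, C, H, W]
        w = self.weight[:, None, :, None, None]        # [L, 1, C, 1, 1]
        return (stacked * w).sum(dim=0)                # [N, C, H, W]


class DelugeBlock(nn.Module):
    def __init__(self, num_inputs, channels):
        super().__init__()
        self.agg = CrossLayerDepthwiseAgg(num_inputs, channels)
        self.conv = nn.Sequential(                     # cross-channel "heavy lifting"
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(),
        )

    def forward(self, preceding_feats):
        return self.conv(self.agg(preceding_feats))


class TinyDelugeStage(nn.Module):
    """A stage in which block i sees the outputs of all earlier blocks (and the stem)."""
    def __init__(self, num_blocks=4, channels=32):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, 3, padding=1)
        self.blocks = nn.ModuleList(
            DelugeBlock(num_inputs=i + 1, channels=channels) for i in range(num_blocks)
        )

    def forward(self, x):
        feats = [self.stem(x)]
        for block in self.blocks:
            feats.append(block(feats))                 # every block sees all predecessors
        return feats[-1]


if __name__ == "__main__":
    y = TinyDelugeStage()(torch.rand(2, 3, 32, 32))
    print(y.shape)                                     # torch.Size([2, 32, 32, 32])
```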
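Finally, a minimal sketch of the SDPoint mechanism as described above: each (downsampling point, downsampling ratio) pair is one instance, a random instance is drawn during training, and a fixed instance is chosen at inference time to meet a computational budget, all with a single shared set of weights. The backbone, the candidate ratios (0.5 and 0.75), and the sampling scheme are assumptions made for illustration.

```python
# Hedged sketch of SDPoint-style stochastic downsampling with shared weights.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F


class SDPointNet(nn.Module):
    def __init__(self, num_blocks=6, channels=32, num_classes=10, ratios=(0.5, 0.75)):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, 3, padding=1)
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                          nn.BatchNorm2d(channels), nn.ReLU())
            for _ in range(num_blocks)
        )
        self.head = nn.Linear(channels, num_classes)
        # every (downsampling point, ratio) pair is one "SDPoint instance";
        # instance None means no downsampling (the full-cost network)
        self.instances = [None] + [(p, r) for p in range(num_blocks) for r in ratios]

    def forward(self, x, instance="random"):
        if instance == "random":                       # training: sample an instance
            instance = random.choice(self.instances)
        x = self.stem(x)
        for i, block in enumerate(self.blocks):
            if instance is not None and i == instance[0]:
                x = F.interpolate(x, scale_factor=instance[1], mode="bilinear",
                                  align_corners=False)
            x = block(x)                               # the same weights serve every scale
        return self.head(x.mean(dim=(2, 3)))           # global average pooling


if __name__ == "__main__":
    net = SDPointNet()
    imgs = torch.rand(4, 3, 32, 32)
    logits_train = net(imgs)                           # random instance on each call
    logits_cheap = net(imgs, instance=(0, 0.5))        # fixed low-cost instance at test time
    print(logits_train.shape, logits_cheap.shape)
```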