Scene parsing with deep neural networks

Bibliographic Details
Main Author: Ding, Henghui
Other Authors: Jiang Xudong
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2020
Online Access: https://hdl.handle.net/10356/142935
Institution: Nanyang Technological University
Description
Summary: In this thesis, we address the fundamental and challenging task of scene parsing. Scene parsing (also known as semantic segmentation, scene segmentation or scene labeling) aims to classify every pixel of a given image into one of a set of predefined semantic categories, covering not only countable objects (e.g. person, car, cat) but also uncountable stuff (e.g. road, grass, sky). It is a dense prediction task whose output has the same resolution as the input and thus provides much richer clues for scene understanding. As a fundamental task in computer vision, scene parsing is in intense demand for many practical applications, such as automation devices, virtual reality, augmented reality and self-driving vehicles. Scene parsing implicitly involves object recognition, object localization and boundary delineation, which requires multi-scale and multi-level visual recognition, and a robust segmentation approach needs to perform well at all of these implied tasks.

We address scene parsing with deep neural networks and enhance its performance from several aspects. First, we discuss how to aggregate customized context information to enrich the high-level feature representation while keeping local discrimination. Then, we exploit multi-scale and multi-level information to address the large scale variation of objects in scene images. Meanwhile, we discuss how to infer and refine boundary predictions to enhance spatial detail. Furthermore, to segment unseen/unknown objects that never appear in the training set, we build an interactive image segmentation model, which greatly speeds up pixel-level data annotation and thus helps alleviate the lack of large-scale benchmarks.

First, we propose CGBNet, which employs context encoding and multi-path decoding to enhance segmentation in both the encoding and decoding processes. In CGBNet, we first propose a context encoding module that generates context-contrasted local features to exploit both informative context and discriminative local information. This module greatly improves segmentation performance, especially for inconspicuous objects. Furthermore, we propose a scale-selection scheme that selectively fuses the segmentation results from features of different scales at every spatial position, adaptively picking the appropriate score maps from a rich set of scales. To improve segmentation at object boundaries, we further propose a boundary delineation module that encourages location-specific very-low-level features to take part in the final prediction near boundaries and suppresses them away from boundaries.
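The context-contrasted local feature above can be pictured as the difference between a narrow local response and a wide contextual response. The PyTorch fragment below is only a minimal sketch of that idea under assumed design choices (a plain 3x3 convolution for the local branch, a dilated 3x3 convolution for the context branch, and subtraction as the contrast operation); it is not the implementation described in the thesis.

import torch
import torch.nn as nn

class ContextContrastedLocal(nn.Module):
    """Sketch of a context-contrasted local feature block.

    The local branch looks at a small neighbourhood, the context branch
    looks at a much larger (dilated) neighbourhood, and their difference
    highlights how a pixel stands out from its surrounding context.
    """

    def __init__(self, channels: int, dilation: int = 4):
        super().__init__()
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.context = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3,
                      padding=dilation, dilation=dilation),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Contrast the narrow local response against the broad context.
        return self.local(x) - self.context(x)

# Example: a 512-channel backbone feature map at reduced resolution.
feats = torch.randn(1, 512, 64, 64)
ccl = ContextContrastedLocal(512)(feats)   # same spatial size as the input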
Secondly, to further enhance contextual modeling, we propose SVCNet, which builds a customized contextual model for each pixel. Owing to the diverse shapes of objects and their complex layout in scene images, the spatial scales and shapes of the contexts of different objects vary greatly, so aggregating context information from a predefined fixed region is ineffective or inefficient. We therefore propose to generate a scale- and shape-variant semantic mask for each pixel to confine its contextual region. To this end, we first propose a novel paired convolution that infers the semantic correlation of pixel pairs and, based on that, generates a shape mask. Using the inferred spatial scope of the contextual region, we then propose a shape-variant convolution whose receptive field is controlled by the shape mask and thus varies with the appearance of the input. In this way, the proposed network aggregates the context information of a pixel from its semantically correlated region instead of from a predefined fixed region.

Thirdly, we propose a boundary-aware feature propagation module. To increase the feature similarity within the same object while keeping the feature discrimination between different objects, we propagate information throughout the image under the control of object boundaries. To this end, we first learn the boundary as an additional semantic class so that the network becomes aware of the boundary layout. We then propose unidirectional acyclic graphs (UAGs) to model, efficiently and effectively, the function of undirected cyclic graphs (UCGs), which structure the image by building pixel-by-pixel graph connections. Furthermore, we propose a boundary-aware feature propagation (BFP) module that harvests and propagates local features within the regions isolated by the learned boundaries in the UAG-structured image. BFP splits feature propagation into a set of semantic groups by building strong connections within the same segment region but weak connections between different segment regions.

Finally, we build an interactive image segmentation model, PhraseClick, which is semi-automated and aims to accurately segment an image into foreground and background given a minimal amount of user interaction. Existing interactive object segmentation methods mainly take spatial interactions such as bounding boxes or clicks as input. However, these interactions carry no information about the conspicuous attributes of the target of interest and thus cannot quickly specify which target to select, so excessive user interactions are often required to reach the desired result. We propose to employ language phrases as an additional interaction input to infer the attributes of the target object. In this way, we not only leverage spatial clicks to localize the target object but also utilize semantic phrases to describe its attributes: the phrase input focuses on "what" the target object is, while the spatial clicks are in charge of "where" it is (a minimal sketch of one way such inputs could be combined is given after this summary). Moreover, the proposed approach is flexible in terms of interaction modes and can efficiently handle complex scenarios by leveraging the strengths of each type of input.

In summary, with the four proposed segmentation models (CGBNet, SVCNet, BFP and PhraseClick), we enhance scene parsing from different aspects: contextual modeling, multi-scale and multi-level information aggregation, boundary refinement, and segmentation of unseen/unknown objects. We achieve new state-of-the-art segmentation performance on public benchmarks.
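To make the "what"/"where" division above concrete, the fragment below sketches one plausible way a pair of click maps and a phrase could be combined with visual features in PyTorch. It is not the PhraseClick implementation; the word-averaged phrase encoder, the channel-wise attention fusion, and all layer sizes are assumptions made purely for illustration.

import torch
import torch.nn as nn

class PhraseClickFusion(nn.Module):
    """Sketch of fusing click maps ("where") with a phrase ("what")."""

    def __init__(self, vis_channels: int = 256, vocab_size: int = 1000,
                 embed_dim: int = 128):
        super().__init__()
        # Click heatmaps (positive/negative clicks) enter as two extra
        # channels concatenated to the visual features.
        self.spatial = nn.Conv2d(vis_channels + 2, vis_channels,
                                 kernel_size=3, padding=1)
        # The phrase is encoded by averaging word embeddings and mapped
        # to channel-wise attention weights over the visual features.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attend = nn.Sequential(
            nn.Linear(embed_dim, vis_channels), nn.Sigmoid())
        self.head = nn.Conv2d(vis_channels, 1, kernel_size=1)

    def forward(self, feats, click_maps, phrase_tokens):
        # "Where": inject the click maps into the visual features.
        x = self.spatial(torch.cat([feats, click_maps], dim=1))
        # "What": turn the phrase into per-channel weights.
        weights = self.attend(self.embed(phrase_tokens).mean(dim=1))
        x = x * weights[:, :, None, None]
        # Per-pixel foreground/background logits.
        return self.head(x)

# Example usage with dummy inputs.
feats = torch.randn(1, 256, 64, 64)    # backbone features
clicks = torch.zeros(1, 2, 64, 64)     # positive/negative click maps
clicks[0, 0, 30, 30] = 1.0             # one positive click
phrase = torch.tensor([[5, 42, 7]])    # token ids for a short phrase
mask_logits = PhraseClickFusion()(feats, clicks, phrase)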