Exploring versatile neural architectures across modalities and perception tasks
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University, 2023
Online Access: https://hdl.handle.net/10356/171935
Institution: Nanyang Technological University
Summary: Humans perceive the world not only with their eyes but also with their ears and skin, and they understand not only the static structure of a scene but also its dynamic changes. Machines likewise perceive the world through multiple sensors, such as cameras and Light Detection and Ranging (LiDAR). Meanwhile, multiple computer vision tasks have been formulated for different applications and goals, including object detection, segmentation, and tracking. In the deep learning era, there has been remarkable progress on each individual perception task with single-modality data; yet universal neural architectures that solve multiple tasks across multiple modalities remain rare and are limited to specific tasks and modalities.
The lack of universal neural architectures divides real-world perception systems, e.g., autonomous driving stacks, into multiple components, making them complex, inefficient, and error-prone. Therefore, this thesis explores versatile neural architectures and systems that uniformly and effectively tackle different modalities and perception tasks, in pursuit of a simpler perception system with improved efficiency and robustness.
Toward this goal, this thesis first presents a unified paradigm across modalities that uniformly solves different segmentation tasks, including semantic, instance, and panoptic segmentation. The proposed framework, named K-Net, segments both instances and semantic categories consistently by a group of learnable kernels, where each kernel is responsible for generating a mask for a potential instance or a stuff class. To remedy the difficulty of distinguishing various instances, we propose a kernel update strategy that makes each kernel dynamic and conditioned on its meaningful group in the input image. K-Net can be trained in an end-to-end manner with bipartite matching, and its training and inference are naturally NMS-free and box-free. Owing to its simplicity, K-Net can be effectively applied to segmentation tasks on both image and point cloud data and obtains state-of-the-art performance. On image segmentation tasks, K-Net surpasses all previous state-of-the-art single-model results of panoptic segmentation on MS COCO and semantic segmentation on ADE20K. Its instance segmentation performance is also on par with Cascade Mask R-CNN on MS COCO, with 60%-90% faster inference speed.
On point cloud segmentation tasks, K-Net surpasses all previous state-of-the-art results of panoptic segmentation on the nuScenes and SemanticKITTI datasets. When applied to video data, it also obtains a superior speed-accuracy trade-off by tracking objects with discriminative kernel features.
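The kernel-based mask generation and kernel update described above can be illustrated with a short PyTorch-style sketch. The tensor shapes, the GRU-based update, and the soft-assignment pooling below are illustrative assumptions rather than the exact K-Net implementation.

```python
# Toy kernel-based segmentation head in the spirit of K-Net (not the thesis code).
import torch
import torch.nn as nn


class KernelSegHead(nn.Module):
    def __init__(self, num_kernels: int = 100, channels: int = 256):
        super().__init__()
        # One learnable kernel per potential instance or stuff class.
        self.kernels = nn.Parameter(torch.randn(num_kernels, channels))
        # Kernel update: condition each kernel on the features gathered by its current mask.
        self.update = nn.GRUCell(channels, channels)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) dense features from any backbone (image or point-cloud BEV).
        B, C, H, W = feats.shape
        kernels = self.kernels.unsqueeze(0).expand(B, -1, -1)          # (B, N, C)
        masks = torch.einsum('bnc,bchw->bnhw', kernels, feats)         # initial mask logits

        # Pool the features softly assigned to each kernel (its "group" in this image).
        assign = masks.softmax(dim=1)                                   # per-pixel soft assignment
        group_feats = torch.einsum('bnhw,bchw->bnc', assign, feats)     # (B, N, C)

        # Make each kernel dynamic and conditioned on its group, then predict refined masks.
        updated = self.update(group_feats.reshape(-1, C),
                              kernels.reshape(-1, C)).view(B, -1, C)
        return torch.einsum('bnc,bchw->bnhw', updated, feats)           # refined mask logits


# Example: 100 kernels produce 100 candidate masks for a 1x256x64x64 feature map.
head = KernelSegHead()
print(head(torch.randn(1, 256, 64, 64)).shape)  # torch.Size([1, 100, 64, 64])
```

During training, K-Net matches the predicted masks to ground truth with bipartite matching, which is why the pipeline needs neither NMS nor boxes.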
Besides unifying multiple tasks, this thesis further explores versatile architectures for multiple modalities, which are common in autonomous vehicles, to improve the reliability and accuracy of the system. Most previous approaches for multi-sensor multi-object tracking (MOT) either lack reliability, by relying tightly on a single input source (e.g., the center camera), or sacrifice accuracy, by fusing the results from multiple sensors in post-processing without fully exploiting the inherent information. We design a generic sensor-agnostic multi-modality MOT framework (mmMOT), where each modality (i.e., each sensor) can perform its role independently to preserve reliability and further improve its accuracy through a novel multi-modality fusion module. The proposed mmMOT can be trained in an end-to-end manner, enabling joint optimization of the base feature extractor of each modality and an adjacency estimator across modalities. Our mmMOT also makes the first attempt to encode a deep representation of the point cloud in the data association process of MOT. We conduct extensive experiments to evaluate the effectiveness of the proposed framework on the challenging KITTI benchmark and report state-of-the-art performance.
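As a rough illustration of the sensor-agnostic design, the sketch below keeps one feature head per modality, fuses whichever modalities are present, and scores pairwise adjacency between detections of consecutive frames. The module names, the mean-based fusion, and the MLP adjacency estimator are assumptions for illustration, not the fusion module proposed in the thesis.

```python
# Hedged sketch of a sensor-agnostic fusion and association step in the spirit of mmMOT.
import torch
import torch.nn as nn


class FusionMOT(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.img_proj = nn.Linear(dim, dim)   # image-branch feature head
        self.pts_proj = nn.Linear(dim, dim)   # point-cloud-branch feature head
        # Adjacency estimator: scores whether two detections (previous/current frame) match.
        self.adjacency = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def fuse(self, img_feat=None, pts_feat=None):
        # Each modality can work alone (reliability) or be fused when both are present.
        feats = []
        if img_feat is not None:
            feats.append(self.img_proj(img_feat))
        if pts_feat is not None:
            feats.append(self.pts_proj(pts_feat))
        return torch.stack(feats).mean(dim=0)     # (N, dim) fused detection features

    def associate(self, prev_feats, curr_feats):
        # Pairwise adjacency scores between M previous and N current detections.
        M, N = prev_feats.size(0), curr_feats.size(0)
        pairs = torch.cat([prev_feats.unsqueeze(1).expand(M, N, -1),
                           curr_feats.unsqueeze(0).expand(M, N, -1)], dim=-1)
        return self.adjacency(pairs).squeeze(-1)  # (M, N) matching scores
```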
The success of a versatile neural architecture for multiple modalities relies not only on a simple and universal design of deep neural networks but also on a suitable training strategy that overcomes the shortage of multi-modality data, since accurately aligned multi-modality data and the corresponding annotations can be hard to collect. For example, in 3D object detection, it has been observed, counter-intuitively, that multi-modality methods based on point clouds and images perform only marginally better, or sometimes worse, than approaches that solely use point clouds. We investigate the reason behind this phenomenon and contribute a versatile pipeline, named transformation flow, that allows a richer set of single-modality augmentations to be applied in multi-modality augmentation. We further present a new augmentation technique, Multi-mOdality Cut and pAste (MoCa), that cuts and pastes multi-modality patches of objects in a scene to enrich the training data. The proposed method achieves new state-of-the-art performance on the nuScenes dataset and competitive performance on the KITTI 3D benchmark.
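A heavily simplified NumPy sketch of the cut-and-paste idea follows. The axis-aligned box test, the fixed 2D patch location, and the function names are assumptions; the thesis additionally coordinates geometric augmentations across modalities through the transformation flow and handles occlusion, which this sketch omits.

```python
# Hedged sketch of multi-modality cut-and-paste in the spirit of MoCa (not the thesis code).
import numpy as np


def points_in_box(points: np.ndarray, box: np.ndarray) -> np.ndarray:
    # points: (N, 3); box: axis-aligned (x_min, y_min, z_min, x_max, y_max, z_max).
    return np.all((points >= box[:3]) & (points <= box[3:]), axis=1)


def moca_paste(src_points, src_box, src_img, src_patch_xyxy, dst_points, dst_img):
    # 1) Cut the object's LiDAR points from the source scene.
    obj_points = src_points[points_in_box(src_points, src_box)]
    # 2) Cut the corresponding image patch (the 2D box projected from the 3D box).
    x1, y1, x2, y2 = src_patch_xyxy
    patch = src_img[y1:y2, x1:x2].copy()
    # 3) Paste both into the target scene, keeping the point/pixel correspondence.
    dst_points = np.concatenate([dst_points, obj_points], axis=0)
    dst_img = dst_img.copy()
    dst_img[y1:y2, x1:x2] = patch
    return dst_points, dst_img
```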
Besides strong data augmentations, unsupervised learning that leverages unlabeled data is a promising new trend that can unleash the power of versatile architectures given limited data annotations. This thesis further presents Dense Siamese Network (DenseSiam), a simple unsupervised learning framework for dense prediction tasks. It learns visual representations by maximizing the similarity between two views of one image under two types of consistency, i.e., pixel consistency and region consistency. DenseSiam benefits from the simple Siamese network and shows that simple location correspondence and interacted region embeddings are effective enough to learn the similarity. We apply DenseSiam on ImageNet and obtain competitive improvements on various downstream tasks. We also show that, with only a few extra task-specific losses, the simple framework can directly perform dense prediction tasks. On an existing unsupervised semantic segmentation benchmark, it surpasses state-of-the-art segmentation methods by 2.1 mIoU with only 28% of the training cost.
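The pixel-consistency term can be sketched as a symmetric negative cosine similarity between aligned dense features of the two views, following common Siamese-network practice. The function below is an illustrative assumption: it omits the projector/predictor architectures and the region-consistency branch.

```python
# Hedged sketch of a pixel-consistency loss in the spirit of DenseSiam.
import torch
import torch.nn.functional as F


def pixel_consistency(p1, z2, p2, z1):
    # p*, z*: (B, C, H, W) dense predictor and projector outputs of the two views,
    # already aligned so the same spatial index corresponds to the same image location.
    def neg_cos(p, z):
        z = z.detach()                      # stop-gradient on the target branch
        p = F.normalize(p, dim=1)
        z = F.normalize(z, dim=1)
        return -(p * z).sum(dim=1).mean()   # average negative cosine over pixels
    return 0.5 * neg_cos(p1, z2) + 0.5 * neg_cos(p2, z1)
```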
Lastly, an efficient and general system that supports multi-modality perception algorithms across tasks is indispensable for the success of the unified paradigms, effective training strategies, and promising learning paradigms above. This thesis presents MMDetection3D, an open-source library that implements a rich set of 3D perception algorithms across modalities and datasets. Built upon MMDetection, MMDetection3D inherits similar abstract encapsulations so that it naturally integrates the algorithms and modules in MMDetection, allowing common and universal modules to be shared across 2D and 3D. To provide a general and easy-to-extend training system that supports various models and datasets in 3D, MMDetection3D further implements abstract data structures for multi-modality data with various coordinate systems and sensor combinations. MMDetection3D provides the fastest training speed across algorithms among open-source libraries for 3D perception.
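As a usage illustration, the snippet below loads a 3D detector from a config and checkpoint and runs it on a point cloud file. The function names follow the library's demo scripts at the time of writing but may differ between versions, and all file paths are placeholders rather than artifacts of this thesis.

```python
# Hedged usage sketch for MMDetection3D; API names follow the library's demo scripts
# and may differ between versions. Config, checkpoint, and point cloud paths are placeholders.
from mmdet3d.apis import inference_detector, init_model

config_file = 'configs/some_lidar_detector_config.py'    # placeholder path
checkpoint_file = 'checkpoints/some_lidar_detector.pth'  # placeholder path

model = init_model(config_file, checkpoint_file, device='cuda:0')
# The return format of inference_detector varies across versions; we only print it here.
result = inference_detector(model, 'path/to/points.bin')  # placeholder point cloud file
print(result)
```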