LiDAR-based 3D object detection and tracking for autonomous driving

Bibliographic Details
Main Author: Luo, Zhipeng
Other Authors: Lu, Shijian
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2024
Subjects:
Online Access: https://hdl.handle.net/10356/174239
Institution: Nanyang Technological University
Description
Summary: LiDAR-based 3D perception algorithms have attracted increasing attention due to their immense potential in applications such as autonomous driving and robotic vision. Point clouds collected by LiDAR sensors provide accurate 3D coordinates, making them a natural fit for producing accurate 3D predictions. However, unlike the commonly used image data, point clouds are unordered and often suffer from point sparsity in remote regions and a lack of textural information. These properties mean that well-studied image-based algorithms cannot be applied to point clouds directly, which motivates us to explore effective solutions. In the context of autonomous driving, 3D perception covers a wide range of tasks, such as object detection, segmentation, object tracking, motion forecasting, and occupancy prediction. This thesis delves into two fundamental LiDAR-based perception tasks: object detection and object tracking.

For 3D object detection, one major issue is that annotating large numbers of 3D bounding boxes is laborious and costly, which limits broader application. It is therefore desirable to reuse models trained on large-scale open-source datasets for various purposes. However, 3D detectors trained on one dataset often suffer significant performance degradation when applied to another scenario. In Chapter 3, we first investigate the major underlying factors of the domain gap in 3D detection. Our key insight is that geometric mismatch is a major cause of domain shift. Unlike existing methods that leverage the statistical information of the source and target domains to mitigate the mismatch, we propose a novel and unified framework, the Multi-Level Consistency Network (MLC-Net), which employs a teacher-student paradigm to generate adaptive and reliable pseudo-targets. MLC-Net exploits point-, instance- and neural-statistics-level consistency to facilitate effective cross-domain transfer.

Another challenge of point cloud-based object detection is point sparsity, an inherent limitation of LiDAR sensors. While consecutive frames naturally yield more complete views, most existing methods focus on single point cloud frames without harnessing the temporal information in point cloud sequences. In Chapter 4, we design TransPillars, a novel transformer-based feature aggregation technique that exploits the temporal features of consecutive point cloud frames for multi-frame 3D object detection. TransPillars aggregates spatio-temporal point cloud features from two perspectives. First, it fuses voxel-level features directly from multi-frame feature maps instead of pooled instance features, preserving instance details and contextual information that are essential for accurate object localization. Second, it introduces a hierarchical coarse-to-fine strategy that fuses multi-scale features progressively to capture the motion of moving objects and guide the aggregation of fine features. In addition, a variant of deformable attention is introduced to improve the effectiveness of cross-frame feature matching.

For object tracking, given a point cloud sequence and the initial position of a target object, the objective is to predict the location and orientation of this object in subsequent frames. In Chapter 5, we propose the Point Tracking TRansformer (PTTR), which efficiently predicts high-quality 3D tracking results in a coarse-to-fine manner with the help of transformer operations. PTTR incorporates Relation-Aware Sampling and Point Relation Transformer modules for effective feature extraction and matching. Moreover, we propose PTTR++, which exploits the complementary effect of point-wise and Bird's-Eye View (BEV) representations to further enhance tracking performance.
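As a rough illustration of the transformer-based template-search matching that trackers like PTTR build on, the minimal PyTorch sketch below lets search-region point features attend to template point features and regresses a coarse box that a finer stage could refine; the module names, the simple linear encoder, and the dimensions are illustrative assumptions, not the thesis implementation.

import torch

class TemplateSearchMatcher(torch.nn.Module):
    """Toy template-search matcher; a stand-in for coarse transformer-based matching."""

    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.encoder = torch.nn.Linear(3, feat_dim)          # stand-in point feature extractor
        self.cross_attn = torch.nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.coarse_head = torch.nn.Linear(feat_dim, 7)       # (x, y, z, w, l, h, yaw)

    def forward(self, template_pts: torch.Tensor, search_pts: torch.Tensor) -> torch.Tensor:
        t_feat = self.encoder(template_pts).unsqueeze(0)      # (1, Nt, C) template features
        s_feat = self.encoder(search_pts).unsqueeze(0)        # (1, Ns, C) search-region features
        fused, _ = self.cross_attn(s_feat, t_feat, t_feat)    # search points query the template
        return self.coarse_head(fused.mean(dim=1)).squeeze(0)  # coarse box for later refinement

matcher = TemplateSearchMatcher()
coarse_box = matcher(torch.randn(256, 3), torch.randn(1024, 3))  # 256 template / 1024 search points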
In Chapter 6, we leverage the previously overlooked long-range continuous motion of objects in 3D space and propose a novel tracking approach that treats each tracklet as a continuous stream: at each timestamp, only the current frame is fed into the network, where it interacts with multi-frame historical features stored in a memory bank, enabling efficient exploitation of sequential information. To achieve effective cross-frame message passing, a hybrid attention mechanism is designed to account for both long-range relation modeling and local geometric feature extraction. Furthermore, to enhance the utilization of multi-frame features for robust tracking, a contrastive sequence enhancement strategy is designed, which uses ground-truth tracklets to augment training sequences and promote discrimination against false positives in a contrastive manner.
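To make the tracklet-as-a-stream idea of Chapter 6 concrete, the following minimal sketch encodes only the current frame at each step and lets it attend to a bounded memory bank of historical frame features before predicting a box; again, the names, the toy encoder, and the plain cross-attention used here are assumptions for illustration rather than the thesis code.

from collections import deque

import torch

class StreamingTracker(torch.nn.Module):
    """Toy streaming tracker: current-frame features query a rolling memory bank."""

    def __init__(self, feat_dim: int = 64, memory_size: int = 3):
        super().__init__()
        self.encoder = torch.nn.Linear(3, feat_dim)              # stand-in point feature extractor
        self.cross_attn = torch.nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.head = torch.nn.Linear(feat_dim, 7)                 # (x, y, z, w, l, h, yaw)
        self.memory = deque(maxlen=memory_size)                  # bounded memory bank

    def forward(self, frame_pts: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(frame_pts).unsqueeze(0)             # (1, N, C) current-frame features
        if self.memory:                                          # cross-frame message passing
            bank = torch.cat(list(self.memory), dim=1)           # (1, M, C) historical features
            feats, _ = self.cross_attn(feats, bank, bank)        # current frame queries the memory
        self.memory.append(feats.detach())                       # store features for later frames
        return self.head(feats.mean(dim=1)).squeeze(0)           # box prediction for this frame

tracker = StreamingTracker()
for _ in range(5):                                               # a short stream of frames
    box = tracker(torch.randn(128, 3))                           # 128 points in the current frame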