Motion estimation and prediction from 3D point clouds

Bibliographic Details
Main Author: Li, Ruibo
Other Authors: Lin, Guosheng
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University, 2024
Online Access: https://hdl.handle.net/10356/173430
Institution: Nanyang Technological University
Description
Summary: Understanding the motion of dynamic environments holds significant benefits for various applications, including robotics and autonomous driving. Scene flow estimation from 3D point clouds, which outputs a per-point 3D motion field between two consecutive time steps, has garnered increasing attention. Although deep learning-based scene flow models have shown promising results, how to capture motion from sparse and irregular point cloud data remains an open question. Furthermore, supervised training of these models demands substantial training data with scene flow annotations, which is both scarce and expensive to collect. To reduce the reliance on scene flow annotations, self-supervised scene flow estimation has emerged as a viable solution, where no annotations are required during training. Apart from scene flow estimation from known point clouds, motion prediction, which generates the future positions of point clouds based on past observations, is another active research topic and plays a vital role in path planning and navigation. However, supervised motion prediction methods still rely on abundant motion annotations, while the performance of current self-supervised methods is far from satisfactory. Motion prediction in a weakly supervised manner is therefore a promising avenue for balancing annotation effort against model performance. In this thesis, we study motion estimation and prediction under three learning paradigms: fully supervised scene flow estimation, self-supervised scene flow estimation, and weakly supervised motion prediction.

In fully supervised scene flow estimation, earlier methods treat the task as a per-point regression problem and overlook the potentially rigid motion of local regions. To tackle this limitation, in Chapter 3 we design a new scene flow estimation framework, HCRF-Flow, that effectively integrates the capabilities of deep neural networks (DNNs) and conditional random fields (CRFs). HCRF-Flow contains two components: a DNN-based flow estimation module that performs per-point motion regression, and a new continuous high-order CRF module that refines the per-point motion predictions by enforcing point-wise smoothness and region-wise rigidity. By leveraging the two components in unison, HCRF-Flow demonstrates superior performance compared to previous methods.

In self-supervised scene flow estimation, where scene flow annotations are unavailable, building correspondences between two consecutive point clouds to approximate the scene flow between them has been shown to be a feasible approach. However, previous methods commonly rely on point-wise matching that considers only the distance between 3D point coordinates when obtaining correspondences. This yields two issues: (1) it ignores other discriminative cues, and (2) the matching process is unconstrained, which may lead to a many-to-one matching problem. To tackle these issues, in Chapter 4 we generate pseudo scene flow with an optimal transport module that incorporates 3D coordinates, colors, and surface normals as measures and explicitly enforces one-to-one matching. In addition, we design a refinement module that further improves the pseudo scene flow labels by enforcing point-wise smoothness via a random walk algorithm; illustrative sketches of both steps follow below. Although this method demonstrates promising performance, the point matching it employs tends to ignore the potentially structured motion within local regions and consequently generates inaccurate pseudo labels.
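To make the Chapter 4 matching idea concrete, here is a minimal sketch of pseudo scene flow generation via entropy-regularized optimal transport. Everything in it (the name pseudo_flow_via_ot, the uniform marginals, the Sinkhorn solver, and the parameters eps and n_iters) is an illustrative assumption rather than the thesis's actual implementation; it only shows how a cost built from coordinates plus extra cues, combined with marginal constraints, discourages many-to-one matches.

```python
import numpy as np

def pseudo_flow_via_ot(xyz1, xyz2, feat1, feat2, eps=0.05, n_iters=50):
    """Sketch: pseudo scene flow from entropy-regularized optimal transport.

    xyz1 (N, 3) / xyz2 (M, 3): point coordinates of two frames.
    feat1 (N, D) / feat2 (M, D): extra cues such as colors and normals.
    Sinkhorn iterations with uniform marginals give a (soft) one-to-one
    transport plan instead of unconstrained nearest-neighbour matching.
    """
    # Pairwise cost mixing coordinate and feature distances.
    c_xyz = ((xyz1[:, None, :] - xyz2[None, :, :]) ** 2).sum(-1)
    c_feat = ((feat1[:, None, :] - feat2[None, :, :]) ** 2).sum(-1)
    C = c_xyz + c_feat
    C = C / C.max()                            # normalize for stability

    K = np.exp(-C / eps)                       # Gibbs kernel
    a = np.full(len(xyz1), 1.0 / len(xyz1))    # uniform source marginal
    b = np.full(len(xyz2), 1.0 / len(xyz2))    # uniform target marginal
    u = np.ones_like(a)
    for _ in range(n_iters):                   # Sinkhorn scaling
        v = b / (K.T @ u)
        u = a / (K @ v)
    T = u[:, None] * K * v[None, :]            # transport plan (N, M)

    # Barycentric correspondences: T-weighted average of frame-2 points.
    matched = (T / T.sum(1, keepdims=True)) @ xyz2
    return matched - xyz1                      # pseudo flow labels
```

Run on two small random clouds this returns an (N, 3) flow field; the one-to-one pressure comes entirely from the marginal constraints of the transport plan.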
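The smoothness refinement can similarly be read as diffusing flow over a neighbourhood graph. The sketch below, again with assumed names and parameters (random_walk_refine, k, alpha, n_steps), shows one simple random-walk-style smoothing with restart to the initial estimate; the thesis's exact formulation may differ.

```python
import numpy as np

def random_walk_refine(xyz, flow, k=8, alpha=0.8, n_steps=10):
    """Sketch: refine per-point flow by diffusing it over a kNN graph.

    Each step mixes a point's flow with the mean flow of its k nearest
    neighbours (the walk) and with the initial estimate (the restart),
    smoothing out isolated outlier matches.
    """
    d = ((xyz[:, None, :] - xyz[None, :, :]) ** 2).sum(-1)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]     # skip self at column 0
    refined = flow.copy()
    for _ in range(n_steps):
        neighbour_mean = refined[nn].mean(axis=1)
        refined = alpha * neighbour_mean + (1 - alpha) * flow
    return refined
```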
Inspired by the local rigidity assumption, in Chapter 5 we propose to generate pseudo labels by piecewise rigid motion estimation. Specifically, after splitting the first point cloud into local regions, a piecewise pseudo label generation module explicitly encourages region-wise rigid alignment between the two point clouds, which in turn yields rigid pseudo labels for each region (a rigid-fit sketch is given after this summary). Experimental results show that our method attains state-of-the-art performance in self-supervised scene flow learning.

For weakly supervised motion prediction, in Chapter 6 we design a new weakly supervised learning paradigm in which fully or partially annotated foreground/background (FG/BG) masks are used for supervision instead of expensive motion data. To this end, we design a two-stage weakly supervised motion prediction framework. In Stage 1, we train an FG/BG segmentation network using the partially annotated masks. In Stage 2, we train a motion prediction network in a self-supervised manner: the segmentation network from Stage 1 extracts the foreground points of the training data, and the motion prediction network is then trained self-supervised on these foreground points. Experiments demonstrate that our weakly supervised models, using FG/BG masks as weak supervision, outperform self-supervised models and achieve performance comparable to some supervised models. To the best of our knowledge, we are the first to study motion prediction in a weakly supervised manner.
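As a hedged illustration of the piecewise rigid idea in Chapter 5, the sketch below fits one rigid transform per region with the classical Kabsch algorithm and uses its displacement as the rigid pseudo flow label. The function names (rigid_fit, piecewise_rigid_pseudo_labels) and the assumption that per-point correspondences are already available (e.g. from a matching step like the one sketched earlier) are illustrative, not taken from the thesis.

```python
import numpy as np

def rigid_fit(src, dst):
    """Least-squares rigid transform (R, t) aligning src to dst (Kabsch)."""
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)        # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c_dst - R @ c_src
    return R, t

def piecewise_rigid_pseudo_labels(xyz1, regions, corres):
    """Sketch: one rigid transform per region; its displacement is the label.

    regions (N,): integer region id per point of the first frame.
    corres (N, 3): an estimated correspondence in the second frame per
    point (assumed given here, e.g. from a matching step).
    """
    flow = np.zeros_like(xyz1)
    for r in np.unique(regions):
        m = regions == r                       # points of this region
        R, t = rigid_fit(xyz1[m], corres[m])   # needs >= 3 points per region
        flow[m] = xyz1[m] @ R.T + t - xyz1[m]  # rigid displacement
    return flow
```

Fitting a single (R, t) per region is what makes the resulting pseudo labels rigid by construction, in contrast to the per-point matching of the previous sketch.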