Semantic scene understanding on 3D point clouds
Main Author:
Other Authors:
Format: Thesis - Doctor of Philosophy
Language: English
Published: Nanyang Technological University, 2025
Subjects:
Online Access: https://hdl.handle.net/10356/182101
Institution: Nanyang Technological University
Summary:

With the rapid development of industry and intelligent systems, semantic scene understanding has become essential for robotic vision in smart manufacturing. Robots have significantly advanced modern manufacturing by enabling high-quality, efficient production, extended operating durations, and work in hazardous environments. Robotic techniques have automated many processes on production lines. However, in flexible production scenarios, certain tasks cannot yet be fully handled by robots and still require human involvement. This limitation is usually caused by robots' lack of semantic understanding of the target objects in the working environment.
Developing visual scene understanding techniques can enable robots to accurately recognize and localize objects or regions in visual scenes at the pixel level. These techniques greatly enhance the capability and flexibility of robots in the manufacturing industry and in various general robotic applications. Consequently, human effort in the production pipeline can be largely replaced by robots with visual understanding capabilities.
This research mainly focuses on the task of 3D instance segmentation, which aims to predict both a semantic and an instance label for each point in a point cloud. This is a fundamental and challenging task for scene understanding, with a variety of real-world applications such as indoor robotics, autonomous driving, drones, and AR/VR devices.
We propose five novel methods: one fully supervised method, two weakly supervised methods, one zero-shot method, and an augmentation method that enhances model generalization. In Chapter 3, we propose a novel proposal-free, fully supervised method, the Regional Purity Guide Network (RPGN). We define a novel concept of regional purity, which encodes instance-aware contextual information of the surrounding region. We also propose a pretraining pipeline for learning regional purity and design rules to generate random toy scenes by extracting samples from existing training data. Using regional purity can simultaneously prevent under-segmentation and over-segmentation during clustering.
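The abstract does not spell out the exact formula for regional purity, but one natural reading is a per-point score measuring how homogeneous the instance labels are in a local neighbourhood. Below is a minimal sketch under that assumption; the radius, the neighbourhood definition, and the `regional_purity` helper are illustrative, not the thesis implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def regional_purity(points, instance_labels, radius=0.1):
    """Illustrative purity score: for each point, the fraction of its spatial
    neighbours (within `radius`) that share its instance label. Values near 1
    mark instance interiors; lower values mark contact regions between objects."""
    tree = cKDTree(points)
    purity = np.zeros(len(points))
    for i, neighbours in enumerate(tree.query_ball_point(points, r=radius)):
        same = instance_labels[neighbours] == instance_labels[i]
        purity[i] = same.mean()
    return purity

# Toy usage: two touching "objects" split along the x-axis.
pts = np.random.rand(2000, 3)
inst = (pts[:, 0] > 0.5).astype(int)
print(regional_purity(pts, inst, radius=0.05)[:10])
```

Under this reading, points deep inside an object score close to 1 while points near the boundary between adjacent objects score lower, which is the kind of signal that can keep a clustering step from merging neighbouring instances (under-segmentation) or splitting a single instance apart (over-segmentation).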
Although scene understanding has achieved remarkable success with deep learning techniques, it remains largely unsolved. One critical bottleneck is the significant human effort required for pixel-level labeling. To address this issue, in Chapter 4 we propose a novel weakly supervised method, RWSeg, that requires only a single labeled point per object. Using these sparse weak labels, we introduce a unified framework with two branches that propagate semantic and instance information to unannotated regions, leveraging self-attention and random walks. Furthermore, we propose a Cross-graph Competing Random Walks (CGCRW) algorithm that encourages competition among different instance graphs to resolve ambiguities between closely positioned objects and improve instance assignment.
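As a rough illustration of how sparse point annotations can be spread across a point cloud, the sketch below runs a clamped random-walk (label-propagation-style) diffusion on a k-NN graph. This is a generic baseline, not the RWSeg architecture or the CGCRW competition mechanism; the graph construction, step count, and function name are assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def random_walk_propagation(points, seed_labels, k=8, n_steps=50):
    """Minimal sketch: propagate sparse point labels over a k-NN graph by
    iterated random-walk averaging. seed_labels == -1 means unannotated."""
    n = len(points)
    classes = np.unique(seed_labels[seed_labels >= 0])
    # Row-stochastic transition matrix over the k nearest neighbours.
    _, knn = cKDTree(points).query(points, k=k + 1)
    P = np.zeros((n, n))
    for i in range(n):
        P[i, knn[i, 1:]] = 1.0 / k  # skip the point itself
    # One-hot seed distribution; seeds are clamped after every step.
    F = np.zeros((n, len(classes)))
    seed_mask = seed_labels >= 0
    F[seed_mask, np.searchsorted(classes, seed_labels[seed_mask])] = 1.0
    seeds = F[seed_mask].copy()
    for _ in range(n_steps):
        F = P @ F
        F[seed_mask] = seeds
    return classes[F.argmax(axis=1)]
```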
In Chapter 5, we propose the first weakly supervised 3D instance segmentation method that needs only categorical semantic labels as supervision, without any instance-level labels. Even without instance-level ground truth, we design an approach that breaks point clouds into raw fragments and finds the most confident samples for learning instance centroids. In addition, we build a recomposed dataset to learn our multilevel shape-aware objectness signal. An asymmetrical object inference algorithm then processes core points and boundary points with different strategies and generates high-quality pseudo instance labels to guide iterative training.
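The abstract does not describe how the raw fragments are formed; one simple assumed reading is connected-component grouping within each semantic class, sketched below. The radius-graph construction and the `semantic_fragments` helper are hypothetical, and the confidence-based centroid mining from the thesis is omitted.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial import cKDTree

def semantic_fragments(points, semantic_labels, radius=0.05):
    """Illustrative fragment extraction: within each semantic class, group
    points into connected components of a fixed-radius neighbourhood graph."""
    fragments = np.full(len(points), -1)
    next_id = 0
    for cls in np.unique(semantic_labels):
        idx = np.where(semantic_labels == cls)[0]
        tree = cKDTree(points[idx])
        pairs = tree.query_pairs(r=radius, output_type="ndarray")
        n = len(idx)
        adj = csr_matrix((np.ones(len(pairs)), (pairs[:, 0], pairs[:, 1])),
                         shape=(n, n))
        _, comp = connected_components(adj, directed=False)
        fragments[idx] = comp + next_id
        next_id += comp.max() + 1
    return fragments
```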
In the current era of large foundation models, expansive vision models trained on vast and diverse datasets can perform zero-shot segmentation on previously unseen data. In Chapter 6, we investigate leveraging various 2D foundation models to address 3D segmentation tasks. Our approach begins by generating initial 2D semantic mask predictions with diverse large foundation models. These mask predictions, obtained from different frames of RGB-D video sequences, are then projected into 3D space. To produce robust 3D semantic pseudo labels, we introduce a semantic label fusion strategy that combines all results through voting. Our investigation covers several scenarios, including zero-shot learning and limited guidance from sparse 2D point labels, allowing us to evaluate the strengths and limitations of different vision foundation models.
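The 2D-to-3D lifting and voting-based fusion can be pictured with a standard pinhole back-projection and per-point majority voting, as in the sketch below. The exact frame selection, mask sources, and fusion weighting used in Chapter 6 may differ, and both helper functions are illustrative.

```python
import numpy as np

def backproject_labels(depth, label_map, K, cam_to_world):
    """Lift a per-pixel 2D label map into world-space 3D points using the
    depth image, camera intrinsics K, and a 4x4 camera-to-world pose."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.reshape(-1)
    valid = z > 0
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)[valid]
    pts_world = (cam_to_world @ pts_cam.T).T[:, :3]
    return pts_world, label_map.reshape(-1)[valid]

def fuse_by_voting(point_labels_per_frame, num_points, num_classes):
    """Majority vote over per-frame hypotheses: each frame contributes
    (point_index, label) pairs; the most frequently voted class wins."""
    votes = np.zeros((num_points, num_classes), dtype=np.int64)
    for indices, labels in point_labels_per_frame:
        np.add.at(votes, (indices, labels), 1)
    fused = votes.argmax(axis=1)
    fused[votes.sum(axis=1) == 0] = -1  # never-observed points stay unlabeled
    return fused
```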
Data augmentation is essential in deep learning for improving model generalization and robustness. Standard methods such as rotations and flips are common but often lack high-level diversity. In Chapter 7, we explore a novel approach to automatically generate labeled 3D training data. Using diffusion models and ChatGPT-generated text prompts, we generate diverse 2D images of single objects with various structures and appearances. Beyond texture augmentation, our method automatically alters object shapes within these images. The augmented images are then transformed into 3D objects, and virtual scenes are constructed through random composition. This approach efficiently produces a substantial amount of 3D scene data without relying on real data, offering significant advantages for few-shot learning and for mitigating long-tailed class imbalance. Our work enhances 3D data diversity and advances model capabilities in scene understanding tasks.
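A toy version of the random scene composition step might look like the following: generated, labeled object point clouds are dropped onto a floor plane at non-overlapping random locations, so every scene point carries semantic and instance labels for free. The `object_bank` format, the rejection-sampling placement, and the axis-aligned overlap test are assumptions for illustration only.

```python
import numpy as np

def compose_scene(object_bank, num_objects=10, floor_size=8.0, rng=None):
    """Toy random scene composition from a bank of (points, class_id) objects.
    Physical plausibility and class balancing are not modelled here."""
    rng = np.random.default_rng(rng)
    scene_pts, sem, inst, placed = [], [], [], []
    for inst_id in range(num_objects):
        pts, cls = object_bank[rng.integers(len(object_bank))]
        pts = pts - pts.min(axis=0)            # rest the object on z = 0
        for _ in range(50):                    # rejection-sample a free spot
            offset = rng.uniform(0, floor_size, size=2)
            box = (offset, offset + pts[:, :2].max(axis=0))
            if all(b[1][0] < box[0][0] or box[1][0] < b[0][0] or
                   b[1][1] < box[0][1] or box[1][1] < b[0][1] for b in placed):
                placed.append(box)
                shifted = pts + np.array([offset[0], offset[1], 0.0])
                scene_pts.append(shifted)
                sem.append(np.full(len(shifted), cls))
                inst.append(np.full(len(shifted), inst_id))
                break
    return np.vstack(scene_pts), np.concatenate(sem), np.concatenate(inst)
```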