Data efficient learning for 3D computer vision
Format: Thesis (Doctor of Philosophy)
Language: English
Published: Nanyang Technological University, 2023
Online Access: https://hdl.handle.net/10356/172301
Institution: Nanyang Technological University
Summary: 3D computer vision uses computers to process and analyze 3D data from sensors in order to extract information about the 3D world. One key difference between 3D and 2D computer vision is the amount of information available to the system. 3D data contains information about the depth of objects in the scene, which can be used to locate and recognize objects more accurately. 2D data, on the other hand, lacks this depth information, which can make it more challenging to accurately locate and recognize objects. 3D computer vision also allows the integration of data from multiple sources such as 2D cameras, depth cameras, LiDAR scanners, and manually created 3D assets. 3D computer vision has the potential to enable new applications and technologies; for example, the ability to accurately understand the 3D structure of a scene could enable augmented reality applications and self-driving cars.
Just like 2D computer vision, 3D computer vision relies on large amounts of training data to train deep learning models. There are two primary ways to obtain 3D data for training: collecting real 3D data or creating synthetic 3D data. Advances in sensing technology, such as LiDAR, structured light sensors, Time-of-Flight (ToF) cameras, and RGB-D cameras, have made it much easier and more accurate to collect real 3D data. Photogrammetry offers another route to real 3D data, reconstructing the geometry of an object or environment from photographs taken at multiple angles. However, annotating 3D data can be more challenging than annotating 2D data due to the additional dimensions and complexity. Synthetic 3D data can be generated with 3D modeling and simulation software and inherently carries ground-truth labels from the creation process. Nevertheless, creating synthetic 3D data can be a complex and time-consuming process that requires a high degree of technical skill and artistic ability: it involves several stages, including modeling, texturing, lighting, and rendering, each of which demands a different set of skills that take time and practice to master. Therefore, the goals of this thesis are to reduce the annotation cost of real 3D data and to increase the size and diversity of synthetic datasets.
The first part of this thesis proposes two methods for weakly supervised learning in 3D semantic segmentation. The first method predicts point-level results from weak labels on 3D point clouds: our multi-path region mining module generates pseudo point-level labels, which are then used to train a point cloud segmentation network in a fully supervised manner. We discuss both scene- and subcloud-level weak labels and report experiments on both. The second method trains a semantic point cloud segmentation network with only a small portion of labeled points, using cross-sample feature reallocating and intra-sample feature redistribution modules to transfer features and propagate supervision signals to unlabeled points. Our weakly supervised methods produce results competitive with their fully supervised counterparts using only 10% and 1% of the labels.
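To make the sparse-supervision setting concrete, the sketch below shows a masked cross-entropy loss that uses only the small fraction of labeled points. It is an illustrative simplification, not the thesis implementation: the multi-path region mining and feature reallocation/redistribution modules are omitted, and the network `seg_net` and tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def sparse_point_loss(logits, labels, ignore_index=-1):
    """Cross-entropy over only the labeled points (e.g. 10% or 1% of the cloud)."""
    # logits: (B, N, C) per-point class scores; labels: (B, N) with unlabeled points set to ignore_index.
    return F.cross_entropy(
        logits.reshape(-1, logits.shape[-1]),
        labels.reshape(-1),
        ignore_index=ignore_index,
    )

# Usage with a placeholder per-point classifier `seg_net` (hypothetical):
# logits = seg_net(points)                    # (B, N, C)
# loss = sparse_point_loss(logits, labels)    # supervision flows only from labeled points
# loss.backward()
```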
The second part introduces Biharmonic Augmentation (BA), an efficient data augmentation method that produces plausible, smooth non-rigid deformations of 3D shapes to increase the diversity of point cloud data. We compute biharmonic coordinates and learn deformation prototypes, which a Coefficient Network (CoefNet) combines into an overall deformation. Our Adversarial Tuning (AdvTune) framework uses adversarial training to jointly optimize the CoefNet and the classification network, generating adaptive shape deformations based on the learner's state. Our experiments show that BA outperforms various point cloud augmentation methods across different networks.
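The core idea can be sketched as blending learned control-point displacement prototypes with per-shape coefficients and spreading the result over the whole shape through precomputed biharmonic coordinates. The sketch below is illustrative only: the coordinate matrix `W` is a random stand-in rather than a true biharmonic solve, the prototype count and coefficients are placeholders, and CoefNet and the adversarial tuning loop are omitted.

```python
import torch

def deform(points, W, prototypes, coeffs):
    """
    points:     (N, 3) original point cloud
    W:          (N, M) biharmonic coordinates tying M control points to all N points
    prototypes: (K, M, 3) learned control-point displacement prototypes
    coeffs:     (K,) per-shape blending coefficients (e.g. predicted by a CoefNet)
    """
    control_disp = torch.einsum('k,kmc->mc', coeffs, prototypes)  # (M, 3) blended control displacement
    return points + W @ control_disp                              # smooth, non-rigid deformation

# Tiny random example just to show the shapes involved (not real biharmonic coordinates).
points = torch.rand(1024, 3)
W = torch.softmax(torch.rand(1024, 16), dim=1)
prototypes = 0.05 * torch.randn(8, 16, 3)
coeffs = torch.softmax(torch.randn(8), dim=0)
augmented = deform(points, W, prototypes, coeffs)                 # (1024, 3) deformed cloud
```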
The third part proposes Text-Guided 3D Textured Shape Generation from Pseudo Supervision (TAPS3D), a novel framework for training a text-guided 3D shape generator using 2D multi-view images and pseudo captions. We construct captions from relevant words retrieved from the Contrastive Language-Image Pre-Training (CLIP) vocabulary and use low-level image regularization to increase geometry diversity and produce fine-grained textures. Our model generates explicit 3D textured shapes from a given text prompt without additional test-time optimization. Extensive experiments show the efficacy of our framework in generating high-fidelity 3D shapes relevant to the given text.
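A rough illustration of the pseudo-captioning step: retrieve the words whose CLIP text embeddings best match a rendered view's image embedding and compose them into a caption. This assumes the openai `clip` package is available; the candidate word list, caption template, and file name are placeholders rather than the thesis pipeline, which retrieves from the full CLIP vocabulary.

```python
import torch
import clip                      # openai CLIP package, assumed installed
from PIL import Image

device = "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical inputs: one rendered view of a shape and a small candidate word list.
vocab = ["chair", "wooden", "red", "sofa", "metal", "round"]
image = preprocess(Image.open("render.png")).unsqueeze(0).to(device)

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(clip.tokenize(vocab).to(device))

# Cosine similarity between the rendered view and each candidate word.
img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
top = (img_feat @ txt_feat.T).squeeze(0).topk(2).indices

pseudo_caption = "a " + " ".join(vocab[int(i)] for i in top)     # e.g. "a wooden chair"
print(pseudo_caption)
```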
In summary, this thesis proposes three approaches to address the data shortage problem in 3D computer vision tasks. First, we develop weakly supervised learning methods to reduce the annotation cost of 3D data. Second, we propose a data augmentation technique to artificially increase the size of 3D datasets. Third, we present a text-guided 3D data generation method to generate 3D data as needed. We conduct extensive experiments and achieve promising results on various datasets, demonstrating the effectiveness and potential of our approaches in addressing the challenges of 3D computer vision tasks.