Data efficient learning for 3D computer vision

Bibliographic Details
Main Author: Wei, Jiacheng
Other Authors: Yap Kim Hui
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University, 2023
Subjects: Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Online Access: https://hdl.handle.net/10356/172301
Institution: Nanyang Technological University
Description:

3D computer vision, the use of computers to process and analyze 3D data from sensors and extract information about the 3D world, is widely considered a promising field. One key difference between 3D and 2D computer vision is the amount of information available to the system. 3D data contains depth information about objects in the scene, which can be used to locate and recognize them more accurately; 2D data lacks this depth information, which makes accurate localization and recognition more challenging. 3D computer vision also allows data from multiple sources to be integrated, such as 2D cameras, depth cameras, LiDAR scanners, and manually created 3D assets, and it has the potential to enable new applications and technologies: the ability to accurately understand the 3D structure of a scene could, for example, power augmented reality applications and self-driving cars.

Like 2D computer vision, 3D computer vision relies on large amounts of training data for deep learning models. There are two primary ways to obtain 3D data for training: collecting real 3D data or creating synthetic 3D data. Advances in 3D sensors such as LiDAR, structured-light sensors, Time-of-Flight (ToF) cameras, and RGB-D cameras have made collecting real 3D data much easier and more accurate, and photogrammetry can reconstruct real 3D data from photos of an object or environment taken from different angles. However, annotating 3D data is more challenging than annotating 2D data because of the additional dimension and complexity. Synthetic 3D data can be generated with 3D modeling and simulation software and inherently carries ground-truth labels from the creation process. Nevertheless, creating synthetic 3D data is a complex and time-consuming process that requires a high degree of technical skill and artistic ability: it involves modeling, texturing, lighting, and rendering, each demanding a different skill set that takes time and practice to master. The goals of this thesis therefore lie in reducing the annotation cost of real 3D data and increasing the size and diversity of synthetic datasets.

The first part of this thesis proposes two methods for weakly supervised 3D semantic segmentation. The first method predicts point-level results from weak labels on 3D point clouds, using our multi-path region mining module to generate pseudo point-level labels for training a point cloud segmentation network in a fully supervised manner; we discuss and experiment with both scene- and subcloud-level weak labels. The second method trains a semantic point cloud segmentation network with only a small portion of labeled points, using cross-sample feature reallocating and intra-sample feature redistribution modules to transfer features and propagate supervision signals to unlabeled points. Our weakly supervised method produces results competitive with its fully supervised counterpart using only 10% and 1% of the labels.
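To make the sparse-label setting concrete, here is a minimal PyTorch sketch of a segmentation loss restricted to the annotated fraction of points; the names and shapes are illustrative assumptions, and the thesis's actual method additionally propagates supervision to unlabeled points through its feature reallocating and redistribution modules, which are not shown.

```python
import torch
import torch.nn.functional as F

def sparse_point_loss(logits, labels, labeled_mask):
    """Cross-entropy over labeled points only (a sketch, not the thesis code).

    logits:       (N, C) per-point class scores from a segmentation network
    labels:       (N,)   per-point labels (contents on unlabeled points unused)
    labeled_mask: (N,)   bool, True for the annotated ~1-10% of points
    """
    if labeled_mask.sum() == 0:
        # No annotated points in this sample: zero loss, graph preserved.
        return logits.sum() * 0.0
    return F.cross_entropy(logits[labeled_mask], labels[labeled_mask])
```

Because the gradient signal here comes only from the annotated points, mechanisms that transfer features between labeled and unlabeled points, as the thesis proposes, are what let supervision reach the rest of the cloud.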
The second part introduces Biharmonic Augmentation (BA), an efficient data augmentation method that produces plausible, smooth non-rigid deformations of 3D shapes to increase the diversity of point cloud data. We compute biharmonic coordinates and learn deformation prototypes, obtaining the overall deformation with a Coefficient Network (CoefNet). Our Adversarial Tuning (AdvTune) framework uses adversarial training to jointly train CoefNet and the classification network, generating adaptive shape deformations based on the learner's state. Our experiments show that BA outperforms various point cloud augmentation methods across different networks.
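As a rough illustration of the BA pipeline, the sketch below assumes precomputed biharmonic coordinates and shows how a small coefficient network might blend learned deformation prototypes into a whole-cloud deformation. The architecture, tensor shapes, and names are assumptions for exposition; the biharmonic solver and the AdvTune adversarial loop are omitted.

```python
import torch
import torch.nn as nn

class CoefNet(nn.Module):
    """Predicts per-shape weights over K deformation prototypes (a sketch)."""
    def __init__(self, feat_dim=1024, num_prototypes=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, num_prototypes))

    def forward(self, shape_feat):             # (B, feat_dim) global feature
        return self.mlp(shape_feat)            # (B, K) prototype coefficients

def deform(points, bi_coords, prototypes, coefs):
    """Blend prototype control-point displacements onto the full cloud.

    points:     (B, N, 3) input point clouds
    bi_coords:  (N, M)    biharmonic coordinates w.r.t. M control points
    prototypes: (K, M, 3) learned control-point displacement prototypes
    coefs:      (B, K)    coefficients from CoefNet
    """
    ctrl_disp = torch.einsum('bk,kmc->bmc', coefs, prototypes)      # (B, M, 3)
    point_disp = torch.einsum('nm,bmc->bnc', bi_coords, ctrl_disp)  # (B, N, 3)
    return points + point_disp
```

In the adversarial setting, the coefficients would be tuned to increase the classifier's loss while the classifier learns to resist the deformations, so the augmentation adapts to the learner's state.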
The third part proposes Text-Guided 3D Textured Shape Generation from Pseudo Supervision (TAPS3D), a novel framework for training a text-guided 3D shape generator with 2D multi-view images and pseudo captions. We construct captions from relevant words retrieved from the Contrastive Language-Image Pre-Training (CLIP) vocabulary and use low-level image regularization to increase geometric diversity and produce fine-grained textures. Our model generates explicit 3D textured shapes from given text without additional test-time optimization, and extensive experiments show the efficacy of our framework in generating high-fidelity 3D shapes relevant to the given text.
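A hedged sketch of the pseudo-captioning idea: score candidate words against a rendered view with the public CLIP package and keep the top-scoring ones as a caption. The candidate word list, prompt template, category name, and top-k choice below are illustrative assumptions, not the thesis's exact vocabulary filtering or caption construction.

```python
import torch
import clip                # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

candidate_words = ["red", "wooden", "sports", "vintage", "metallic"]  # assumed
image = preprocess(Image.open("render.png")).unsqueeze(0).to(device)
texts = clip.tokenize([f"a photo of a {w} car" for w in candidate_words]).to(device)

with torch.no_grad():
    img_feat = model.encode_image(image)       # (1, 512)
    txt_feat = model.encode_text(texts)        # (len(words), 512)
    sims = torch.cosine_similarity(img_feat, txt_feat)

top = sims.topk(2).indices.tolist()
pseudo_caption = "a " + " ".join(candidate_words[i] for i in top) + " car"
```

Captions assembled this way need no human annotation, which is what allows the generator to train from multi-view renderings alone.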
In summary, we propose three approaches to address the data shortage problem in 3D computer vision. First, we develop weakly supervised learning methods that reduce the annotation cost of 3D data. Second, we propose data augmentation techniques that artificially increase the size of 3D datasets. Third, we present a text-guided 3D data generation method that produces 3D data on demand. Extensive experiments on various datasets achieve promising results, demonstrating the effectiveness and potential of our approaches to the challenges of 3D computer vision tasks.

Citation: Wei, J. (2023). Data efficient learning for 3D computer vision. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/172301
DOI: 10.32657/10356/172301
School: School of Electrical and Electronic Engineering
License: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).