Feature learning for RGB-D scene understanding

Bibliographic Details
Main Author: Wang, Anran
Other Authors: Cai Jianfei
Format: Theses and Dissertations
Language:English
Published: 2016
Subjects:
Online Access:https://hdl.handle.net/10356/68538
Institution: Nanyang Technological University
Language: English
id sg-ntu-dr.10356-68538
record_format dspace
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
Singapore
content_provider NTU Library
collection DR-NTU
language English
topic DRNTU::Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision
spellingShingle DRNTU::Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision
Wang, Anran
Feature learning for RGB-D scene understanding
description Scene understanding is a fundamental problem in computer vision and is critical to applications such as robotics and augmented reality. It comprises many tasks, including scene labeling, object recognition and scene classification. Most previous scene understanding methods focus on outdoor scenes; indoor scene understanding is more challenging due to poor illumination and cluttered objects. With the wide availability of affordable RGB-D cameras such as the Kinect, indoor scene analysis has changed substantially, thanks to the rich 3D geometry information provided by depth measurements. Feature extraction is the key component of scene understanding tasks. Most early methods extract hand-crafted features, but the performance of such feature extractors depends heavily on the particular design choices and their combinations. The design process requires empirical understanding of the data and is therefore hard to extend systematically to different modalities. In addition, hand-crafted features usually capture only a subset of the recognition cues in the raw data and may discard useful information. In this research, we therefore focus on feature learning with raw data as input. In particular, we explore feature learning on three tasks of indoor scene understanding with RGB-D input.

Scene labeling: the aim is to densely assign a category label (e.g. table, TV) to each pixel in an image. Inspired by the success of unsupervised feature learning, we start by adapting an existing unsupervised feature learning technique to learn features directly from RGB-D images. Typically, better performance can be achieved by further applying feature encoding over the learned features to build "bag of words" style features. However, feature learning and feature encoding are usually performed separately, which may yield a suboptimal solution. We propose to optimize these two processes jointly to derive more discriminative features.
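The "bag of words" pipeline mentioned above can be sketched as a toy illustration: k-means over raw patches as a stand-in for the unsupervised feature learner, and a hard-assignment histogram as the encoder. All names and parameters here are illustrative, not from the thesis, and this sketch performs the two steps separately rather than jointly as the thesis proposes.

```python
import numpy as np

def learn_dictionary(patches, k, iters=10, seed=0):
    """Toy unsupervised feature learning: k-means over raw patches."""
    rng = np.random.default_rng(seed)
    centers = patches[rng.choice(len(patches), k, replace=False)]
    for _ in range(iters):
        # assign each patch to its nearest center
        d = np.linalg.norm(patches[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned patches
        for j in range(k):
            if np.any(labels == j):
                centers[j] = patches[labels == j].mean(axis=0)
    return centers

def bow_encode(patches, centers):
    """'Bag of words' encoding: histogram of nearest-center assignments."""
    d = np.linalg.norm(patches[:, None] - centers[None], axis=2)
    hist = np.bincount(d.argmin(axis=1), minlength=len(centers))
    return hist / hist.sum()

# toy RGB-D patches: 3x3 patches with 4 channels (R, G, B, depth), flattened
patches = np.random.default_rng(1).normal(size=(200, 36))
centers = learn_dictionary(patches, k=8)
feature = bow_encode(patches, centers)   # one 8-dim descriptor per image
```

A jointly optimized version would back-propagate the encoding objective into the dictionary instead of fixing it after k-means.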
Object recognition: most feature learning methods for RGB-D object recognition either learn features for the individual modalities independently or treat RGB-D simply as undifferentiated four-channel data; neither adequately exploits the complementary relationship between the two modalities. To address this, we propose a general convolutional neural network (CNN) based multi-modal learning method for RGB-D object recognition. Our multi-modal layer is designed not only to discover the most discriminative features for each modality, but also to harness the complementary relationship between the two modalities.

Scene classification: methods that leverage local information for scene classification share a similar pipeline: first densely extract CNN features at different locations and scales of an image, and then apply an encoding method. However, for state-of-the-art feature encoding techniques such as the Fisher vector (FV), the components of the Gaussian mixture model (GMM) are derived from densely sampled local features, so many components are likely to be noisy and uninformative. This noisy property of local features has not been well considered in existing work. Further accounting for the FV features from different modalities, we propose a modality- and component-aware feature fusion framework for RGB-D scene classification.

In this thesis, various experiments evaluate the proposed techniques against state-of-the-art methods on several RGB-D databases. Encouraging results show that the proposed techniques significantly boost performance on the studied scene understanding tasks.
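As a rough illustration of the Fisher vector encoding referred to above, the following minimal sketch computes only the mean-gradient part of the FV for a set of local descriptors under a diagonal-covariance GMM. Function and variable names are illustrative; the full FV also includes weight and variance gradients, and the thesis's contribution (modality- and component-aware fusion) is not reproduced here.

```python
import numpy as np

def fisher_vector_means(x, pi, mu, sigma):
    """Mean-gradient part of the Fisher vector for descriptors x (N, D)
    under a diagonal-covariance GMM with weights pi (K,), means mu (K, D)
    and standard deviations sigma (K, D)."""
    # whitened differences to each component mean: shape (N, K, D)
    diff = (x[:, None, :] - mu[None]) / sigma[None]
    # log-likelihood of each descriptor under each component (up to a constant)
    logp = -0.5 * (diff ** 2).sum(-1) - np.log(sigma).sum(-1) + np.log(pi)
    logp -= logp.max(axis=1, keepdims=True)        # for numerical stability
    gamma = np.exp(logp)
    gamma /= gamma.sum(axis=1, keepdims=True)      # posterior responsibilities
    # normalized gradient w.r.t. each component mean, flattened to (K * D,)
    g = (gamma[:, :, None] * diff).sum(0) / (len(x) * np.sqrt(pi)[:, None])
    return g.ravel()
```

Components whose responsibilities are dominated by noisy local features contribute noisy blocks to this K*D vector, which is the issue the proposed component-aware fusion addresses.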
author2 Cai Jianfei
author_facet Cai Jianfei
Wang, Anran
format Theses and Dissertations
author Wang, Anran
author_sort Wang, Anran
title Feature learning for RGB-D scene understanding
title_short Feature learning for RGB-D scene understanding
title_full Feature learning for RGB-D scene understanding
title_fullStr Feature learning for RGB-D scene understanding
title_full_unstemmed Feature learning for RGB-D scene understanding
title_sort feature learning for rgb-d scene understanding
publishDate 2016
url https://hdl.handle.net/10356/68538
_version_ 1759857331033604096
spelling sg-ntu-dr.10356-685382023-03-04T00:51:17Z Feature learning for RGB-D scene understanding Wang, Anran Cai Jianfei Cham Tat Jen School of Computer Science and Engineering DRNTU::Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision DOCTOR OF PHILOSOPHY (SCE) 2016-05-26T08:08:53Z 2016-05-26T08:08:53Z 2016 Thesis Wang, A. (2016). Feature learning for RGB-D scene understanding. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/68538 10.32657/10356/68538 en 118 p. application/pdf