Feature learning for RGB-D scene understanding

Bibliographic Details
Main Author: Wang, Anran
Other Authors: Cai Jianfei
Format: Theses and Dissertations
Language:English
Published: 2016
Subjects:
Online Access:https://hdl.handle.net/10356/68538
Institution: Nanyang Technological University
Language: English
id sg-ntu-dr.10356-68538
record_format dspace
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
Singapore
content_provider NTU Library
collection DR-NTU
language English
topic DRNTU::Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision
spellingShingle DRNTU::Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision
Wang, Anran
Feature learning for RGB-D scene understanding
description Scene understanding is a fundamental problem in computer vision and is critical to applications such as robotics and augmented reality. It comprises many tasks, including scene labeling, object recognition and scene classification. Most previous scene understanding methods focus on outdoor scenes; indoor scene understanding is more challenging due to poor illumination and cluttered objects. With the wide availability of affordable RGB-D cameras such as the Kinect, indoor scene analysis has changed substantially, thanks to the rich 3D geometry information provided by depth measurements. Feature extraction is the key component of scene understanding tasks. Most early methods extract hand-crafted features, but the performance of such feature extractors depends heavily on the particular design choices and their combinations. The design process requires empirical understanding of the data and is therefore hard to extend systematically to different modalities. In addition, hand-crafted features usually capture only a subset of the recognition cues in the raw data and may discard useful information. In this research, we therefore focus on feature learning with raw data as input. In particular, we explore feature learning on three tasks of indoor scene understanding with RGB-D input.

Scene labeling: the aim is to densely assign a category label (e.g. table, TV) to each pixel in an image. Inspired by the success of unsupervised feature learning, we start by adapting an existing unsupervised feature learning technique to learn features directly from RGB-D images. Typically, better performance can be achieved by further applying feature encoding over the learned features to build "bag of words" style features. However, feature learning and feature encoding are usually performed separately, which may yield a suboptimal solution. We propose to optimize these two processes jointly to derive more discriminative features.
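The "bag of words" pipeline mentioned above can be sketched as a toy illustration: k-means over raw patches as a stand-in for the unsupervised feature learner, and a hard-assignment histogram as the encoder. All names and parameters here are illustrative, not from the thesis, and this sketch performs the two steps separately rather than jointly as the thesis proposes.

```python
import numpy as np

def learn_dictionary(patches, k, iters=10, seed=0):
    """Toy unsupervised feature learning: k-means over raw patches."""
    rng = np.random.default_rng(seed)
    centers = patches[rng.choice(len(patches), k, replace=False)]
    for _ in range(iters):
        # assign each patch to its nearest center
        d = np.linalg.norm(patches[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned patches
        for j in range(k):
            if np.any(labels == j):
                centers[j] = patches[labels == j].mean(axis=0)
    return centers

def bow_encode(patches, centers):
    """'Bag of words' encoding: histogram of nearest-center assignments."""
    d = np.linalg.norm(patches[:, None] - centers[None], axis=2)
    hist = np.bincount(d.argmin(axis=1), minlength=len(centers))
    return hist / hist.sum()

# toy RGB-D patches: 3x3 patches with 4 channels (R, G, B, depth), flattened
patches = np.random.default_rng(1).normal(size=(200, 36))
centers = learn_dictionary(patches, k=8)
feature = bow_encode(patches, centers)   # one 8-dim descriptor per image
```

A jointly optimized version would back-propagate the encoding objective into the dictionary instead of fixing it after k-means.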
Object recognition: most feature learning methods for RGB-D object recognition either learn features for the individual modalities independently or treat RGB-D simply as undifferentiated four-channel data; neither adequately exploits the complementary relationship between the two modalities. To address this, we propose a general convolutional neural network (CNN) based multi-modal learning method for RGB-D object recognition. Our multi-modal layer is designed not only to discover the most discriminative features for each modality, but also to harness the complementary relationship between the two modalities.

Scene classification: methods that leverage local information for scene classification share a similar pipeline: first densely extract CNN features at different locations and scales of an image, and then apply an encoding method. However, for state-of-the-art feature encoding techniques such as the Fisher vector (FV), the components of the Gaussian mixture model (GMM) are derived from densely sampled local features, so many components are likely to be noisy and uninformative. This noisy property of local features has not been well considered in existing work. Further accounting for the FV features from different modalities, we propose a modality- and component-aware feature fusion framework for RGB-D scene classification.

In this thesis, various experiments evaluate the proposed techniques against state-of-the-art methods on several RGB-D databases. Encouraging results show that the proposed techniques significantly boost performance on the studied scene understanding tasks.
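As a rough illustration of the Fisher vector encoding referred to above, the following minimal sketch computes only the mean-gradient part of the FV for a set of local descriptors under a diagonal-covariance GMM. Function and variable names are illustrative; the full FV also includes weight and variance gradients, and the thesis's contribution (modality- and component-aware fusion) is not reproduced here.

```python
import numpy as np

def fisher_vector_means(x, pi, mu, sigma):
    """Mean-gradient part of the Fisher vector for descriptors x (N, D)
    under a diagonal-covariance GMM with weights pi (K,), means mu (K, D)
    and standard deviations sigma (K, D)."""
    # whitened differences to each component mean: shape (N, K, D)
    diff = (x[:, None, :] - mu[None]) / sigma[None]
    # log-likelihood of each descriptor under each component (up to a constant)
    logp = -0.5 * (diff ** 2).sum(-1) - np.log(sigma).sum(-1) + np.log(pi)
    logp -= logp.max(axis=1, keepdims=True)        # for numerical stability
    gamma = np.exp(logp)
    gamma /= gamma.sum(axis=1, keepdims=True)      # posterior responsibilities
    # normalized gradient w.r.t. each component mean, flattened to (K * D,)
    g = (gamma[:, :, None] * diff).sum(0) / (len(x) * np.sqrt(pi)[:, None])
    return g.ravel()
```

Components whose responsibilities are dominated by noisy local features contribute noisy blocks to this K*D vector, which is the issue the proposed component-aware fusion addresses.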
author2 Cai Jianfei
author_facet Cai Jianfei
Wang, Anran
format Theses and Dissertations
author Wang, Anran
author_sort Wang, Anran
title Feature learning for RGB-D scene understanding
title_short Feature learning for RGB-D scene understanding
title_full Feature learning for RGB-D scene understanding
title_fullStr Feature learning for RGB-D scene understanding
title_full_unstemmed Feature learning for RGB-D scene understanding
title_sort feature learning for rgb-d scene understanding
publishDate 2016
url https://hdl.handle.net/10356/68538
_version_ 1759857331033604096
spelling sg-ntu-dr.10356-685382023-03-04T00:51:17Z Feature learning for RGB-D scene understanding Wang, Anran Cai Jianfei Cham Tat Jen School of Computer Science and Engineering DRNTU::Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision DOCTOR OF PHILOSOPHY (SCE) 2016-05-26T08:08:53Z 2016-05-26T08:08:53Z 2016 Thesis Wang, A. (2016). Feature learning for RGB-D scene understanding. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/68538 10.32657/10356/68538 en 118 p. application/pdf