Structure-aware multimodal feature fusion for RGB-D scene classification and beyond

While convolutional neural networks (CNNs) have been excellent for object recognition, the greater spatial variability in scene images typically means that the standard full-image CNN features are suboptimal for scene classification. In this article, we investigate a framework allowing greater spati...

Bibliographic Details
Main Authors:	Wang, Anran; Cai, Jianfei; Lu, Jiwen; Cham, Tat-Jen
Other Authors:	School of Computer Science and Engineering
Format:	Article
Language:	English
Published:	2020
Subjects:	Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision; Feature Fusion; Multimodal Analytics
Online Access:	https://hdl.handle.net/10356/138263
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-138263
record_format	dspace
spelling	sg-ntu-dr.10356-138263 | 2020-04-30T01:24:52Z | Structure-aware multimodal feature fusion for RGB-D scene classification and beyond | Wang, Anran; Cai, Jianfei; Lu, Jiwen; Cham, Tat-Jen | School of Computer Science and Engineering | Institute for Media Innovation (IMI) | Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision | Feature Fusion | Multimodal Analytics | While convolutional neural networks (CNNs) have been excellent for object recognition, the greater spatial variability in scene images typically means that the standard full-image CNN features are suboptimal for scene classification. In this article, we investigate a framework allowing greater spatial flexibility, in which the Fisher vector (FV)-encoded distribution of local CNN features, obtained from a multitude of region proposals per image, is considered instead. The CNN features are computed from an augmented pixel-wise representation consisting of multiple modalities of RGB, HHA, and surface normals, as extracted from RGB-D data. More significantly, we make two postulates: (1) component sparsity: that only a small variety of region proposals and their corresponding FV GMM components contribute to scene discriminability, and (2) modal nonsparsity: that features from all modalities are encouraged to coexist. In our proposed feature fusion framework, these are implemented through regularization terms that apply group lasso to GMM components and exclusive group lasso across modalities. By learning and combining regressors for both proposal-based FV features and global CNN features, we are able to achieve state-of-the-art scene classification performance on the SUNRGBD Dataset and NYU Depth Dataset V2. Moreover, we further apply our feature fusion framework on an action recognition task to demonstrate that our framework can be generalized for other multimodal well-structured features. In particular, for action recognition, we enforce interpart sparsity to choose more discriminative body parts, and intermodal nonsparsity to make informative features from both appearance and motion modalities coexist. Experimental results on the JHMDB and MPII Cooking Datasets show that our feature fusion is also very effective for action recognition, achieving very competitive performance compared with the state of the art. | NRF (Natl Research Foundation, S’pore) | 2020-04-30T01:24:52Z | 2020-04-30T01:24:52Z | 2018 | Journal Article | Wang, A., Cai, J., Lu, J., & Cham, T.-J. (2018). Structure-aware multimodal feature fusion for RGB-D scene classification and beyond. ACM Transactions on Multimedia Computing, Communications, and Applications, 14(2s), 39-. doi:10.1145/3115932 | 1551-6857 | https://hdl.handle.net/10356/138263 | 10.1145/3115932 | 2s | 14 | en | ACM Transactions on Multimedia Computing, Communications, and Applications | © 2018 Association for Computing Machinery (ACM). All rights reserved.
institution	Nanyang Technological University
building	NTU Library
country	Singapore
collection	DR-NTU
language	English
topic	Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision; Feature Fusion; Multimodal Analytics
description	While convolutional neural networks (CNNs) have been excellent for object recognition, the greater spatial variability in scene images typically means that the standard full-image CNN features are suboptimal for scene classification. In this article, we investigate a framework allowing greater spatial flexibility, in which the Fisher vector (FV)-encoded distribution of local CNN features, obtained from a multitude of region proposals per image, is considered instead. The CNN features are computed from an augmented pixel-wise representation consisting of multiple modalities of RGB, HHA, and surface normals, as extracted from RGB-D data. More significantly, we make two postulates: (1) component sparsity—that only a small variety of region proposals and their corresponding FV GMM components contribute to scene discriminability, and (2) modal nonsparsity—that features from all modalities are encouraged to coexist. In our proposed feature fusion framework, these are implemented through regularization terms that apply group lasso to GMM components and exclusive group lasso across modalities. By learning and combining regressors for both proposal-based FV features and global CNN features, we are able to achieve state-of-the-art scene classification performance on the SUNRGBD Dataset and NYU Depth Dataset V2. Moreover, we further apply our feature fusion framework on an action recognition task to demonstrate that our framework can be generalized for other multimodal well-structured features. In particular, for action recognition, we enforce interpart sparsity to choose more discriminative body parts, and intermodal nonsparsity to make informative features from both appearance and motion modalities coexist. Experimental results on the JHMDB and MPII Cooking Datasets show that our feature fusion is also very effective for action recognition, achieving very competitive performance compared with the state of the art.
author2	School of Computer Science and Engineering
author_facet	School of Computer Science and Engineering; Wang, Anran; Cai, Jianfei; Lu, Jiwen; Cham, Tat-Jen
format	Article
author	Wang, Anran; Cai, Jianfei; Lu, Jiwen; Cham, Tat-Jen
author_sort	Wang, Anran
title	Structure-aware multimodal feature fusion for RGB-D scene classification and beyond
publishDate	2020
url	https://hdl.handle.net/10356/138263
_version_	1681058277939478528
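
The description names two regularizers: group lasso over GMM components (component sparsity) and exclusive group lasso across modalities (modal nonsparsity). The record does not give the article's exact formulation, so the following is only a minimal sketch using the standard textbook definitions of the two penalties. The weight layout `(n_components, n_modalities, n_features)`, the function names, and the toy values are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def group_lasso_penalty(W: np.ndarray) -> float:
    """Sum of L2 norms, one group per GMM component (axis 0).

    Assumed layout: W has shape (n_components, n_modalities, n_features).
    Whole components are pushed to zero together.
    """
    per_component = W.reshape(W.shape[0], -1)
    return float(np.linalg.norm(per_component, axis=1).sum())


def exclusive_group_lasso_penalty(W: np.ndarray) -> float:
    """Sum of squared L1 norms, one group per modality (axis 1).

    Weights compete inside a modality, while the squared outer term
    discourages putting all the weight on a single modality.
    """
    per_modality_l1 = np.abs(W).sum(axis=(0, 2))
    return float(np.square(per_modality_l1).sum())


# Hypothetical toy weights: 2 components x 2 modalities x 1 feature.
concentrated = np.zeros((2, 2, 1))
concentrated[0, 0, 0] = 2.0  # all weight on one modality

spread = np.zeros((2, 2, 1))
spread[0, 0, 0] = 1.0
spread[0, 1, 0] = 1.0  # same total absolute weight, split across modalities

print(exclusive_group_lasso_penalty(concentrated))
print(exclusive_group_lasso_penalty(spread))
```

With the same total absolute weight, the exclusive penalty is smaller when the weight is split across both modalities than when it sits on one, which is the sense in which such a term encourages modalities to coexist. The group lasso term, by contrast, is cheapest when entire components are zeroed out.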

Similar Items