Semi-CNN architecture for effective spatio-temporal learning in action recognition

This paper introduces a fusion convolutional architecture for efficient learning of spatio-temporal features in video action recognition. Unlike 2D convolutional neural networks (CNNs), 3D CNNs can be applied directly to consecutive frames to extract spatio-temporal features. The aim of this work is to fuse convolution layers from 2D and 3D CNNs to allow temporal encoding with fewer parameters than 3D CNNs. We adopt transfer learning from pre-trained 2D CNNs for spatial feature extraction, followed by temporal encoding, before connecting to 3D convolution layers at the top of the architecture. We construct our fusion architecture, semi-CNN, based on three popular models, VGG-16, ResNets and DenseNets, and compare its performance with that of the corresponding 3D models. Our empirical results on the action recognition dataset UCF-101 demonstrate that our fusion of 1D, 2D and 3D convolutions outperforms the 3D model of the same depth while using fewer parameters and reducing overfitting. Our semi-CNN architecture achieves a 16-30% boost in top-1 accuracy when evaluated on 16-frame input videos.
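To make the fusion concrete, below is a minimal PyTorch sketch of the idea the abstract describes: a 2D convolutional stem extracts spatial features from each frame, a time-only convolution stands in for the 1D temporal encoding, and 3D convolution layers sit at the top. This is an illustrative assumption, not the authors' code: the SemiCNN class name, layer counts, and channel widths are invented for the example, whereas the paper's actual stems are transferred from pre-trained VGG-16, ResNet or DenseNet weights.

    import torch
    import torch.nn as nn

    class SemiCNN(nn.Module):
        """Illustrative semi-CNN: 2D spatial stem -> 1D temporal encoding -> 3D head."""
        def __init__(self, num_classes=101):  # UCF-101 has 101 action classes
            super().__init__()
            # 2D spatial stem; in the paper this part is transferred from a
            # pre-trained 2D CNN (VGG-16, ResNet or DenseNet)
            self.spatial = nn.Sequential(
                nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
                nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            )
            # 1D temporal encoding, expressed here as a 3D conv with a time-only kernel
            self.encode_time = nn.Conv3d(128, 128, kernel_size=(3, 1, 1), padding=(1, 0, 0))
            # 3D spatio-temporal layers at the top of the architecture
            self.head3d = nn.Sequential(
                nn.Conv3d(128, 256, 3, padding=1), nn.ReLU(inplace=True),
                nn.AdaptiveAvgPool3d(1),
            )
            self.classifier = nn.Linear(256, num_classes)

        def forward(self, x):
            # x: (batch, channels, time, height, width), e.g. a 16-frame clip
            b, c, t, h, w = x.shape
            # fold time into the batch so the 2D stem runs on every frame
            x = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
            x = self.spatial(x)
            # unfold back to (batch, channels, time, h', w') for the 3D layers
            _, c2, h2, w2 = x.shape
            x = x.reshape(b, t, c2, h2, w2).permute(0, 2, 1, 3, 4)
            x = self.head3d(self.encode_time(x))
            return self.classifier(x.flatten(1))

    # Usage with a 16-frame RGB clip, matching the evaluation setting above
    clip = torch.randn(2, 3, 16, 112, 112)  # (batch, C, T, H, W); 112x112 is assumed
    logits = SemiCNN()(clip)                # -> shape (2, 101)

Folding time into the batch dimension is what lets the 2D layers reuse pre-trained image weights unchanged; restricting full 3D convolutions to the top of the network is where the parameter saving over an all-3D model of the same depth comes from.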


Bibliographic Details
Main Authors: Leong, Mei Chee, Prasad, Dilip K., Lee, Yong Tsui, Lin, Feng
Other Authors: School of Mechanical and Aerospace Engineering; Interdisciplinary Graduate School (IGS); School of Computer Science and Engineering; Institute for Media Innovation (IMI)
Format: Article
Language: English
Published: 2021
Subjects: Engineering::Mechanical engineering; Action Recognition; Spatio-temporal Features
Online Access: https://hdl.handle.net/10356/146192
Institution: Nanyang Technological University
Published in: Applied Sciences, 10(2), 557 (2020)
ISSN: 2076-3417
DOI: 10.3390/app10020557
Scopus: 2-s2.0-85081201454
Author ORCID iDs: 0000-0001-8123-8982, 0000-0002-3693-6973, 0000-0002-1199-5870
Citation: Leong, M. C., Prasad, D. K., Lee, Y. T. & Lin, F. (2020). Semi-CNN architecture for effective spatio-temporal learning in action recognition. Applied Sciences, 10(2), 557. doi:10.3390/app10020557
License: © 2020 The Author(s). Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Collection: DR-NTU (NTU Library, Nanyang Technological University)