Semantic cues enhanced multimodality multistream CNN for action recognition

This paper addresses video-based action recognition by exploiting an advanced multistream convolutional neural network (CNN) to fully use semantics-derived multiple modalities in both the spatial (appearance) and temporal (motion) domains, since the performance of CNN-based action recognition methods depends heavily on two factors: semantic visual cues and the network architecture. Our work consists of two major parts. First, to extract useful human-related semantics accurately, we propose a novel spatiotemporal saliency-based video object segmentation (STS) model. By fusing distinctive saliency maps, computed from the object signatures of complementary object detection approaches, a refined STS map can be obtained; in this way, the various challenges posed by realistic video can be handled jointly. Based on the estimated saliency maps, an energy function is constructed to segment two semantic cues: the actor and one distinctive acting part of the actor. Second, we modify the architecture of the two-stream network (TS-Net) to design a multistream network consisting of three TS-Nets, one per extracted semantic cue, which can exploit deeper abstract visual features of multiple modalities at multiple spatiotemporal scales. Importantly, the performance of action recognition is significantly boosted when the captured human-related semantics are integrated into our framework. Experiments on four public benchmarks (JHMDB, HMDB51, UCF-Sports, and UCF101) demonstrate that the proposed method outperforms state-of-the-art algorithms.
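The abstract's first component fuses saliency maps from complementary object detectors into a refined spatiotemporal saliency (STS) map, then segments the actor with an energy function. The following is a minimal NumPy sketch of that fusion-and-segmentation idea; the function names, the pixel-wise weighted average, and the plain threshold standing in for the paper's energy-based segmentation are all illustrative assumptions, not the authors' implementation.

import numpy as np


def fuse_saliency(maps, weights):
    """Fuse complementary per-frame saliency maps into one refined
    spatiotemporal saliency (STS) map via a weighted pixel-wise average."""
    fused = sum(w * m for w, m in zip(weights, maps))
    fused = fused / sum(weights)
    # Rescale to [0, 1] so a fixed threshold is comparable across frames.
    lo, hi = fused.min(), fused.max()
    return (fused - lo) / (hi - lo + 1e-8)


def segment_actor(sts_map, threshold=0.5):
    """Crude stand-in for the paper's energy-based segmentation:
    a binary actor mask obtained by thresholding the STS map."""
    return (sts_map >= threshold).astype(np.uint8)


# Example: two 4x4 saliency maps from hypothetical detectors.
maps = [np.random.rand(4, 4), np.random.rand(4, 4)]
sts = fuse_saliency(maps, weights=[0.6, 0.4])
mask = segment_actor(sts)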

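The second component assembles three two-stream networks (TS-Nets), one per semantic cue (full frame, segmented actor, acting part), each pairing an appearance (RGB) stream with a motion (optical-flow) stream. The PyTorch sketch below shows one plausible shape of that design; the tiny backbones and the score-averaging fusion are assumptions chosen for brevity, not the paper's architectural details.

import torch
import torch.nn as nn


class TSNet(nn.Module):
    """One two-stream network (TS-Net): a spatial (RGB) stream plus a
    temporal stream over stacked optical-flow fields."""

    def __init__(self, num_classes: int, flow_channels: int = 20):
        super().__init__()
        self.spatial = self._make_stream(3, num_classes)
        self.temporal = self._make_stream(flow_channels, num_classes)

    @staticmethod
    def _make_stream(in_channels: int, num_classes: int) -> nn.Module:
        # Deliberately tiny backbone; a real system would use a deeper
        # pretrained CNN per stream.
        return nn.Sequential(
            nn.Conv2d(in_channels, 32, 7, stride=2, padding=3), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, num_classes),
        )

    def forward(self, rgb: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        # Late fusion: average the class scores of the two streams.
        return 0.5 * (self.spatial(rgb) + self.temporal(flow))


class MultiStreamNet(nn.Module):
    """Three TS-Nets over three semantic cues, fused by score averaging."""

    def __init__(self, num_classes: int):
        super().__init__()
        self.cues = nn.ModuleList(TSNet(num_classes) for _ in range(3))

    def forward(self, rgb_cues, flow_cues):
        # rgb_cues / flow_cues: three tensors each, one per cue
        # (full frame, actor region, acting-part region).
        scores = [net(r, f) for net, r, f in zip(self.cues, rgb_cues, flow_cues)]
        return torch.stack(scores).mean(dim=0)


# Example: batch of 2 clips, 10 classes, 20 stacked flow channels.
net = MultiStreamNet(num_classes=10)
rgb = [torch.randn(2, 3, 112, 112) for _ in range(3)]
flow = [torch.randn(2, 20, 112, 112) for _ in range(3)]
print(net(rgb, flow).shape)  # torch.Size([2, 10])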
Bibliographic Details
Main Authors: Tu, Zhigang; Xie, Wei; Dauwels, Justin; Li, Baoxin; Yuan, Junsong
Other Authors: School of Electrical and Electronic Engineering
Format: Journal Article
Language: English
Published: 2019
Citation: Tu, Z., Xie, W., Dauwels, J., Li, B., & Yuan, J. (2019). Semantic cues enhanced multimodality multistream CNN for action recognition. IEEE Transactions on Circuits and Systems for Video Technology, 29(5), 1423-1437. doi:10.1109/TCSVT.2018.2830102
ISSN: 1051-8215
Funding: MOE (Min. of Education, S'pore)
Rights: © 2018 IEEE. All rights reserved.
Subjects: Engineering::Electrical and electronic engineering; Action Recognition; Multi-stream CNN
Online Access: https://hdl.handle.net/10356/142212
Institution: Nanyang Technological University
Collection: DR-NTU, NTU Library