Action-stage emphasized spatiotemporal VLAD for video action recognition

Despite outstanding performance in image recognition, convolutional neural networks (CNNs) do not yet achieve the same impressive results on action recognition in videos. This is partially due to the inability of CNNs to model long-range temporal structures, especially those involving the individual action stages that are critical to human action recognition. In this paper, we propose a novel action-stage (ActionS) emphasized spatiotemporal Vector of Locally Aggregated Descriptors (ActionS-STVLAD) method to aggregate informative deep features across the entire video, guided by adaptive video feature segmentation and adaptive segment feature sampling (AVFS-ASFS). In our ActionS-STVLAD encoding approach, AVFS-ASFS selects the key frame features and automatically splits the corresponding deep features into segments, with the features in each segment belonging to a temporally coherent ActionS. Then, based on the extracted key frame feature in each segment, a flow-guided warping technique is introduced to detect and discard redundant feature maps, while the informative ones are aggregated using our similarity weighting. Furthermore, we exploit an RGBF modality to capture motion-salient regions of the RGB images that correspond to action activity. Extensive experiments are conducted on four public benchmarks: HMDB51, UCF101, Kinetics, and ActivityNet. Results show that our method effectively pools useful deep features spatiotemporally, leading to state-of-the-art performance for video-based action recognition.
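
The core of the method is VLAD-style aggregation of local deep features, with each feature weighted by how informative it is. As a rough illustration of that idea (a minimal sketch, not the authors' code; the function name, the uniform example weights, and the normalization choices are assumptions):

```python
# Weighted VLAD aggregation: sum of weighted residuals between local
# descriptors and their nearest codebook center, then power + L2 normalize.
import numpy as np

def weighted_vlad(descriptors, codebook, weights):
    """descriptors: (N, D) local deep features; codebook: (K, D) cluster
    centers (e.g. from k-means); weights: (N,) per-descriptor importance."""
    K, D = codebook.shape
    # Assign each descriptor to its nearest cluster center.
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    assignments = dists.argmin(axis=1)

    vlad = np.zeros((K, D))
    for k in range(K):
        idx = assignments == k
        if idx.any():
            # Weighted residuals: informative descriptors contribute more.
            residuals = descriptors[idx] - codebook[k]
            vlad[k] = (weights[idx, None] * residuals).sum(axis=0)

    vlad = vlad.ravel()
    # Power normalization followed by L2 normalization, as is standard for VLAD.
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))
    return vlad / (np.linalg.norm(vlad) + 1e-12)

# Example: 100 descriptors of dimension 64, an 8-word codebook, uniform weights.
rng = np.random.default_rng(0)
desc = rng.standard_normal((100, 64))
centers = rng.standard_normal((8, 64))
print(weighted_vlad(desc, centers, np.ones(100)).shape)  # (512,)
```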

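The flow-guided warping step described in the abstract can be pictured as warping the key-frame feature map toward a candidate frame and measuring how similar the result is: near-identical maps are redundant and dropped, while dissimilar ones are kept with a weight. A minimal sketch, assuming nearest-neighbor warping and a cosine-similarity threshold (both illustrative choices, not the paper's exact formulation):

```python
# Flow-guided redundancy detection: warp the key-frame feature map by the
# optical flow, then compare it with the current frame's feature map.
import numpy as np

def warp_features(feat, flow):
    """Warp a (H, W, C) feature map by a (H, W, 2) flow field (dx, dy),
    using nearest-neighbor sampling for simplicity."""
    H, W, _ = feat.shape
    ys, xs = np.mgrid[0:H, 0:W]
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, W - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, H - 1)
    return feat[src_y, src_x]

def similarity_weight(key_feat, cur_feat, flow, redundant_thresh=0.95):
    """Cosine similarity between flow-warped key-frame features and the
    current features: redundant frames (similarity above the threshold)
    get weight 0; informative ones get 1 - similarity."""
    warped = warp_features(key_feat, flow).ravel()
    cur = cur_feat.ravel()
    sim = warped @ cur / (np.linalg.norm(warped) * np.linalg.norm(cur) + 1e-12)
    return 0.0 if sim > redundant_thresh else 1.0 - sim
```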

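Similarly, the RGBF modality combines an RGB frame with optical-flow information to highlight motion-salient regions. One plausible reading, sketched here with an assumed flow-magnitude mask (the paper's actual fusion rule may differ):

```python
# Illustrative RGBF-style input: suppress static regions of an RGB frame
# with a normalized optical-flow-magnitude mask.
import numpy as np

def rgbf(rgb, flow):
    """rgb: (H, W, 3) in [0, 1]; flow: (H, W, 2). Returns an RGB frame
    whose static regions are attenuated, keeping motion-salient pixels."""
    mag = np.linalg.norm(flow, axis=2)
    mask = mag / (mag.max() + 1e-12)  # ~0 where static, ~1 where moving
    return rgb * mask[..., None]
```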
Bibliographic Details
Main Authors: Tu, Zhigang, Li, Hongyan, Zhang, Dejun, Dauwels, Justin, Li, Baoxin, Yuan, Junsong
Other Authors: School of Electrical and Electronic Engineering
Format: Article
Language: English
Published: 2019
Subjects: Engineering::Electrical and electronic engineering; Action Recognition; Feature Encoding
Online Access:https://hdl.handle.net/10356/150982
Institution: Nanyang Technological University
Citation: Tu, Z., Li, H., Zhang, D., Dauwels, J., Li, B. & Yuan, J. (2019). Action-stage emphasized spatiotemporal VLAD for video action recognition. IEEE Transactions on Image Processing, 28(6), 2799-2812. https://dx.doi.org/10.1109/TIP.2018.2890749
ISSN: 1057-7149
DOI: 10.1109/TIP.2018.2890749
PMID: 30605101
Scopus: 2-s2.0-85063468385
ORCID: 0000-0001-5003-2260, 0000-0001-9129-534X, 0000-0002-4390-1568, 0000-0002-7324-7034
Rights: © 2019 IEEE. All rights reserved.