Robust and efficient deep learning methods for vision-based action recognition


Bibliographic Details
Main Author: Xu, Yuecong
Other Authors: Mao Kezhi
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2021
Subjects: Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Online Access:https://hdl.handle.net/10356/153169
Institution: Nanyang Technological University
Language: English
Description:

Vision-based action recognition, which performs action recognition based solely on RGB frames, has received strong research interest thanks to its wide applications in various fields, e.g., surveillance, smart homes, and autonomous driving. Significant progress has been made in vision-based action recognition thanks to the development of recognition technologies, particularly deep learning methods, which have proven their effectiveness in visual recognition tasks such as image classification. Compared to static images, videos contain additional information due to the additional temporal dimension, including both temporal and spatiotemporal correlation features. Therefore, the key to robust and efficient action recognition lies in the effective and efficient utilization of the temporal and spatiotemporal correlation features embedded within videos.

In this thesis, we first investigate extracting temporal features in a robust and efficient manner. Although methods for extracting temporal features have been proposed, they either require the computation or estimation of optical flow, which demands high computational power and large storage resources, or extract only linear features along the temporal dimension, which results in inferior performance. To extract temporal features without relying on optical flow, we propose the Attentive Correlated Temporal Feature (ACTF), which leverages inter-frame correlation features and exploits both bilinear and linear correlations between successive frames at the regional level. By excluding optical flow estimation or calculation, ACTF can be combined with any spatial feature extraction network under the two-stream structure for end-to-end training.

Meanwhile, capturing long-range spatiotemporal dependencies is an effective strategy for extracting spatiotemporal correlation features. Previous works have proposed methods utilizing either hand-crafted features or stacks of convolutional or recurrent modules, both of which are computationally inefficient and cause difficulty in network optimization. While the more recent non-local block, inspired by the non-local means method, can extract long-range dependencies without hindering network optimization, it significantly increases the parameter size and computational cost of the networks it is inserted into. To extract robust long-range dependencies more efficiently, we further improve the non-local neural network by proposing a novel long-range spatiotemporal dependency extraction module, the Pyramid Non-Local (PNL) module. It extends the original non-local block by incorporating regional feature correlations at multiple scales, additionally addressing the spatiotemporal correlations between different regions while significantly decreasing computation cost.
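To make the multi-scale regional correlation idea concrete, the sketch below shows a generic pyramid-style non-local block in PyTorch: queries come from the full-resolution feature map, while keys and values come from region features obtained by pooling the map to a few coarse grids. This is an illustrative reconstruction, not the thesis implementation; the module name, pooling scales, and layer sizes are assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PyramidNonLocalSketch(nn.Module):
        """Illustrative multi-scale (pyramid-style) non-local block for videos.

        Queries come from the full-resolution feature map; keys and values come
        from region features pooled to a few coarse spatio-temporal grids, so
        every position attends to regions at multiple scales instead of to
        every other position.
        """

        def __init__(self, channels, scales=((1, 8, 8), (1, 4, 4), (1, 2, 2))):
            super().__init__()
            inter = channels // 2
            self.query = nn.Conv3d(channels, inter, kernel_size=1)
            self.key = nn.Conv3d(channels, inter, kernel_size=1)
            self.value = nn.Conv3d(channels, inter, kernel_size=1)
            self.out = nn.Conv3d(inter, channels, kernel_size=1)
            self.scales = scales  # assumed pyramid of (T', H', W') region grids

        def forward(self, x):                                 # x: (N, C, T, H, W)
            n, _, t, h, w = x.shape
            q = self.query(x).flatten(2)                      # (N, C', T*H*W)
            keys, values = [], []
            for s in self.scales:
                pooled = F.adaptive_avg_pool3d(x, s)          # (N, C, T', H', W')
                keys.append(self.key(pooled).flatten(2))      # (N, C', R_s)
                values.append(self.value(pooled).flatten(2))
            k = torch.cat(keys, dim=2)                        # (N, C', R), R = total regions
            v = torch.cat(values, dim=2)
            attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)   # (N, T*H*W, R)
            y = (attn @ v.transpose(1, 2)).transpose(1, 2)        # (N, C', T*H*W)
            return x + self.out(y.reshape(n, -1, t, h, w))        # residual connection

Because keys and values are drawn from a handful of pooled regions rather than from every position, the attention matrix has T·H·W × R entries instead of (T·H·W)², which is where the efficiency gain over a vanilla non-local block comes from.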
Besides the development of recognition technologies, the progress made in vision-based action recognition can also be attributed to the development of large-scale video datasets, which enable the effective training of deep learning models. However, the majority of current research focuses on videos captured under normal illumination, partly because current benchmark datasets for vision-based action recognition are typically collected from web videos shot under normal illumination. Yet we argue that vision-based action recognition should not be constrained to normally illuminated videos. Vision-based action recognition in dark videos is also useful in various scenarios, e.g., night surveillance and self-driving at night. Such a task has rarely been researched, partly due to the lack of sufficient datasets. To this end, this thesis bridges the data gap and pioneers vision-based action recognition in dark videos by collecting a novel dataset: the Action Recognition in the Dark (ARID) dataset. In this thesis, the ARID dataset is analyzed thoroughly with a comprehensive benchmark of current deep learning methods.

Meanwhile, although the introduction of ARID pioneers vision-based action recognition in dark videos and bridges the gap between the absence of dark video datasets and the need for such research, its scale is relatively small compared to current large-scale video datasets. One solution for training robust models in domains with limited labeled data is to transfer models learned in well-labeled domains. However, models trained in one domain do not generalize well to another due to domain shift, which manifests as a distribution discrepancy between domains. Domain adaptation (DA) approaches address domain shift and enable networks to be applied to different scenarios. Although various image DA approaches have been proposed in recent years, there is limited research on Video Domain Adaptation (VDA), owing to the complexity of adapting the different modalities of features in videos, which include both temporal and spatiotemporal correlation features. We argue that correlation features are highly associated with action classes and have proven effective for accurate video feature extraction in supervised vision-based action recognition, yet the correlation features of the same action differ across domains due to domain shift. This thesis therefore develops the Adversarial Correlation Adaptation Network (ACAN), which aligns action videos across domains by aligning their pixel correlations, and builds a novel HMDB-ARID dataset with a larger domain shift in an effort to leverage current datasets for vision-based action recognition in dark videos.

We further observe that while VDA methods enable the learning of transferable features across domains, they generally assume that the video source and target domains share an identical label space, an assumption that may not hold in real-world applications. Instead, Partial Domain Adaptation (PDA) is a practical and general domain adaptation scenario that relaxes the fully shared label space assumption such that the source label space subsumes the target one; it is more challenging than DA due to negative transfer caused by source-only classes. For videos, such negative transfer can be triggered by both spatial and temporal features, leading to an even more challenging Partial Video Domain Adaptation (PVDA) problem. This thesis pioneers the PVDA problem by proposing a novel Partial Adversarial Temporal Attentive Network (PATAN), which utilizes both spatial and temporal features to filter out source-only classes. This thesis further introduces new benchmarks to facilitate research on PVDA, covering a wide range of PVDA scenarios.
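As a rough illustration of how source-only classes can be filtered in partial domain adaptation, the sketch below shows the generic class-weighting idea: classes that unlabeled target clips rarely activate are likely source-only and are down-weighted in the source classification loss. This is a simplified, generic sketch and not PATAN's actual mechanism; the function names and the normalisation are assumptions.

    import torch
    import torch.nn.functional as F

    def estimate_class_weights(target_probs):
        # target_probs: (B, num_source_classes) softmax outputs on unlabeled target clips.
        # Classes the target data rarely predicts are likely source-only classes,
        # so they receive small weights and contribute less negative transfer.
        w = target_probs.mean(dim=0)      # average predicted probability per class
        return w / w.max()                # normalise so the largest weight is 1

    def weighted_source_loss(source_logits, source_labels, class_weights):
        # Class-weighted cross-entropy on labeled source clips: probable
        # source-only classes are down-weighted in the adaptation objective.
        return F.cross_entropy(source_logits, source_labels,
                               weight=class_weights.detach())

According to the abstract, PATAN performs this filtering using both spatial and temporal features rather than the single set of predictions used in this sketch.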
In summary, this thesis contributes to robust and efficient vision-based action recognition by introducing two algorithms for extracting robust and efficient temporal and spatiotemporal correlation features, and by pioneering research on robust vision-based action recognition in dark videos, breaking through the constraint of current vision-based action recognition research being conducted only on normally illuminated videos.
School: School of Electrical and Electronic Engineering
Research Centre: Centre for System Intelligence and Efficiency (EXQUISITUS)
Degree: Doctor of Philosophy
Citation: Xu, Y. (2021). Robust and efficient deep learning methods for vision-based action recognition. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/153169
DOI: 10.32657/10356/153169
License: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).