Human action recognition using artificial intelligence

Bibliographic Details
Main Author: Wang, Haoyu
Other Authors: Yap Kim Hui
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2022
Online Access: https://hdl.handle.net/10356/157639
Institution: Nanyang Technological University
Description
Summary: Video action recognition is a specific task within video understanding that aims to generate an action label, consisting of a verb and a noun, for a given video segment. Like many other video understanding tasks, it remains an active research area and is also widely applied in real-life applications such as autonomous driving and human-robot interaction. Previous researchers have established several different approaches, including hand-crafted features, two-stream networks, and 3D CNNs. The fundamental difference among these approaches is the spatio-temporal modelling they use to capture both the spatial details and the temporal relations in video segments, which are key to video tasks. However, because modelling such information is complex, a trade-off must always be made between accuracy and computational cost. Besides the prediction model, the dataset is also crucial to video tasks, as its scale and its variety of action categories help models pre-trained on it perform better when deployed in real-life applications. In this project, a survey of existing action recognition methods and datasets was conducted to understand the problems mentioned above comprehensively and to evaluate and compare the performance of existing state-of-the-art methods. An efficient deep learning model was then proposed to take advantage of 1) the cheap computation of 2D CNNs and 2) the long-range temporal modelling ability of two-stream networks and 3D CNNs. The largest dataset in egocentric vision was selected as the benchmark on which to compare the proposed model against its baseline. Extensive experiments were designed and conducted to analyse the results, which showed that the proposed method achieves a single-digit accuracy improvement over the state-of-the-art.
This report presents the insights gained from the survey of video action recognition models and datasets, the design of an efficient model, the experimental results with comparisons and discussion, and, most importantly, reflections on the design and development of the model and its performance. It ends with a short conclusion and a glimpse of future work.