Multi-view human action recognition in meeting scenarios

Bibliographic Details
Main Author: Yin, Haixiang
Other Authors: Tan Yap Peng
Format: Thesis-Master by Coursework
Language: English
Published: Nanyang Technological University 2021
Subjects:
Online Access:https://hdl.handle.net/10356/153357
Institution: Nanyang Technological University
Description
Summary: Due to continuous advances in deep learning and computer vision, human action recognition has become one of the most popular research topics, and various methods have been proposed to tackle it. This project implements a multi-view human action recognition system with a focus on the spatio-temporal localization of actions in a meeting scenario. Existing human action recognition systems often suffer from human-to-human or human-to-object occlusion, which can greatly reduce recognition accuracy, and most existing multi-view systems do not address the spatio-temporal localization of actions. In meeting scenarios, however, occlusion is frequent and, once it occurs, can persist for a long time, so existing methods and datasets do not work well in this setting. This project aims to address these limitations. We first process a multi-view meeting dataset, the AMI (Augmented Multi-party Interaction) meeting corpus, so that it can be used for multi-view action recognition. We then use the SlowFast network as the backbone for action recognition and Torchreid (a PyTorch library for deep-learning person re-identification) for instance association after extracting features from the different camera viewpoints. Finally, the system applies late fusion to combine information from the left and right viewpoints into the center viewpoint, which suffers from occlusion, improving the system's robustness to occlusion. The proposed method improves mAP (mean average precision) on the AMI meeting corpus by up to nearly 10 percent compared to single-view recognition approaches.
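The late-fusion step described above can be sketched as a weighted average of per-class action scores produced independently on each view. This is a minimal illustration only: the function name, the three-view weighting scheme, and the example scores are assumptions for demonstration, not the thesis's actual implementation.

```python
import numpy as np

def late_fuse(center_scores, left_scores, right_scores,
              weights=(0.5, 0.25, 0.25)):
    """Fuse per-class action scores from three camera views.

    Each argument is a 1-D array of per-class confidence scores for the
    same person, produced independently by the recognition backbone on
    each view (instance association across views is assumed done, e.g.
    via re-identification). The weights are illustrative placeholders.
    """
    w_c, w_l, w_r = weights
    fused = (w_c * np.asarray(center_scores)
             + w_l * np.asarray(left_scores)
             + w_r * np.asarray(right_scores))
    return fused

# Example: the occluded center view is unsure about a gesture,
# but the unoccluded side views recover it after fusion.
center = [0.2, 0.7]   # per-class scores [gesture, idle]
left   = [0.9, 0.1]
right  = [0.8, 0.2]
fused = late_fuse(center, left, right)
print(fused)  # the "gesture" class now outscores "idle"
```

Weighted averaging is one of the simplest late-fusion rules; alternatives such as max-pooling scores across views or learning the fusion weights would slot into the same interface.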