Multi-view human action recognition in meeting scenarios
Main Author: | |
---|---|
Other Authors: | |
Format: | Thesis-Master by Coursework |
Language: | English |
Published: | Nanyang Technological University, 2021 |
Subjects: | |
Online Access: | https://hdl.handle.net/10356/153357 |
Institution: | Nanyang Technological University |
Summary: | Driven by continued progress in deep learning and computer vision, human action recognition has become one of the most popular research topics, and various methods have been proposed to tackle it. This project implements a multi-view human action recognition system that focuses on spatio-temporal localization of actions in a meeting scenario. Existing human action recognition systems often suffer from human-to-human or human-to-object occlusion, which can greatly reduce recognition accuracy. In meeting scenarios, occlusion is frequent and, once it occurs, can persist for a long time. Moreover, most existing multi-view action recognition systems do not address the spatio-temporal localization of actions. Hence, existing methods and datasets do not work well in this scenario.
This project aims to address the above limitations. We first process a multi-view meeting dataset, the AMI (Augmented Multi-party Interaction) meeting corpus, so that it can be used for multi-view action recognition. We then use the SlowFast network as the backbone for action recognition and, after learning features of the input from the different camera viewpoints, use Torchreid (a PyTorch library for deep-learning person re-identification) for instance association. Finally, the system applies late fusion to merge the information from the left and right viewpoints into the center viewpoint, which suffers from occlusion. This improves the system's ability to deal with the occlusion problem.
The proposed method improves mAP (mean Average Precision) on the AMI meeting corpus by up to nearly 10 percent compared to single-view recognition approaches. |
---|
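
The late-fusion step described in the summary can be illustrated with a minimal sketch. The record does not state the fusion rule used in the thesis, so this example assumes a simple weighted average of per-class action scores for one person who has already been matched across views. The function name `late_fuse`, the default weights, and the score layout are all hypothetical and are not taken from the thesis.

```python
from typing import List, Optional, Sequence


def late_fuse(
    center: Optional[Sequence[float]],
    left: Optional[Sequence[float]],
    right: Optional[Sequence[float]],
    weights: Sequence[float] = (0.5, 0.25, 0.25),  # assumed weights, not from the thesis
) -> List[float]:
    """Fuse per-class action scores for one person across three camera views.

    Each argument is a list of per-class scores from one view, or None when
    the person could not be associated in that view. The weights of the
    available views are renormalized so that they sum to 1.
    """
    # Keep only the views in which the person was actually matched.
    views = [(s, w) for s, w in zip((center, left, right), weights) if s is not None]
    if not views:
        raise ValueError("at least one view must provide scores")

    num_classes = len(views[0][0])
    if any(len(s) != num_classes for s, _ in views):
        raise ValueError("all views must score the same set of classes")

    total_weight = sum(w for _, w in views)
    # Weighted average of the scores, class by class.
    return [
        sum(w * s[i] for s, w in views) / total_weight
        for i in range(num_classes)
    ]


if __name__ == "__main__":
    # The center view is occluded, so its scores are weak; the side views help.
    fused = late_fuse(center=[0.2, 0.1], left=[0.8, 0.1], right=[0.6, 0.3])
    print(fused)
```

Averaging scores after each view has been classified independently is what makes this "late" fusion: the side views can raise the score of an action that the occluded center view missed, without retraining the per-view model.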