Reasoning over multiple human-human interaction activities

Bibliographic Details
Main Author: Perez, Mauricio Lisboa
Other Authors: Alex Chichung KOT
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2021
Subjects:
Online Access:https://hdl.handle.net/10356/151926
Institution: Nanyang Technological University
Description
Summary: Humans are naturally social, engaging in a wide range of interactions with other individuals. These interaction activities can involve many people or just two, and comprise mutual actions of diverse complexity and distinct intentions, which makes it very challenging for machines to identify them autonomously. In this thesis, we explore and propose different techniques to reason about human interactions under distinct scenarios, exploiting multiple types of data.

Person-person mutual action recognition (also referred to as interaction recognition) is an important research branch of human activity analysis. Current methods in the field, mainly dominated by Convolutional Neural Networks (CNNs), Graph Convolutional Networks (GCNs) and Long Short-Term Memory networks (LSTMs), often rely on complicated architectures and mechanisms that embed the relationship between the two persons in the architecture itself, to ensure that the interaction patterns can be properly learned. The main contribution of the first work in this thesis is a simpler yet very powerful architecture, named Interaction Relational Network (IRN), which uses minimal prior knowledge about the structure of the human body: we drive the network to identify by itself how to relate the body parts of the interacting individuals. To better represent the interaction, we define two types of relationships, each leading to a specialized architecture and model. These relationship models are then fused into a single architecture, leveraging both streams of information to further enhance the relational reasoning capability. Furthermore, we define structured pairwise operations, distance and motion, to extract additional information from each pair of joints. Coupled with an LSTM, the proposed method is capable of strong sequential relational reasoning. Our solution achieves state-of-the-art performance on the traditional interaction recognition datasets SBU Kinect Interaction and UT-Interaction, as well as on the mutual actions of the large-scale NTU RGB+D dataset, and obtains competitive performance on the interactions subset of NTU RGB+D 120.

Research on group activity recognition mostly relies on the standard two-stream approach (RGB and optical flow) for input features. Few works have explored explicit pose information, and none has used it directly to reason about the interactions between persons. In the second work of this thesis, we leverage skeleton information to learn the interactions between individuals directly from it. In our proposed Group Interaction Relational Network (GIRN), multiple relationship types are inferred by independent modules that describe the relations between body joints pair by pair. In addition to the joint relations, we also experiment with the previously unexplored relationship between individuals and relevant objects (e.g., the volleyball). The distinct relations of the individuals are then merged through an attention mechanism that gives more importance to the individuals most relevant for distinguishing the group activity. We evaluate our method on the Volleyball dataset, obtaining results competitive with the state of the art despite using a single modality. Our experiments thus demonstrate the potential of skeleton-based approaches for modeling multi-person interactions.
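To make the pairwise relational reasoning described above concrete, the following is a minimal PyTorch sketch of an IRN-style model: every inter-person joint pair, augmented with the pairwise distance and each joint's frame-to-frame motion, is scored by a shared MLP, the pair scores are summed per frame, and an LSTM performs the sequential reasoning. All names (IRNSketch, g_theta, f_phi) and sizes are illustrative assumptions, not the thesis implementation.

import torch
import torch.nn as nn

class IRNSketch(nn.Module):
    def __init__(self, num_joints=15, coord_dim=3, hidden=128, num_classes=8):
        super().__init__()
        # g_theta scores one inter-person joint pair; its input is both joints'
        # coordinates, their distance, and both joints' motion vectors.
        pair_dim = 4 * coord_dim + 1
        self.g_theta = nn.Sequential(
            nn.Linear(pair_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.f_phi = nn.Linear(hidden, num_classes)

    def forward(self, p1, p2):
        # p1, p2: (batch, time, joints, coord_dim) skeletons of the two persons
        B, T, J, C = p1.shape
        # motion = displacement of each joint between consecutive frames
        m1 = torch.diff(p1, dim=1, prepend=p1[:, :1])
        m2 = torch.diff(p2, dim=1, prepend=p2[:, :1])
        # build all J*J inter-person joint pairs per frame
        a = p1.unsqueeze(3).expand(B, T, J, J, C)
        b = p2.unsqueeze(2).expand(B, T, J, J, C)
        ma = m1.unsqueeze(3).expand(B, T, J, J, C)
        mb = m2.unsqueeze(2).expand(B, T, J, J, C)
        dist = (a - b).norm(dim=-1, keepdim=True)       # pairwise distance
        pairs = torch.cat([a, b, dist, ma, mb], dim=-1)
        rel = self.g_theta(pairs).sum(dim=(2, 3))       # sum over all pairs
        out, _ = self.lstm(rel)                         # sequential reasoning
        return self.f_phi(out[:, -1])                   # classify last state

A forward pass takes the two persons' skeleton sequences, e.g. IRNSketch()(torch.randn(2, 20, 15, 3), torch.randn(2, 20, 15, 3)); summing over pairs is what keeps the model agnostic to any fixed body structure.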
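The attention-based merging described for GIRN can be sketched in the same spirit: given one relation embedding per individual, a learned score weights each person before the group-level classification. This is a hedged illustration under assumed shapes and names, not the published model.

import torch
import torch.nn as nn

class PersonAttentionFusion(nn.Module):
    def __init__(self, dim=128, num_classes=8):
        super().__init__()
        self.score = nn.Linear(dim, 1)            # relevance score per person
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, person_feats):
        # person_feats: (batch, persons, dim) relation embeddings
        w = torch.softmax(self.score(person_feats), dim=1)  # attention weights
        group = (w * person_feats).sum(dim=1)               # weighted merge
        return self.classifier(group)                       # group activity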
CCTV systems have long been used to enforce safety, for example to detect fights arising from many different situations, but their effectiveness is limited because they rely on continuous, specialized human supervision, motivating automated solutions for improved efficiency. However, previous work on fight detection is either too superficial (classification of short clips) or unrealistic (movies, sports, staged fights), and none has performed detection of actual fights on long-duration CCTV recordings. The third work of this thesis tackles precisely this problem. First, we propose CCTV-Fights, a novel and challenging dataset containing 1,000 videos of real fights, with more than 8 hours of annotated CCTV footage. We then propose a benchmark pipeline, on which we assess the impact of different feature extractors (two-stream CNN, 3D CNN and a local interest point descriptor) as well as different classifiers (such as an end-to-end CNN, an LSTM and an SVM). The results confirm how challenging the problem is, and highlight the importance of explicit motion information for improving performance.
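As an illustration of how per-clip scores could be turned into fight detections on a long recording, the sketch below slides a window over the video and merges consecutive positive windows into segments. The score_clip callable stands in for any extractor/classifier combination; the function name and its parameters are assumptions for illustration, not the thesis pipeline.

def detect_fights(score_clip, num_frames, win=64, stride=16, thresh=0.5):
    """score_clip(start, end) -> fight probability for frames [start, end)."""
    segments, start = [], None
    for s in range(0, max(num_frames - win, 0) + 1, stride):
        if score_clip(s, s + win) >= thresh:
            if start is None:
                start = s                               # open a new segment
        elif start is not None:
            segments.append((start, s - stride + win))  # close at last hit
            start = None
    if start is not None:                               # runs to end of video
        segments.append((start, num_frames))
    return segments                                     # (start, end) frames

For example, detect_fights(lambda a, b: 0.9 if 128 <= a < 256 else 0.1, 512) returns [(128, 304)], i.e. one fight segment spanning the positively scored windows.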