Detecting human-object interactions for human activity analysis
| Field | Value |
|---|---|
| Main Author | |
| Other Authors | |
| Format | Thesis-Doctor of Philosophy |
| Language | English |
| Published | Nanyang Technological University, 2023 |
| Subjects | |
| Online Access | https://hdl.handle.net/10356/165042 |
| Institution | Nanyang Technological University |
Summary:

A long-standing goal in the field of computer vision is to develop models that can understand the rich visual world and recognize the diverse activities within it. We have witnessed significant strides toward this goal over the last few years, owing to the availability of large-scale data and rapid advances in computing resources and deep learning algorithms. Computers can now detect person instances in images or videos, classify actions, and recognize the interacting objects. However, most of these advances focus on assigning one or a few labels from a pre-determined, small category space (e.g., riding a bicycle, opening a bottle), which uncovers only the tip of the iceberg of diverse daily human activities.
In this thesis, we develop models that detect human interactions with a wide range of common objects. First, we assemble a large-vocabulary dataset and propose a one-stage detector that takes an image as input and directly outputs a set of interaction tuples. We demonstrate that human visual cues (e.g., human pose and spatial location) provide informative priors for searching for interacting objects and recognizing interactions, and we empirically show that the proposed one-stage HOI detector detects 23 times more interactions (from 600 to 14,000) while achieving a 25% mAP improvement over state-of-the-art methods. Second, we develop a model that embeds visual objects and category names into a joint embedding space and use it to identify novel objects based on knowledge obtained from known object categories (a sketch of this idea follows below); the proposed zero-shot HOI detector achieves over a 24% mAP improvement on human interactions with unseen objects. Third, we introduce a model that learns to detect human-object interactions from natural language descriptions instead of pre-determined discrete labels, and we demonstrate that it transfers to 1,800 unseen interactions with a significant mAP improvement (from 6.21 to 10.04). Finally, we argue that these models offer many practical benefits and immediate, valuable applications: the proposed HOI detectors can extract discriminative action features for downstream tasks such as video summarization and human activity understanding. We expect these techniques to serve as a stepping stone toward a more comprehensive understanding of human activities.
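The abstract does not include implementation details, but the joint visual-text embedding idea is concrete enough to sketch. The following minimal PyTorch sketch is an illustrative assumption on our part, not the thesis's actual architecture: the module name, feature dimensions, and cosine-similarity scoring are all hypothetical. It shows how visual object features and category-name embeddings might be projected into one shared space and compared, which is what allows scoring object categories never seen during training.

```python
# Minimal sketch of scoring objects in a joint visual-text embedding
# space. All names, dimensions, and the design itself are illustrative
# assumptions, not the thesis's actual architecture.
import torch
import torch.nn.functional as F


class JointEmbeddingScorer(torch.nn.Module):
    def __init__(self, visual_dim=2048, text_dim=512, embed_dim=256):
        super().__init__()
        # Project region features and category-name embeddings into a
        # shared space so they can be compared directly.
        self.visual_proj = torch.nn.Linear(visual_dim, embed_dim)
        self.text_proj = torch.nn.Linear(text_dim, embed_dim)

    def forward(self, region_feats, name_embeds):
        # region_feats: (N, visual_dim) features of detected object regions.
        # name_embeds:  (C, text_dim) embeddings of category names; C may
        #               include categories unseen during training.
        v = F.normalize(self.visual_proj(region_feats), dim=-1)
        t = F.normalize(self.text_proj(name_embeds), dim=-1)
        # Cosine similarity between every region and every category name.
        return v @ t.T  # (N, C) score matrix
```

Because novel categories enter only through their name embeddings, nothing in this scoring step needs to change when a category unseen at training time is added at test time.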
A secondary objective of this thesis is to address the challenges posed by limited training data. Compared with the essentially unlimited human activities in the visual world, only a small portion of interactions can be represented by labeled data. From this perspective, our contribution lies in designing algorithms that handle novel interactions beyond the collected category space, including unseen objects and novel combinations of seen actions and objects. From the modeling perspective, instead of designing complex multi-stage frameworks, our contribution lies in one-stage architectures that take an image as input and directly produce interaction tuples with a single network. We formulate the task as a multi-task optimization problem and learn all module components with a shared objective function; a sketch of such an objective follows below. We show that our methods outperform state-of-the-art HOI detection approaches and can help facilitate visual understanding of the rich human activities in our world.
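To make the multi-task formulation concrete, here is a minimal sketch of what a shared objective for one-stage HOI detection might look like, assuming standard localization and classification loss terms. The term names, weights, and tensor layout are hypothetical and not taken from the thesis.

```python
# Minimal sketch of a shared multi-task objective for one-stage HOI
# detection. Loss terms and weights are illustrative assumptions.
import torch.nn.functional as F


def hoi_loss(outputs, targets, w_box=1.0, w_obj=1.0, w_act=1.0):
    # Localization term for the predicted human/object boxes.
    box_loss = F.l1_loss(outputs["boxes"], targets["boxes"])
    # Object-category classification term.
    obj_loss = F.cross_entropy(outputs["obj_logits"], targets["obj_labels"])
    # Action classification term; actions can co-occur, so treat it as
    # multi-label and use binary cross-entropy over the logits.
    act_loss = F.binary_cross_entropy_with_logits(
        outputs["act_logits"], targets["act_labels"]
    )
    # One shared objective: all modules are trained jointly end to end.
    return w_box * box_loss + w_obj * obj_loss + w_act * act_loss
```

A single weighted sum like this lets one backward pass update every module of the network, which is what distinguishes a one-stage design from multi-stage pipelines trained piecewise.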