Detecting human-object interactions for human activity analysis

Bibliographic Details
Main Author: Wang, Suchen
Other Authors: Tan Yap Peng
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2023
Subjects:
Online Access:https://hdl.handle.net/10356/165042
id sg-ntu-dr.10356-165042
record_format dspace
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering
description A long-standing goal in computer vision is to develop models that can understand the rich visual world and recognize the diverse activities within it. We have witnessed significant strides toward this goal over the last few years, owing to the availability of large-scale data and rapid advances in computing resources and deep learning algorithms. Computers can now detect person instances in images or videos, classify actions, and recognize the interacting objects. However, most of these advances focus on assigning one or a few labels from a small, pre-determined category space (e.g., riding a bicycle, opening a bottle), which uncovers only the tip of the iceberg of diverse human daily activities. In this thesis, we develop models that detect human interactions with a wide range of common objects. In particular, we first assemble a large-vocabulary dataset and propose a one-stage detector that takes an image as input and directly outputs a set of interaction tuples. We demonstrate that human visual cues (e.g., human pose and spatial location) can provide informative priors for locating interacting objects and recognizing interactions. We empirically show that the proposed one-stage HOI detector can detect 23 times more interactions (from 600 to 14,000) and achieve a 25% mAP improvement over state-of-the-art methods. Second, we develop a model that embeds visual objects and category names into a joint embedding space. We present a way to identify novel objects based on knowledge obtained from known object categories. We empirically show that the proposed zero-shot HOI detector achieves over 24% mAP improvement on human interactions with unseen objects. Third, we introduce a model that learns to detect human-object interactions from natural language descriptions instead of pre-determined discrete labels.
We demonstrate that this model transfers to 1,800 unseen interactions with a significant mAP improvement (from 6.21 to 10.04). Finally, we argue that these models offer many practical benefits and immediately valuable applications. The proposed HOI detectors can be applied to extract discriminative action features for downstream tasks, e.g., video summarization and human activity understanding. We expect these techniques to serve as a stepping stone toward a more comprehensive understanding of human activities. A secondary objective of this thesis is to address the challenges posed by limited training data. Compared with the virtually unlimited human activities in the visual world, labeled data can inherently represent only a small portion of interactions. From this perspective, our contribution lies in designing algorithms that handle potential novel interactions beyond the collected category space, including unseen objects and novel combinations of seen actions and objects. From the modeling perspective, instead of designing complex multi-stage frameworks, our contribution lies in the design of one-stage architectures that take an image as input and directly produce interaction tuples with a single network. We formulate the task as a multi-task optimization problem and learn all module components with a shared objective function. We show that our methods outperform state-of-the-art HOI detection approaches and can help facilitate the understanding of rich human activities in our visual world.
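The joint-embedding idea in the abstract — placing visual objects and category names in a shared space so that unseen categories can be recognized simply by embedding their names — can be sketched as follows. This is an illustrative toy, not the thesis implementation: the category names, the 4-dimensional embeddings, and the `classify` helper are all hypothetical stand-ins for learned representations.

```python
# Toy sketch of zero-shot recognition via a joint embedding space:
# a visual feature is assigned to the category whose name embedding
# is most similar (cosine similarity). An unseen category needs only
# its name embedding to become recognizable.
import numpy as np

def normalize(v):
    """L2-normalize along the last axis so dot products give cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def classify(visual_feat, name_embeddings, names):
    """Return the category name whose embedding is closest to the visual feature."""
    sims = normalize(name_embeddings) @ normalize(visual_feat)
    return names[int(np.argmax(sims))]

# Hypothetical 4-d embeddings standing in for learned ones;
# imagine "skateboard" was never seen during detector training.
names = ["bicycle", "bottle", "skateboard"]
embs = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0, 0.0]])
feat = np.array([0.1, 0.05, 0.9, 0.2])  # visual feature near "skateboard"
print(classify(feat, embs, names))      # -> skateboard
```

In a real system the name embeddings would come from a pretrained language model and the visual feature from the detector's object branch, trained so that matching pairs align in the shared space.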
author2 Tan Yap Peng
format Thesis-Doctor of Philosophy
author Wang, Suchen
author_sort Wang, Suchen
title Detecting human-object interactions for human activity analysis
publisher Nanyang Technological University
publishDate 2023
url https://hdl.handle.net/10356/165042
_version_ 1772826096986750976
spelling sg-ntu-dr.10356-165042 2023-07-04T16:24:05Z Detecting human-object interactions for human activity analysis Wang, Suchen Tan Yap Peng School of Electrical and Electronic Engineering EYPTan@ntu.edu.sg Engineering::Computer science and engineering Doctor of Philosophy 2023-03-10T03:27:58Z 2023-03-10T03:27:58Z 2023 Thesis-Doctor of Philosophy Wang, S. (2023). Detecting human-object interactions for human activity analysis. Doctoral thesis, Nanyang Technological University, Singapore.
https://hdl.handle.net/10356/165042 10.32657/10356/165042 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University