Detecting human-object interactions for human activity analysis

Bibliographic Details
Main Author: Wang, Suchen
Other Authors: Tan Yap Peng
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2023
Subjects:
Online Access:https://hdl.handle.net/10356/165042
id sg-ntu-dr.10356-165042
record_format dspace
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering
description A long-standing goal in computer vision is to develop models that can understand the rich visual world and recognize the diverse activities within it. We have witnessed significant strides toward this goal over the last few years, owing to the availability of large-scale data and rapid advances in computing resources and deep learning algorithms. Computers can now detect person instances in images or videos, classify actions, and recognize the interacting objects. However, most of these advances focus on assigning one or a few labels from a small, pre-determined category space (e.g., riding a bicycle, opening a bottle), which uncovers only the tip of the iceberg of diverse human daily activities. In this thesis, we develop models that detect human interactions with a wide range of common objects. In particular, we first assemble a large-vocabulary dataset and propose a one-stage detector that takes an image as input and directly outputs a set of interaction tuples. We demonstrate that human visual cues (e.g., human pose and spatial location) can provide informative priors for locating interacting objects and recognizing interactions. We empirically show that the proposed one-stage HOI detector can detect 23 times more interactions (from 600 to 14,000) and achieve a 25% mAP improvement over state-of-the-art methods. Second, we develop a model that embeds visual objects and category names into a joint embedding space. We present a way to identify novel objects based on knowledge obtained from known object categories. We empirically show that the proposed zero-shot HOI detector achieves over 24% mAP improvement on human interactions with unseen objects. Third, we introduce a model that learns to detect human-object interactions from natural language descriptions instead of pre-determined discrete labels.
We demonstrate that this model transfers to 1,800 unseen interactions with a significant mAP improvement (from 6.21 to 10.04). Finally, we argue that these models offer many practical benefits and immediately valuable applications. The proposed HOI detectors can be applied to extract discriminative action features for downstream tasks, e.g., video summarization and human activity understanding. We expect these techniques to serve as a stepping stone toward a more comprehensive understanding of human activities. A secondary objective of this thesis is to address the challenges posed by limited training data. Compared with the virtually unlimited human activities in the visual world, labeled data can inherently represent only a small portion of interactions. From this perspective, our contribution lies in designing algorithms that handle potential novel interactions beyond the collected category space, including unseen objects and novel combinations of seen actions and objects. From the modeling perspective, instead of designing complex multi-stage frameworks, our contribution lies in the design of one-stage architectures that take an image as input and directly produce interaction tuples with a single network. We formulate the task as a multi-task optimization problem and learn all module components with a shared objective function. We show that our methods outperform state-of-the-art HOI detection approaches and can help facilitate the understanding of rich human activities in our visual world.
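The joint-embedding idea in the abstract — placing visual objects and category names in a shared space so that unseen categories can be recognized simply by embedding their names — can be sketched as follows. This is an illustrative toy, not the thesis implementation: the category names, the 4-dimensional embeddings, and the `classify` helper are all hypothetical stand-ins for learned representations.

```python
# Toy sketch of zero-shot recognition via a joint embedding space:
# a visual feature is assigned to the category whose name embedding
# is most similar (cosine similarity). An unseen category needs only
# its name embedding to become recognizable.
import numpy as np

def normalize(v):
    """L2-normalize along the last axis so dot products give cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def classify(visual_feat, name_embeddings, names):
    """Return the category name whose embedding is closest to the visual feature."""
    sims = normalize(name_embeddings) @ normalize(visual_feat)
    return names[int(np.argmax(sims))]

# Hypothetical 4-d embeddings standing in for learned ones;
# imagine "skateboard" was never seen during detector training.
names = ["bicycle", "bottle", "skateboard"]
embs = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0, 0.0]])
feat = np.array([0.1, 0.05, 0.9, 0.2])  # visual feature near "skateboard"
print(classify(feat, embs, names))      # -> skateboard
```

In a real system the name embeddings would come from a pretrained language model and the visual feature from the detector's object branch, trained so that matching pairs align in the shared space.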
author2 Tan Yap Peng
format Thesis-Doctor of Philosophy
author Wang, Suchen
author_sort Wang, Suchen
title Detecting human-object interactions for human activity analysis
publisher Nanyang Technological University
publishDate 2023
url https://hdl.handle.net/10356/165042
_version_ 1772826096986750976
spelling sg-ntu-dr.10356-165042 2023-07-04T16:24:05Z Detecting human-object interactions for human activity analysis Wang, Suchen Tan Yap Peng School of Electrical and Electronic Engineering EYPTan@ntu.edu.sg Engineering::Computer science and engineering Doctor of Philosophy 2023-03-10T03:27:58Z 2023-03-10T03:27:58Z 2023 Thesis-Doctor of Philosophy Wang, S. (2023). Detecting human-object interactions for human activity analysis. Doctoral thesis, Nanyang Technological University, Singapore.
https://hdl.handle.net/10356/165042 10.32657/10356/165042 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University