Pose-invariant action recognition for automated behaviour analysis

Bibliographic Details
Main Author: Manoj Ramanathan
Other Authors: Yau Wei Yun
Format: Theses and Dissertations
Language: English
Published: 2017
Online Access: http://hdl.handle.net/10356/70099
Institution: Nanyang Technological University
Description
Summary: Computer vision deals with providing visual capabilities to a computer so that it can understand its surrounding environment. This has given rise to numerous applications such as human-computer interaction, object detection, scene understanding and surveillance. Understanding human behaviour is a difficult task due to its fickle nature. A person’s action is one of the visible clues to that person’s behaviour or intention. For this reason, research in action recognition has gained momentum in recent years. Actions can mainly be characterized by the motion of body parts and the evolution of the person's pose. In an unconstrained setting, achieving a clear and distinct action representation is very difficult because of pose variations caused by changes in camera view angle, occlusion of body parts, cluttered backgrounds and so on. Also, traditional action recognition datasets portray actions in which subjects are mainly in upright postures, such as walking, running and boxing. There is a lack of study on action recognition in non-upright postures. Thus, in this research, we develop a novel pose-invariant action recognition framework that combines both the motion of body parts and the pose of the person in a mutually reinforcing manner. The proposed method is generalized to handle possible human posture variations (including non-upright postures). The assumption made is that the neck point of the person and the major viewing direction of the person during motion are available in the frame. Our pose-invariant action recognition framework is inspired by the closed-loop framework of control system theory and has two components, namely a pose-invariant motion feature extraction component (forward path) and a canonical pose component (feedback path). The output of each component is used as input to the other, reinforcing the initial prediction in a manner that improves the overall final result.
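The closed-loop arrangement described above can be sketched as a simple alternation between the two paths. This is a minimal illustration under stated assumptions, not the thesis implementation: the callables `forward_path` and `feedback_path`, the `max_iters` cap and the stopping rule (the action label no longer changes between passes) are all hypothetical names and choices introduced here.

```python
def recognize(frames, forward_path, feedback_path, max_iters=10):
    """Alternate forward and feedback paths until the action label is stable.

    forward_path(frames, grids)   -> action label (hypothetical callable)
    feedback_path(frames, action) -> realigned grids (hypothetical callable)
    """
    grids = None   # no pose-based grid alignment on the first pass
    action = None
    for _ in range(max_iters):
        # Forward path: motion features give an action hypothesis.
        new_action = forward_path(frames, grids)
        if new_action == action:
            break  # assumed convergence test: the hypothesis did not change
        action = new_action
        # Feedback path: the hypothesised action selects a canonical pose,
        # which is used to realign the grids for the next forward pass.
        grids = feedback_path(frames, action)
    return action
```

With stub callables, the loop stops as soon as two consecutive forward passes agree; `max_iters` only guards against hypotheses that keep alternating.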
The two components of our proposed action recognition framework are:
• Pose-Invariant Motion Component or Propagation Motion Forward (PMF) Path: In this component, we aim to capture the motion of the different body parts that characterizes an action. To achieve this, a propagation mechanism divides the region of interest into 3 grids corresponding to the head, torso and legs respectively, based on the estimated body orientation of the person in the frame. To capture the motion in each grid, we use a set of kinematic features derived from optical flow and encode them in a pose-invariant manner by converting them to a human-body-centric space. Using this initial set of extracted invariant kinematic motion features, we provide an initial hypothesis of possible actions.
• Canonical Pose Component or Canonical Pose Feedback (CPF) Path: The initial action recognized from the invariant kinematic motion features is used as input by this component. Each action is characterized by specific poses, which we call canonical stick poses. Given an action and the available training videos, we propose an algorithm to extract normalized canonical stick poses that can be compared with the test videos. For comparing these stick poses with the visual data, we propose a novel pose hypothesis generation scheme that compares each of the extracted canonical sticks of the action recognized by the first component with the video frame to identify the most likely canonical stick pose in it. The identified canonical stick pose in the frame is used to improve the pose-invariant motion feature extraction by realigning the grids used in the first component, and also to compute another kinematic motion feature of each body part with respect to the center of that body part. To capture the temporal dynamics of the action, the identified stick poses are used to compute a new set of temporal stick features.
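The idea of encoding motion in a human-body-centric space can be illustrated with a small sketch. It assumes the body orientation is given as an angle (in image coordinates, pointing from head to feet), that the neck point marks the head/torso boundary, and that the torso spans a fixed fraction of the body length. The function names, the `torso_frac` value and these conventions are hypothetical and are not taken from the thesis.

```python
import math

def to_body_frame(vec, body_angle):
    """Express an image-space vector as (along-body, across-body) components.

    body_angle is the direction of the head-to-feet axis in radians
    (assumed convention).
    """
    c, s = math.cos(body_angle), math.sin(body_angle)
    x, y = vec
    # Rotation by -body_angle aligns the body axis with the first coordinate.
    return (c * x + s * y, -s * x + c * y)

def grid_of(point, neck, body_angle, body_length, torso_frac=0.4):
    """Assign a pixel to the head, torso or legs grid (illustrative split)."""
    offset = (point[0] - neck[0], point[1] - neck[1])
    along, _ = to_body_frame(offset, body_angle)
    if along < 0:
        return "head"   # on the head side of the neck point
    if along < torso_frac * body_length:
        return "torso"
    return "legs"

# A flow vector pointing towards the feet has the same body-centric
# coordinates whether the person is upright or lying down.
upright = to_body_frame((0.0, 1.0), math.pi / 2)  # image y axis points down
lying = to_body_frame((1.0, 0.0), 0.0)
```

Because both the grid assignment and the flow vectors are expressed relative to the body axis, the same motion yields the same features regardless of how the body is oriented in the image.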
The new set of pose-invariant motion features extracted from both the forward and feedback paths using the realigned grids serves as the final action representation. Using these features, the action hypothesis is refined until it converges to yield the recognized action.
In our canonical pose feedback path, to compare each canonical stick pose with the test frame, we propose a canonical pose hypothesis scheme that uses a body part detector as its basis. Most available body part detectors or person detectors work on the assumption that the person is in an upright posture. Therefore, we propose our own body part detector that can work even when the person is in a non-upright posture. In our body part detector, we do not impose any connectivity or shape constraints between the parts, which allows us to detect body parts in non-upright postures. We divide the human body into 4 body parts: head, torso, arm and leg. For detecting these body parts, training images were collected from the web using a Python crawler. Using the training images, body part classifiers were trained. Based on Bayes’ theorem for conditional probability, we propose an algorithm that uses the output of these classifiers to compute a likelihood score for each of the detected body parts in the frame. This likelihood score is used by the pose hypothesis generation scheme of the canonical pose component.
To test our method’s effectiveness under non-upright postures, we introduce a new action dataset comprising 8 actions in different postures captured from 3 views. The dataset comprises visual data for 35 subjects. We have conducted several experiments on publicly available benchmark action datasets such as Weizmann, KTH, UCF Sports, Hollywood and HMDB51, and on this new dataset, to test the effectiveness of our recognition framework.
The results show that the proposed reinforcement framework for action recognition is pose-invariant, partially view-invariant and is able to work even if there is partial occlusion of the person performing the action.
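The likelihood scoring of detected body parts, which the summary describes as based on Bayes’ theorem, can be illustrated with a generic posterior computation. This is only a sketch of the theorem itself: the function name, the use of classifier outputs as class-conditional likelihoods and the uniform priors in the usage line are assumptions made here, and the algorithm in the thesis may differ.

```python
def part_posteriors(likelihoods, priors):
    """Apply Bayes' theorem: P(part | detection) for each candidate part.

    likelihoods[p] ~ P(detection | part p), e.g. a classifier output
    (assumption); priors[p] ~ P(part p).
    """
    evidence = sum(likelihoods[p] * priors[p] for p in likelihoods)
    if evidence == 0:
        raise ValueError("all candidate parts have zero probability")
    return {p: likelihoods[p] * priors[p] / evidence for p in likelihoods}

# Hypothetical classifier outputs for one detected region, uniform priors.
scores = part_posteriors(
    {"head": 0.8, "torso": 0.1, "arm": 0.05, "leg": 0.05},
    {"head": 0.25, "torso": 0.25, "arm": 0.25, "leg": 0.25},
)
```

The resulting scores sum to one over the four parts, so they can be compared directly when ranking competing pose hypotheses.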