Egocentric activities of daily living recognition application using Android platform

Bibliographic Details
Main Author: Canlas, Reich Rechner D.
Format: text
Language: English
Published: Animo Repository 2018
Subjects:
Online Access:https://animorepository.dlsu.edu.ph/etd_masteral/5632
Institution: De La Salle University
Description
Summary: Human action recognition (HAR) systems determine the action being performed by a person using a variety of algorithms. Recent applications in the domain have leveraged the compactness of smartphones as well as their sensing and processing capabilities. These studies have primarily depended on motion inputs captured by either a camera or an array of sensors; rarely in the literature have both camera and mechanical sensor signals been used simultaneously in an HAR system. Taking all of this into account, this study aims to develop an HAR application that uses both first-person-perspective camera and sensor inputs on a device running the Android platform, in order to improve upon existing egocentric HAR systems in terms of efficiency, portability, and accuracy. Four input streams were considered: camera, accelerometer, gyroscope, and magnetometer. Each stream was fed into a combination of one- and two-dimensional Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) Recurrent Neural Networks. All of these streams and networks operate in parallel, and their individual classifications were fed into a fully connected late-fusion network. The accuracy and other metrics of the system were evaluated on a selection of actions from both a reference dataset and a new dataset generated from paired video and sensor data. Results showed that each network was effective at recognizing the actions considered within the scope of this study, and even more so when their outputs were fused.
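
The architecture described in the abstract (one CNN-LSTM classifier per input stream, with the per-stream class scores merged by a fully connected late-fusion network) can be illustrated with a minimal sketch. The record does not specify a framework or any hyperparameters, so the Keras code below is an assumption for illustration only: the clip shape, sensor window length, number of classes, and layer sizes are placeholder values, not the thesis's actual configuration.

```python
# Illustrative sketch of the described multi-stream late-fusion network (assumed
# framework: Keras; all sizes below are placeholders, not values from the thesis).
import tensorflow as tf
from tensorflow.keras import layers, Model

NUM_CLASSES = 5                    # assumed number of action classes
FRAMES, H, W, C = 16, 64, 64, 3    # assumed egocentric video clip shape
SENSOR_STEPS = 100                 # assumed sensor window length (samples)

def video_stream():
    """Camera stream: 2D CNN applied per frame, then an LSTM over time."""
    inp = layers.Input(shape=(FRAMES, H, W, C))
    x = layers.TimeDistributed(layers.Conv2D(16, 3, activation="relu"))(inp)
    x = layers.TimeDistributed(layers.MaxPooling2D())(x)
    x = layers.TimeDistributed(layers.Flatten())(x)
    x = layers.LSTM(64)(x)
    out = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return inp, out

def sensor_stream():
    """Inertial stream (accelerometer, gyroscope, or magnetometer): 1D CNN + LSTM."""
    inp = layers.Input(shape=(SENSOR_STEPS, 3))   # 3-axis sensor window
    x = layers.Conv1D(32, 5, activation="relu")(inp)
    x = layers.MaxPooling1D()(x)
    x = layers.LSTM(32)(x)
    out = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return inp, out

cam_in, cam_out = video_stream()
acc_in, acc_out = sensor_stream()
gyr_in, gyr_out = sensor_stream()
mag_in, mag_out = sensor_stream()

# Late fusion: the per-stream classifications feed a fully connected fusion network.
fused = layers.concatenate([cam_out, acc_out, gyr_out, mag_out])
fused = layers.Dense(64, activation="relu")(fused)
final = layers.Dense(NUM_CLASSES, activation="softmax")(fused)

model = Model(inputs=[cam_in, acc_in, gyr_in, mag_in], outputs=final)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```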