Deep neural network approach to predict actions from videos

Deep convolutional neural networks have lately dominated scene understanding tasks, particularly those pertaining to still images. Recently, these networks have been adapted and employed for action recognition in videos, but the improvements over traditional methods are not as drastic as those seen on still images. This can be attributed to a lack of focus on modeling the inherent temporal dependency between the frames of a video. In this work, we investigate the various approaches that have been proposed for this task and examine the importance of different aspects of the network, such as the input pipeline, frame aggregation methods, and loss functions. Moreover, we incorporate a Long Short-Term Memory (LSTM) layer into some of these approaches to better model the temporal dependency between frames. The LSTM is appealing because it can model sequences of variable length, unlike purely convolutional approaches, which require a uniform input structure. We also explore the importance of different input modalities. In still-image classification the only input stream is RGB images, but for videos one can also extract the dense optical flow between frames to highlight areas of major motion. We therefore run experiments on both modalities and determine the best ways to fuse the scores from the two streams. These ideas are validated through multiple experiments with different architectures on the UCF-101 benchmark dataset, attaining results that are competitive with various state-of-the-art approaches. Through these modifications, we gained a maximum performance improvement of 6% on one of the architectures, increased the efficiency of another by over 25%, and validated many more ideas that offer comparable performance.
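The score fusion between the RGB and optical-flow streams described above can be sketched as a weighted average of each stream's softmax probabilities (late fusion). This is a minimal illustrative sketch, not the thesis's actual implementation: the function names and the 1.5 weight on the flow stream are assumptions (a common heuristic in two-stream work), not values taken from this project.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the class axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fuse_scores(rgb_logits, flow_logits, flow_weight=1.5):
    # Late fusion: convert each stream's logits to class probabilities,
    # then take a weighted average. flow_weight is a hypothetical value.
    p_rgb = softmax(rgb_logits)
    p_flow = softmax(flow_logits)
    return (p_rgb + flow_weight * p_flow) / (1.0 + flow_weight)

# Toy example with 3 action classes: the RGB stream prefers class 0,
# the flow stream prefers class 1; the weighted fusion decides.
rgb = np.array([2.0, 0.5, 0.1])
flow = np.array([0.2, 1.8, 0.4])
probs = fuse_scores(rgb, flow)
pred = int(np.argmax(probs))  # here the up-weighted flow stream wins
```

Other fusion choices (averaging logits instead of probabilities, or learning the weight jointly) fit the same skeleton by swapping the body of `fuse_scores`.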


Bibliographic Details
Main Author: Garg, Utsav
Other Authors: Jagath C. Rajapakse
Format: Final Year Project (FYP)
Language: English
Published: 2018
Degree: Bachelor of Engineering (Computer Science)
Extent: 53 p.
Subjects: DRNTU::Engineering
Online Access: http://hdl.handle.net/10356/74085
Institution: Nanyang Technological University