Interpreting models for video action recognition
Saved in:
Format: Final Year Project
Language: English
Published: Nanyang Technological University, 2021
Online Access: https://hdl.handle.net/10356/148367
Institution: Nanyang Technological University
Summary: Action recognition is the task of identifying human actions in videos. This has been a long-standing challenge in computer vision. Earlier methods relied on hand-crafted features and traditional machine learning algorithms to solve this task. In the past decade or so, deep learning has replaced these early methods.
Traditional machine learning models such as decision trees are easier to interpret than complex deep neural networks. Deep learning gained popularity in the early 2010s thanks to its strong performance on complex tasks such as action recognition. However, because of the complex inner workings of deep neural networks, interpreting these models has become more challenging than ever.
In this project, we study how to interpret deep neural networks for action recognition. To do so, we perform network dissection on a model trained on the UCF-101 [1] dataset for the action recognition task. The focus is on systematically identifying the semantics of individual hidden units within the model, and then understanding each unit's role based on the visual concepts it captures.
Specifically, we analyze the change in the network's accuracy in classifying each action when a unit is eliminated, in order to determine the importance of each unit for each action. We then discuss the impact on the network's accuracy of removing units that are important or irrelevant for each class. We find that the network relies on salient objects or cues to classify an action: for example, in our experiment the network relies on surrounding objects such as a carpet to detect the BabyCrawling action.
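The unit-elimination procedure described above can be sketched in miniature as follows. This is an illustrative toy, not the project's UCF-101 model: the tiny two-layer network, its weights, and the names `predict` and `ablate_unit` are all assumptions introduced for the example. The idea is the same, though: zero out one hidden unit and measure the resulting drop in accuracy as that unit's importance.

```python
import numpy as np

# Toy stand-in for one layer of an action-recognition network.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))           # 200 "clips", 16 features each
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy binary "action" labels

W1 = rng.normal(size=(16, 8))            # hidden layer with 8 units
W2 = rng.normal(size=(8, 2))             # output layer, 2 "actions"

def predict(X, ablate_unit=None):
    """Forward pass; optionally 'eliminate' one hidden unit by zeroing it."""
    h = np.maximum(X @ W1, 0.0)          # ReLU hidden activations
    if ablate_unit is not None:
        h[:, ablate_unit] = 0.0          # remove the unit's contribution
    return (h @ W2).argmax(axis=1)

base_acc = (predict(X) == y).mean()

# Importance of unit u = accuracy drop when u is ablated.
importance = {u: base_acc - (predict(X, ablate_unit=u) == y).mean()
              for u in range(8)}
```

Units with a large positive importance score are the ones the network depends on for the task; in the full model the same measurement is taken per action class, which is how class-specific cues such as the carpet for BabyCrawling surface.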