Deep neural networks based visual object detection in videos

Bibliographic Details
Main Author: Jin, Ruibing
Other Authors: Lin Guosheng, Wen Changyun
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2020
Subjects: Engineering::Computer science and engineering::Computing methodologies::Pattern recognition; Engineering::Electrical and electronic engineering
Online Access:https://hdl.handle.net/10356/144136
Description

As a fundamental task in computer vision, object detection aims to locate visual objects of pre-defined classes at the instance level. Most existing object detection methods are designed for static images and rely on visual information to accurately locate and recognize objects; their accuracy is affected by object scale, context information, partial occlusion, and so on. Unlike object detection in static images, video object detection must additionally cope with deteriorated object appearances, such as motion blur and objects seen from extreme viewpoints. Under such deteriorated appearances, static-image-based methods perform unsatisfactorily.

To alleviate this issue, many approaches leverage temporal information to enhance the feature map at the current frame and thereby improve detection accuracy. These methods are generally composed of two components: temporal feature aggregation and object detection. In temporal feature aggregation, because features are displaced across frames, one must first align features across frames and then combine the aligned features into an enhanced feature map. Optical flow, which expresses pixel displacement, is widely used in computer vision. Since pixel displacement is not strictly consistent with feature displacement, a common approach is to pre-train a neural network to predict optical flow and then fine-tune the pre-trained network on the task dataset, expecting the fine-tuned network to produce tensors that encode feature displacement. However, due to the domain gap between the optical flow training dataset and the task dataset, these optical-flow-based methods cannot accurately align feature maps across frames, which degrades their performance. Furthermore, the pre-training process requires an additional dataset with optical flow annotations, which makes training complex and time-consuming.

In the object detection component, existing methods operate under full supervision and require bounding-box annotations for every frame. A video dataset usually includes hundreds of videos, and each video involves hundreds of frames, so constructing such a large-scale annotated video dataset is labor-intensive.

To address these issues, this thesis proposes three novel methods. For temporal feature aggregation, we rethink the de facto pre-train-and-fine-tune paradigm and analyze its drawbacks. Based on our analysis, we propose a novel network (IFF-Net) with an In-network Feature Flow estimation (IFF) module that resolves these drawbacks. Without resorting to pre-training on any additional dataset, the IFF module directly produces a feature flow that expresses feature displacement. The IFF module is a shallow module that shares features with the detection branches; this compact design enables IFF-Net to detect objects accurately while maintaining a fast inference speed (the warp-and-aggregate paradigm both families of methods share is sketched below). Additionally, we propose a transformation residual loss (TRL) based on self-supervision, which further improves the performance of IFF-Net.
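To make the warp-and-aggregate paradigm concrete, the following is a minimal PyTorch sketch, not the thesis implementation: a flow field (whether predicted by a pre-trained optical flow network or estimated in-network, as in IFF) warps a neighboring frame's feature map onto the current frame, and the aligned maps are combined into an enhanced feature map. The function names and the scalar aggregation weights are illustrative assumptions; flow-based detectors typically compute adaptive, per-position weights instead.

    import torch
    import torch.nn.functional as F

    def warp_features(feat, flow):
        # feat: (N, C, H, W) feature map from a neighboring frame
        # flow: (N, 2, H, W) displacement in feature-map pixels, channel 0 = x, 1 = y
        n, c, h, w = feat.shape
        ys, xs = torch.meshgrid(torch.arange(h, device=feat.device),
                                torch.arange(w, device=feat.device),
                                indexing="ij")
        base = torch.stack((xs, ys), dim=0).float()          # (2, H, W)
        coords = base.unsqueeze(0) + flow                    # shift each position by the flow
        # normalize coordinates to [-1, 1], as grid_sample expects
        gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
        gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
        grid = torch.stack((gx, gy), dim=-1)                 # (N, H, W, 2)
        return F.grid_sample(feat, grid, align_corners=True)

    def aggregate(cur_feat, nbr_feats, flows, weights):
        # combine the current frame's features with flow-aligned neighbors
        out = weights[0] * cur_feat
        for w_i, feat, flow in zip(weights[1:], nbr_feats, flows):
            out = out + w_i * warp_features(feat, flow)
        return out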
For the object detection component, we propose a video object detection framework under weak supervision, in which only image-level annotations are required, saving considerable labeling cost. Existing weakly supervised object detection methods are designed for images; lacking box-level annotations, they usually cannot accurately locate objects. Considering that an object may move differently from its surrounding objects or background, we leverage motion information to improve detection accuracy. However, the motion pattern of an object is generally complex in videos: different parts of the object may have different motion patterns, and some object parts may move similarly to background regions in some scenarios, which poses great challenges in exploiting motion information for accurate object localization. Directly using motion information to refine object regions may therefore degrade localization performance. To overcome these issues, we propose a Motion Context Network (MC-Net) that effectively leverages motion information to improve object localization in the weakly supervised setting. MC-Net generates motion context features by exploiting neighborhood motion correlation on moving regions; these motion context features are then combined with image information to improve detection accuracy (see the sketch below). Benefiting from the motion context information, our weakly supervised method is able to localize objects accurately.
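The sketch below illustrates the motion context idea in a minimal, hypothetical form; it is not MC-Net's actual architecture. It marks moving regions by flow magnitude, averages the motion field over each position's neighborhood (a simple stand-in for neighborhood motion correlation), and concatenates the result with the image features. The threshold and window size are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def motion_context_features(img_feat, flow, mag_thresh=1.0, k=5):
        # img_feat: (N, C, H, W) image features; flow: (N, 2, H, W) motion field
        # 1) mark moving regions by flow magnitude
        mag = torch.linalg.norm(flow, dim=1, keepdim=True)    # (N, 1, H, W)
        moving = (mag > mag_thresh).float()
        # 2) neighborhood motion context: mean motion over a k x k window,
        #    restricted to moving positions (masked average)
        s = F.avg_pool2d(flow * moving, k, stride=1, padding=k // 2)
        m = F.avg_pool2d(moving, k, stride=1, padding=k // 2)
        ctx = s / m.clamp(min=1.0 / (k * k))                  # avoid divide-by-zero
        # 3) fuse motion context with image information
        return torch.cat((img_feat, ctx), dim=1)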
In addition, weakly supervised object detection methods generally train detectors on a fixed proposal set that is dominated by overwhelmingly many negative samples. Without box-level annotations, it is challenging for these methods to filter out redundant negative samples, so the weakly supervised learning procedure may be biased towards negatives, degrading detection accuracy. To investigate the effect of this imbalance between positive and negative samples during training, we conduct a series of analysis experiments and find that the imbalance heavily hinders detection accuracy. To alleviate this issue, we propose an Online Active Proposal Set Generation (OPG) algorithm that generates an active training proposal set online according to the predictions of the model being trained. The active proposal set generated by OPG maintains a balance between positive and negative samples, which effectively improves detection accuracy.
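A minimal sketch of the balancing step such an algorithm might perform, assuming the current model's per-proposal foreground scores are available; the thresholding criterion and the 1:3 ratio are illustrative assumptions, not the thesis's actual OPG rules. Because the set is rebuilt from the model's latest predictions at every iteration, it is both online and active.

    import torch

    def active_proposal_set(scores, proposals, pos_thresh=0.5, neg_per_pos=3):
        # scores: (P,) foreground scores from the current training model
        # proposals: (P, 4) candidate boxes for one image
        pos = (scores >= pos_thresh).nonzero(as_tuple=True)[0]
        neg = (scores < pos_thresh).nonzero(as_tuple=True)[0]
        # keep only the hardest (highest-scoring) negatives, capped relative
        # to the positives, so negatives cannot overwhelm the training set
        n_keep = neg_per_pos * max(int(pos.numel()), 1)
        if neg.numel() > n_keep:
            hardest = scores[neg].argsort(descending=True)[:n_keep]
            neg = neg[hardest]
        keep = torch.cat((pos, neg))
        return proposals[keep], keep

    # rebuilt every training iteration from the latest predictions:
    # active_boxes, idx = active_proposal_set(model_scores, all_proposals)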

Additional Record Details
Supervisors: Lin Guosheng (gslin@ntu.edu.sg), Wen Changyun (ECYWEN@ntu.edu.sg), School of Electrical and Electronic Engineering
Degree: Doctor of Philosophy
Deposited: 2020-10-15
Citation: Jin, R. (2020). Deep neural networks based visual object detection in videos. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/144136
DOI: 10.32657/10356/144136
Format: application/pdf
License: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).