Tracking and detecting objects in image sequence


Bibliographic Details
Main Author: Wang, Li
Other Authors: Wang, Gang
Format: Theses and Dissertations
Language:English
Published: 2016
Online Access:https://hdl.handle.net/10356/65878
Institution: Nanyang Technological University
Description
Summary: Tracking and detecting arbitrary objects are important in many applications such as video surveillance, video analytics and human-machine interaction. Although many promising methods have been proposed in this area, it remains very challenging to track and detect arbitrary objects due to issues such as complicated motion transformations and occlusions. In this thesis, four pieces of work are developed to address these problems in tracking and detecting objects.

The first piece of work addresses learning hierarchical features for visual object tracking by using deep learning. Previously, raw pixel values or hand-crafted features were used to represent target objects. However, these representations cannot handle large appearance variations of arbitrary target objects. Recently, deep learning has achieved very promising results in speech recognition and image classification. Nevertheless, it is non-trivial to apply deep learning to visual object tracking: deep neural networks usually require large amounts of training data to learn their many parameters, whereas in visual object tracking annotations of a target object are available only in the first frame of a test sequence. To solve this problem, a feature learning algorithm based on domain adaptation is proposed for visual object tracking. First, hierarchical features are learned from auxiliary video sequences by using a two-layer neural network. Embedding a temporal slowness constraint into the stacked network architecture makes the learned features robust to complicated motion transformations, which is important for visual object tracking. Then, given a target image sequence, a domain adaptation module adapts the pre-learned features to the specific target object. The adaptation is conducted in both layers of the neural network to incorporate appearance information of the specific target object.
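The intuition behind the temporal slowness constraint can be illustrated with a minimal sketch: penalize how much a feature representation changes between consecutive frames. The `slowness_loss` function below is a hypothetical toy illustration, not the actual objective or network from the thesis, which are more involved.

```python
import numpy as np

def slowness_loss(features):
    """L2 penalty on feature changes between consecutive frames.

    features: (T, D) array, one D-dimensional feature vector per frame.
    A small value means the representation varies slowly over time,
    which is the idea behind the temporal slowness constraint
    (a sketch only; the thesis embeds this into a stacked network).
    """
    diffs = features[1:] - features[:-1]           # (T-1, D) frame-to-frame changes
    return float(np.mean(np.sum(diffs ** 2, axis=1)))

# A slowly varying feature sequence incurs a smaller penalty than a jumpy one.
slow = np.linspace(0.0, 1.0, 10).reshape(-1, 1) * np.ones((1, 4))
fast = np.random.default_rng(0).normal(size=(10, 4))
assert slowness_loss(slow) < slowness_loss(fast)
```

In a real network this penalty would be added to the reconstruction or classification objective during pre-training on the auxiliary video sequences.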
As a result, the learned hierarchical features are robust to both complicated motion transformations and appearance changes of target objects. Experimental results demonstrate that the learned hierarchical features yield significant improvement, especially on image sequences with complicated motion transformations.

As an extension of the first work, the second piece of work focuses on learning hierarchical features for multiple object tracking. First, generic features can be pre-learned from auxiliary data. Then, it is straightforward to adapt the pre-learned features to each target object independently. In this way, however, the adaptation module for each target object uses only its own annotation data, which is still limited for feature adaptation. To solve this problem, a joint learning strategy is proposed that uses the annotation data of all target objects to conduct feature adaptation jointly. As a result, the feature adaptation module achieves better performance since it makes use of more training data. The proposed joint learning strategy simultaneously learns common features shared by all target objects and individual features for each object. Experimental results demonstrate that the learned hierarchical features significantly improve multiple object tracking performance.

The third piece of work concerns learning hierarchical features for visual object tracking by using a neural network different from the one used in the first and second works. As mentioned above, training data is scarce in visual object tracking, so the deep neural networks in the first and second works pre-learn their parameters from auxiliary data. However, this kind of pre-training is inconvenient for visual object tracking.
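The common/individual split in the joint learning strategy can be sketched with a simple linear model: each object's parameters decompose into a shared component, learned from the pooled annotation data of all objects, plus a per-object component. The `joint_adapt` function below is a hypothetical linear illustration of this idea in spirit only, not the actual adaptation model from the thesis.

```python
import numpy as np

def joint_adapt(tasks, lam=0.1, lr=0.05, steps=500):
    """Toy joint adaptation: object k gets weights w_k = shared + own[k].

    tasks: list of (X_k, y_k) annotation data, one pair per target object.
    Every object contributes gradients to the shared component, so it is
    effectively trained on the pooled data, while each per-object
    component sees only its own annotations.
    """
    d = tasks[0][0].shape[1]
    shared = np.zeros(d)
    own = [np.zeros(d) for _ in tasks]
    for _ in range(steps):
        g_shared = lam * shared
        for k, (X, y) in enumerate(tasks):
            r = X @ (shared + own[k]) - y           # residual for object k
            g = X.T @ r / len(y)
            g_shared += g                           # shared part sees every object
            own[k] -= lr * (g + lam * own[k])       # individual part: own data only
        shared -= lr * g_shared
    return shared, own
```

Because the shared component accumulates gradients from all objects, it benefits from more training data than any single object's adaptation would, which is the motivation stated above.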
In this work, a feature learning algorithm is proposed that learns hierarchical features for visual object tracking using tree-structure-based Recursive Neural Networks (RNN), which have fewer parameters and do not require any pre-training on auxiliary data. First, RNN parameters are learned to discriminate between target and background based only on the target annotation in the first frame of a test sequence. A tree structure over local patches of an exemplar region is generated randomly by a bottom-up greedy search strategy. Given the learned RNN parameters, two dictionaries, one for target regions and one for the corresponding local patches, are created from the hierarchical features at the top and leaf nodes of multiple random trees. In each subsequent frame, sparse dictionary coding is conducted on all candidates, and the best candidate is selected as the new target location. In addition, the two dictionaries are updated online to handle appearance changes of target objects. Experimental results demonstrate that the proposed feature learning algorithm significantly improves tracking performance on benchmark datasets.

The last piece of work investigates background subtraction based object detection, in which foreground regions are obtained by subtracting background models from input images. Sometimes a foreground region contains more than one object, and it is difficult to separate the occluded objects based on color images alone. In this work, an occlusion handling method is proposed that separates occluded objects in foreground regions based on "3-D" data comprising image plane coordinates and depth values. First, foreground regions are obtained by background subtraction on the depth data. Then, a "split-merge" approach is proposed that over-segments a foreground region into subregions and then clusters them into a number of object regions using the proposed boundary based similarity metric.
As a result, occluded objects in a foreground region can be effectively detected. Experimental results demonstrate that the proposed occlusion handling method can significantly improve object detection performance.
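The split-merge idea can be sketched on a single depth scan line: over-segment the foreground run into small chunks (split), then merge adjacent chunks whose shared boundary shows no large depth discontinuity. The `split_merge_1d` function and its thresholds below are hypothetical simplifications, a crude 1-D stand-in for the boundary-based similarity metric and 2-D segmentation described above.

```python
import numpy as np

def split_merge_1d(depth, fg, chunk=4, gap=0.3):
    """Toy 1-D split-merge on a depth scan line.

    depth: per-pixel depth values; fg: boolean foreground mask obtained
    by background subtraction on the depth data. The foreground run is
    over-segmented into fixed-size chunks (split), and adjacent chunks
    are merged unless the depth jump at their boundary exceeds `gap`.
    Returns the number of detected objects in the foreground region.
    """
    idx = np.flatnonzero(fg)
    if idx.size == 0:
        return 0
    # split: over-segment the foreground run into small chunks
    chunks = [idx[i:i + chunk] for i in range(0, idx.size, chunk)]
    objects = 1
    for a, b in zip(chunks[:-1], chunks[1:]):
        # merge unless the boundary depth discontinuity is large
        if abs(depth[b[0]] - depth[a[-1]]) > gap:
            objects += 1
    return objects

# Two occluded objects (depths 1.0 and 2.0) form one foreground blob
# against a background at depth 5.0; the depth boundary separates them.
depth = np.array([5.0] * 3 + [1.0] * 8 + [2.0] * 8 + [5.0] * 3)
print(split_merge_1d(depth, depth < 4.0))  # -> 2
```

A color-only detector would see a single connected foreground blob here; the depth discontinuity is what makes the two occluded objects separable.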