Learning representations for human re-identification
Format: Theses and Dissertations
Language: English
Published: 2017
Online Access: http://hdl.handle.net/10356/70912
Institution: Nanyang Technological University
Summary: This thesis addresses the problem of Human Re-Identification, the task of associating pedestrians across multiple camera views. Human re-identification is a particularly interesting problem due to its applications in visual surveillance. Given a probe image of a subject, the objective is to identify a set of matching images of the same subject from a gallery set, most of which are captured by a different camera. Instead of manually searching through images captured by various cameras, it is desirable to automate human re-identification, as it can save an enormous amount of manual labor. However, human re-identification is fundamentally a challenging problem due to cluttered backgrounds, ambiguity in visual appearance, and variations in illumination, pose, and viewpoint. The goal of this thesis is to present various feature learning architectures, from different perspectives, to tackle the aforementioned challenges in human re-identification.
Public places are equipped with several thousand surveillance cameras capturing video around the clock. Since no biometric cues such as fingerprints, gait, or faces are accessible from surveillance videos, visual appearance is the main cue for re-identifying pedestrians. Intuitively, color features are an important aspect for distinguishing pedestrians in surveillance videos. However, varying illumination and environmental conditions pose a great challenge, as the perceived color of the subject may vary. In existing research, color features are used as they are, i.e., extracted from raw pixel values or weakly corrected pixels. In the first part of the thesis, an invariant color feature learning framework is presented to efficiently map and encode the weakly corrected pixel values into an invariant space where the representations of similar colors are close to each other.
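To make the idea concrete, below is a minimal sketch, assuming a PyTorch implementation, of learning such an invariant color embedding with a contrastive objective. The network sizes, the margin value, and the pair-generation step are illustrative assumptions, not the thesis's actual model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ColorEncoder(nn.Module):
    """Maps 3-D RGB pixel values into a D-dimensional invariant space."""
    def __init__(self, embed_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, rgb):
        # L2-normalize so distances reflect color identity, not magnitude.
        return F.normalize(self.net(rgb), dim=-1)

def contrastive_loss(z1, z2, same_color, margin=0.5):
    """Pulls embeddings of the same color together, pushes others apart."""
    d = ((z1 - z2).pow(2).sum(dim=-1) + 1e-8).sqrt()
    pos = same_color * d.pow(2)
    neg = (1 - same_color) * F.relu(margin - d).pow(2)
    return (pos + neg).mean()

# Usage (synthetic data): pixel pairs observed under different illumination;
# label is 1 when both pixels show the same underlying color.
enc = ColorEncoder()
rgb_a = torch.rand(32, 3)                     # pixels under camera A
rgb_b = torch.rand(32, 3)                     # corresponding pixels under camera B
labels = torch.randint(0, 2, (32,)).float()
loss = contrastive_loss(enc(rgb_a), enc(rgb_b), labels)
loss.backward()
```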
In the second part of the thesis, contextual information is incorporated into the local features. Conventional features are extracted locally, independently of other regions, and therefore lack the global context of the image, so it is desirable to incorporate contextual information into the local features. To encode such information, a variant of the Recurrent Neural Network architecture, the Long Short-Term Memory (LSTM) cell, is used. The sophisticated gating mechanisms inside the LSTM cells have the flexibility to selectively propagate the relevant contextual information to the rest of the network.
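As a rough illustration, here is a minimal sketch of this idea, assuming PyTorch and local features extracted from horizontal stripes of a pedestrian image; the stripe count and feature dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ContextualLocalFeatures(nn.Module):
    """Runs an LSTM over a top-to-bottom sequence of local stripe features,
    so each local feature absorbs context from the rest of the image
    through the LSTM's input/forget/output gates."""
    def __init__(self, local_dim, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(local_dim, hidden_dim, batch_first=True)

    def forward(self, stripe_feats):
        # stripe_feats: (batch, num_stripes, local_dim), e.g. color/texture
        # descriptors extracted independently per horizontal stripe.
        contextual, _ = self.lstm(stripe_feats)
        return contextual  # (batch, num_stripes, hidden_dim)

# Usage: 6 horizontal stripes, each described by a 48-D local feature.
feats = torch.randn(8, 6, 48)
model = ContextualLocalFeatures(local_dim=48)
out = model(feats)  # each stripe feature now carries global context
```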
To eliminate the need for hand-crafted features, an end-to-end trainable Siamese Convolutional Neural Network (S-CNN) architecture is first proposed. However, in conventional S-CNN architectures, the representations of the images are compared only at the final stage, when the feature representations mature. In this setting, the network is at risk of failing to capture and propagate subtle local patterns that can distinguish positive pairs from hard-negative pairs. Therefore, a novel gating mechanism is modeled to selectively boost and propagate such common patterns from the middle layers to the final layers of the network. Extensive experimental evaluation and comparisons with baseline algorithms demonstrate the effectiveness of the proposed feature learning models.
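One way such a mid-layer gate could look is sketched below in PyTorch; this is a hedged reconstruction, not the thesis's exact formulation. The gate measures where the two branches' feature maps agree and amplifies those locations in both branches; the parameterization and the layer placement are assumptions.

```python
import torch
import torch.nn as nn

class MatchingGate(nn.Module):
    """Compares mid-layer feature maps of the two Siamese branches and
    boosts spatial locations where the maps agree, so subtle shared
    patterns survive into the later layers."""
    def __init__(self, channels):
        super().__init__()
        # Learnable per-channel sharpness of the gate (illustrative choice).
        self.scale = nn.Parameter(torch.ones(1, channels, 1, 1))

    def forward(self, fa, fb):
        # fa, fb: (batch, C, H, W) mid-layer activations of the two views.
        sim = -self.scale * (fa - fb).pow(2)   # high where features agree
        gate = torch.sigmoid(sim)              # gate in (0, 1) per location
        # Boost common patterns in both branches, keeping a residual path.
        return fa * (1 + gate), fb * (1 + gate)

# Usage inside a Siamese CNN: gate the mid-layer outputs of both branches
# before the remaining shared layers (shapes here are assumptions).
gate = MatchingGate(channels=256)
fa = torch.randn(4, 256, 24, 8)
fb = torch.randn(4, 256, 24, 8)
ga, gb = gate(fa, fb)
```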