On shoe attribute prediction and retrieval

Bibliographic Details
Main Author: Zhan, Huijing
Other Authors: Kot, Alex Chichung
Format: Theses and Dissertations
Language: English
Published: 2018
Online Access: http://hdl.handle.net/10356/73509
Institution: Nanyang Technological University
Summary: With the rapid proliferation of the Internet, manually annotating large numbers of objects has become a great challenge, especially in the fashion domain, where massive collections of new products appear every day. Moreover, the huge profits of the online fashion market have motivated multimedia search using vision-based techniques. Therefore, to save human labor, it is essential to develop an automatic tagging system (i.e., an attribute prediction system) for fashion products of varied appearance. The annotated attributes enable online shoppers to retrieve their desired shoes. However, such a system fails on shoe images taken in daily life, because their large visual difference from online-store images leads to unsatisfactory attribute prediction performance. In this thesis, we study shoe attribute prediction and retrieval for both online images and daily-life photos. More specifically, we address in-store shoe retrieval as well as the more challenging problem of cross-scenario shoe retrieval guided by the semantic attributes of shoes. The work in this thesis can be summarized as follows.

First, we propose an in-store shoe image indexing and retrieval system that accepts queries in the form of multi-view shoe images. Given a set of multi-view shoe images, we first identify the viewpoint of each image and select a set of relevant views to estimate the value of each shoe attribute. To effectively predict part-aware attributes, we incorporate prior knowledge of the shoe structure under a given viewpoint to learn a novel view-specific part localization model. Similarly, each shoe in the database is indexed by a list of part-aware shoe attributes. Experimental results demonstrate the effectiveness of the proposed system on a newly built, structured multi-view online shoe dataset.

Second, not limited to attribute prediction and retrieval for online-store shoe images, we relax the constraints to perform the same task on daily-life photos with cluttered backgrounds, different scales, varied viewpoints, etc. We present a novel cross-domain shoe retrieval system that aims to find the exact same shoes given query daily-life shoe photos. More specifically, we propose the Semantic Hierarchy Of attributE Convolutional Neural Network (SHOE-CNN) with a newly designed loss function that systematically merges semantic attributes of closer visual appearance, so that shoe images with obvious visual differences are not matched wrongly. Moreover, a coarse-to-fine, three-level feature representation is developed to effectively match shoe images across domains. Experimental results demonstrate the advantage of each component of the proposed system and a significant improvement over baseline methods.
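The abstract does not specify how the coarse-to-fine, three-level features are formed or matched; the sketch below is only illustrative, assuming pooled activations from three depths of a pretrained ResNet-50 backbone compared by per-level cosine similarity (the layer choices, pooling, and averaging are assumptions, not the thesis's actual design).

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Hypothetical three-level (coarse-to-fine) feature extractor built on a
# pretrained ResNet-50; layer choices and fusion are illustrative assumptions.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()

def three_level_features(x):
    """Pooled, L2-normalised features from three depths of the backbone."""
    x = backbone.maxpool(backbone.relu(backbone.bn1(backbone.conv1(x))))
    x = backbone.layer1(x)
    feats = []
    for block in (backbone.layer2, backbone.layer3, backbone.layer4):
        x = block(x)
        feats.append(F.normalize(F.adaptive_avg_pool2d(x, 1).flatten(1), dim=1))
    return feats  # three levels, each of shape (B, C_level)

@torch.no_grad()
def shoe_similarity(query, candidate):
    """Average per-level cosine similarity between two image batches."""
    q, c = three_level_features(query), three_level_features(candidate)
    return sum((qf * cf).sum(dim=1) for qf, cf in zip(q, c)) / len(q)
```

Under this reading, each database image is indexed by its three feature vectors, and a daily-life query is ranked against the database by the averaged similarity.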
Third, to further enable retrieval of online-store images with different viewpoints, to address failure cases with large viewpoint variation, and at the same time to improve the capability of differentiating fine-grained details, we propose a feature embedding for shoes via a multi-task view-invariant convolutional neural network (MTV-CNN), whose feature activations reflect the inherent similarity between any two shoe images. Specifically, we propose: 1) a weighted triplet loss to reduce the feature distance between the same shoe in different scenarios; 2) a novel view-invariant loss to reduce the ambiguity of feature representations across different views; 3) a novel definition of shoe style based on combinations of part-aware semantic shoe attributes, together with a corresponding style identification loss; and 4) an attribute-based mining process for hard negative and anchor images to distinguish fine-grained differences. Experiments on our newly collected dataset indicate that the system returns not only the exact same or similar shoes but also the same shoes seen from different viewpoints.
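The abstract names a weighted triplet loss but not its exact form; below is a minimal sketch assuming a standard triplet margin loss scaled by a per-triplet weight (the weighting scheme, e.g. upweighting triplets whose anchor and positive come from different scenarios, is a hypothetical stand-in for the thesis's actual design).

```python
import torch
import torch.nn.functional as F

def weighted_triplet_loss(anchor, positive, negative, weights, margin=0.3):
    """Triplet margin loss with a per-triplet weight.

    anchor, positive, negative: (B, D) L2-normalised embeddings, e.g. from an
    MTV-CNN-style network; weights: (B,) per-triplet weights (hypothetical,
    e.g. larger when anchor and positive show the same shoe in different
    scenarios).
    """
    d_ap = (anchor - positive).pow(2).sum(dim=1)  # squared anchor-positive distance
    d_an = (anchor - negative).pow(2).sum(dim=1)  # squared anchor-negative distance
    return (weights * F.relu(d_ap - d_an + margin)).mean()

# Toy usage with random stand-in embeddings.
torch.manual_seed(0)
a, p, n = (F.normalize(torch.randn(8, 128), dim=1) for _ in range(3))
w = torch.ones(8)  # replace with scenario-aware weights
print(weighted_triplet_loss(a, p, n, w).item())
```

One natural reading of the attribute-based mining in 4) is that negatives sharing many part-aware attributes with the anchor are selected as hard examples, so the loss concentrates on fine-grained differences; the abstract does not spell out the procedure, so this too is an assumption.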