On shoe attribute prediction and retrieval

Bibliographic Details
Main Author: Zhan, Huijing
Other Authors: Kot, Alex Chichung (School of Electrical and Electronic Engineering)
Format: Theses and Dissertations
Language: English
Published: 2018
Subjects: DRNTU::Engineering::Electrical and electronic engineering
Online Access:http://hdl.handle.net/10356/73509
Institution: Nanyang Technological University
Description

With the rapid proliferation of the Internet, manually annotating large numbers of objects has become a great challenge, especially in the fashion domain, where massive collections of new products appear every day. Moreover, the huge profits of the online fashion market have motivated multimedia search using vision-based techniques. Therefore, to save human labor, it is essential to develop an automatic tagging system (i.e., an attribute prediction system) for fashion products of widely varying appearance. The annotated attributes then enable online shoppers to retrieve their desired shoes. However, such a system fails on shoe images taken in daily life, because their large visual difference from online-store images leads to unsatisfactory attribute prediction performance. In this thesis, we study shoe attribute prediction and retrieval for online images as well as daily-life photos. More specifically, we address the problem of in-store shoe retrieval as well as the more challenging issue of cross-scenario shoe retrieval guided by the semantic attributes of shoes. The works in this thesis can be summarized as follows.

We first propose an in-store shoe image indexing and retrieval system that accepts queries in the form of multi-view shoe images. Given a set of multi-view shoe images, we identify the viewpoint of each image and select the relevant views to estimate the value of each shoe attribute. To effectively predict part-aware attributes, we incorporate prior knowledge of the shoe structure under a given viewpoint to learn a novel view-specific part localization model. Similarly, each shoe in the database is indexed by a list of part-aware shoe attributes. Experimental results demonstrate the effectiveness of the proposed system on a newly built structured multi-view online shoe dataset.

Going beyond attribute prediction and retrieval for online-store shoe images, we relax these constraints and perform the same task on daily-life photos with cluttered backgrounds, different scales, varied viewpoints, etc. We present a novel cross-domain shoe retrieval system that aims to find the exact same shoes given a daily-life query photo. More specifically, we propose the Semantic Hierarchy Of attributE Convolutional Neural Network (SHOE-CNN) with a newly designed loss function that systematically merges semantic attributes of closer visual appearance, so that shoe images with obvious visual differences are not matched wrongly. Moreover, a coarse-to-fine three-level feature representation is developed to effectively match shoe images across different domains. The experimental results demonstrate the advantage of each component of our proposed system and a significant improvement over baseline methods.
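The coarse-to-fine, three-level feature representation is described only at a high level in the abstract. As a minimal sketch of one plausible reading, the code below pools features from three depths of a standard torchvision ResNet-50 into a single embedding and ranks an online-store gallery against daily-life queries; the backbone, stage choices, and all names are our assumptions, not the thesis's actual architecture.

```python
# One plausible reading of a coarse-to-fine, three-level feature
# representation for cross-domain shoe matching. Backbone, stage choices,
# and names are illustrative assumptions, not the thesis's design.
import torch
import torch.nn.functional as F
import torchvision

backbone = torchvision.models.resnet50(weights="IMAGENET1K_V2").eval()
feats = {}

def save_pooled(name):
    def hook(module, inputs, output):
        feats[name] = output.mean(dim=(2, 3))       # global-average-pool H x W
    return hook

# Three network depths, giving complementary levels of detail.
backbone.layer2.register_forward_hook(save_pooled("low"))
backbone.layer3.register_forward_hook(save_pooled("mid"))
backbone.layer4.register_forward_hook(save_pooled("high"))

@torch.no_grad()
def three_level_embedding(images):                  # images: (N, 3, 224, 224)
    backbone(images)
    # Normalize each level so no single stage dominates the similarity.
    levels = [F.normalize(feats[k], dim=1) for k in ("low", "mid", "high")]
    return torch.cat(levels, dim=1)                 # (N, 512 + 1024 + 2048)

# Cross-domain retrieval: daily-life queries against an online-store gallery.
query = three_level_embedding(torch.randn(2, 3, 224, 224))    # stand-in photos
gallery = three_level_embedding(torch.randn(5, 3, 224, 224))  # stand-in catalog
ranked = (query @ gallery.T).argsort(dim=1, descending=True)  # best match first
```

Per-level L2 normalization here is a design guess: it keeps any one stage from dominating the concatenated distance, which seems consistent with the stated goal of matching across domains.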
To further enable retrieval of online-store images under different viewpoints, address the failure cases caused by large viewpoint variation, and at the same time improve the capability of differentiating fine-grained details, we propose a feature embedding for shoes via a multi-task view-invariant convolutional neural network (MTV-CNN), whose feature activations reflect the inherent similarity between any two shoe images. Specifically, we propose 1) a weighted triplet loss to reduce the feature distance between the same shoe in different scenarios; 2) a novel view-invariant loss to reduce the ambiguity of feature representations across different views; 3) a novel definition of shoe style based on combinations of part-aware semantic shoe attributes, together with a corresponding style identification loss; and 4) an attribute-based hard negative and anchor image mining process to distinguish fine-grained differences. Experiments conducted on our newly collected dataset indicate that the system returns not only the exact same or visually similar shoes but also the same shoes under different viewpoints.
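The abstract gives no formula for the weighted triplet loss in 1), so the following PyTorch sketch is only a minimal, hypothetical reading in which triplets that violate the margin by more receive larger weights; the function name, weighting scheme, and margin value are our assumptions.

```python
# Hypothetical weighted triplet loss: anchors are daily-life photos,
# positives are online-store images of the same shoe, negatives are
# different shoes (e.g., attribute-based hard negatives). The weighting
# scheme below (harder triplets get larger weights) is our assumption.
import torch
import torch.nn.functional as F

def weighted_triplet_loss(anchor, positive, negative, margin=0.3):
    d_pos = F.pairwise_distance(anchor, positive)   # same shoe, cross-scenario
    d_neg = F.pairwise_distance(anchor, negative)   # different shoe
    violation = F.relu(d_pos - d_neg + margin)      # standard triplet hinge
    # Re-weight so triplets that violate the margin most dominate the loss.
    weights = torch.softmax(violation.detach(), dim=0)
    return (weights * violation).sum()

# Toy usage with random 128-D embeddings:
a, p, n = (torch.randn(8, 128, requires_grad=True) for _ in range(3))
weighted_triplet_loss(a, p, n).backward()
```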

Degree: Doctor of Philosophy (EEE)
Type: Doctoral thesis
Citation: Zhan, H. (2018). On shoe attribute prediction and retrieval. Doctoral thesis, Nanyang Technological University, Singapore.
DOI: 10.32657/10356/73509
Physical Description: 165 p., application/pdf