Distance metric learning for multi-modal image retrieval and annotation
Main Author: Wu, Pengcheng
Other Authors: Hoi Chu Hong
Format: Theses and Dissertations
Language: English
Published: 2014
Subjects: DRNTU::Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Online Access: http://hdl.handle.net/10356/60499
Institution: Nanyang Technological University
id: sg-ntu-dr.10356-60499
record_format: dspace
institution: Nanyang Technological University
building: NTU Library
continent: Asia
country: Singapore
content_provider: NTU Library
collection: DR-NTU
language: English
topic: DRNTU::Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
description:
With the rapid growth of digital cameras and photo-sharing websites, content-based image retrieval (CBIR) and search-based image annotation have become important techniques for many real-world multimedia applications. They remain open challenges today, despite having been studied extensively for decades in several communities, including multimedia, signal processing, and computer vision. One key challenge of CBIR is to find an effective similarity search scheme that accurately retrieves a short list of the most similar images from a massive image collection. Conventional CBIR approaches usually adopt rigid measures of image similarity, such as the classical Euclidean distance or cosine similarity, which are often limited despite being widely used. In this thesis, we investigate Distance Metric Learning (DML) techniques to improve visual similarity search in multimedia information retrieval tasks. In particular, we propose three novel machine learning algorithms to tackle the challenges of content-based image retrieval and search-based image annotation.
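For intuition, DML methods of this kind typically learn a Mahalanobis-style metric, i.e., a positive semi-definite matrix M that reshapes the feature space so that semantically similar images fall closer together. A minimal Python sketch (illustrative only, not code from the thesis) of how such a learned metric generalizes the Euclidean distance:

```python
import numpy as np

def mahalanobis_distance(x, y, M):
    """Distance between feature vectors x and y under a learned
    metric M, which must be positive semi-definite."""
    diff = x - y
    return np.sqrt(diff @ M @ diff)

# With M = I this reduces to the rigid Euclidean distance the thesis
# argues against; DML replaces I with a matrix learned from data.
x, y = np.array([1.0, 2.0]), np.array([2.0, 0.0])
print(mahalanobis_distance(x, y, np.eye(2)))  # sqrt(5), about 2.236
```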
Firstly, we present a novel Unified Distance Metric Learning (UDML) scheme for mining social images towards automated image annotation. To discover knowledge from social images, which typically combine visual content with textual tags, UDML not only exploits both the visual and the textual content of social images, but also unifies inductive and transductive metric learning techniques in a systematic learning framework. The UDML task is formulated as a convex optimization problem, namely a semi-definite program (SDP), which is in general difficult to solve at scale. To overcome this challenge, we develop an efficient stochastic gradient descent (SGD) algorithm and prove its convergence. By applying UDML to the search-based image annotation task on a large real-world testbed, we demonstrate that the proposed algorithm is empirically promising for mining social images in real applications.
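The abstract does not spell out the update rule, but SGD solvers for metric-learning SDPs commonly alternate a gradient step on a pairwise loss with a projection back onto the positive semi-definite cone. A generic sketch of that pattern follows; the hinge loss, margin, and learning rate are assumptions, while the exact UDML objective and update are defined in the thesis:

```python
import numpy as np

def project_psd(M):
    """Project a symmetric matrix onto the PSD cone by clipping
    negative eigenvalues -- the step that maintains the SDP constraint."""
    M = (M + M.T) / 2.0
    w, V = np.linalg.eigh(M)
    return (V * np.clip(w, 0.0, None)) @ V.T

def sgd_metric_step(M, xi, xj, y, lr=0.01, margin=1.0):
    """One projected-SGD step on a pairwise hinge loss: y = +1 marks a
    similar pair (pull together), y = -1 a dissimilar pair (push apart).
    A generic sketch, not the thesis's exact UDML update."""
    d = xi - xj
    dist2 = d @ M @ d
    if y * (dist2 - margin) > 0:  # the margin is violated for this pair
        M = project_psd(M - lr * y * np.outer(d, d))
    return M
```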
Secondly, we investigate a novel scheme of online multi-modal distance metric learning (OMDML), which aims to learn distance metrics from multi-modal data, i.e., multiple types of features, via an efficient and scalable online learning scheme. Traditional DML approaches typically adopt a single-modal framework that learns a distance metric either on a single type of feature or on a combined feature space in which the different feature types are simply concatenated. In contrast, OMDML explores a unified two-level online learning scheme that (i) learns an optimal distance metric on each individual feature space, and (ii) learns the optimal combination of the multiple diverse feature types. To further reduce the high cost of DML in high-dimensional feature spaces, we propose a low-rank OMDML algorithm that significantly reduces the computational cost while retaining highly competitive, or even better, accuracy. We conduct extensive experiments on multi-modal image retrieval, in which encouraging results validate the effectiveness of the proposed technique.
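To make the two-level idea concrete, the sketch below shows how the second level could combine per-modality distances with multiplicative (Hedge-style) weight updates. The class name, the update rule, and the parameter beta are illustrative assumptions, and the first-level per-modality metric updates are omitted:

```python
import numpy as np

class TwoLevelCombiner:
    """Second level of a two-level scheme: one distance per modality,
    combined with multiplicative weights that shift toward the
    modalities whose metrics judge pairs correctly."""
    def __init__(self, n_modalities, beta=0.9):
        self.w = np.ones(n_modalities) / n_modalities
        self.beta = beta  # discount applied to modalities that err

    def combined_distance(self, per_modality_dists):
        # Weighted sum of the distances produced by each modality's metric.
        return float(self.w @ np.asarray(per_modality_dists))

    def update(self, per_modality_losses):
        # Down-weight each modality in proportion to its loss, then renormalize.
        self.w = self.w * self.beta ** np.asarray(per_modality_losses)
        self.w /= self.w.sum()
```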
Finally, we propose a novel framework of online multi-modal deep similarity learning (OMDSL), which exploits emerging deep learning techniques to learn a flexible nonlinear similarity function over images with multi-modal feature representations. The preceding OMDML scheme learns a linear distance function on the input feature space, and this linearity assumption limits its capacity to capture similarity over the complex patterns found in real-world applications. To address this limitation, OMDSL explores a unified two-stage online learning scheme that consists of (i) learning a flexible nonlinear transformation function for each individual modality, and (ii) learning the optimal combination of the multiple diverse modalities simultaneously in a coherent process. We evaluate the proposed technique on multi-modal image retrieval tasks over a variety of image data sets, in which encouraging results show that OMDSL significantly outperforms the previous techniques.
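As a rough illustration of the two OMDSL stages, the sketch below embeds each modality through a toy nonlinear transformation and combines the per-modality similarities with learned weights. The two-layer tanh transform, the cosine similarity, and all names are illustrative assumptions; the actual deep architecture and training procedure are specified in the thesis:

```python
import numpy as np

def nonlinear_embed(x, W1, W2):
    """Toy stand-in for the per-modality nonlinear transformation that
    OMDSL learns (weights here are placeholders, not trained)."""
    return np.tanh(W2 @ np.tanh(W1 @ x))

def multimodal_similarity(xs, ys, params, weights):
    """Weighted sum of per-modality cosine similarities between the
    embedded representations of two images xs and ys."""
    sims = []
    for x, y, (W1, W2) in zip(xs, ys, params):
        u, v = nonlinear_embed(x, W1, W2), nonlinear_embed(y, W1, W2)
        sims.append(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))
    return float(np.dot(weights, sims))
```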
author2: Hoi Chu Hong
author: Wu, Pengcheng
format: Theses and Dissertations
title: Distance metric learning for multi-modal image retrieval and annotation
publishDate: 2014
url: http://hdl.handle.net/10356/60499
spelling: sg-ntu-dr.10356-60499 (record updated 2023-03-04T00:33:25Z). Distance metric learning for multi-modal image retrieval and annotation. Wu, Pengcheng; Hoi Chu Hong. School of Computer Engineering, Centre for Computational Intelligence. Doctor of Philosophy (SCE). Deposited 2014-05-27T08:45:31Z; published 2014. Thesis, 163 p., application/pdf, in English (en). http://hdl.handle.net/10356/60499. Abstract as given in the description field above.