Distance metric learning for multi-modal image retrieval and annotation

With the rapid growth of digital cameras and photo-sharing websites, content-based image retrieval (CBIR) and search-based image annotation have become important techniques for many real-world multimedia applications. They remain open challenges today, despite having been studied extensively for decades in several communities, including multimedia, signal processing, and computer vision. One key challenge of CBIR is to find an effective similarity search scheme that accurately retrieves a short list of the most similar images from a massive image collection. Conventional CBIR approaches usually adopt rigid similarity measures, such as the classical Euclidean distance or cosine similarity, which are often limited despite being widely used. In this thesis, we investigate Distance Metric Learning (DML) techniques to improve visual similarity search in multimedia information retrieval tasks. In particular, we propose three kinds of novel machine learning algorithms to tackle the challenges of content-based image retrieval and search-based image annotation.
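
For context on what "learning a distance metric" means below: DML methods in this family typically replace the fixed Euclidean distance with a Mahalanobis-style distance whose matrix is learned from data. The following is the standard textbook formulation, not a formula quoted from the thesis:

    % The Euclidean distance weights every feature dimension equally.
    % DML instead learns a positive semi-definite matrix M and measures
    \[
      d_M(x_i, x_j) \;=\; \sqrt{(x_i - x_j)^{\top} M \,(x_i - x_j)},
      \qquad M \succeq 0,
    \]
    % which recovers the Euclidean distance when M = I, and is a valid
    % (pseudo-)metric precisely because M is constrained to be PSD.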

Firstly, we present a novel Unified Distance Metric Learning (UDML) scheme for mining social images towards automated image annotation. To discover knowledge effectively from social images, which are typically associated with multimedia content (both visual images and textual tags), UDML not only exploits the visual and textual content of social images jointly, but also unifies inductive and transductive metric learning techniques in a single systematic framework. The UDML task is formulated as a convex optimization problem, specifically a Semi-Definite Program (SDP), which is in general difficult to solve. To overcome this challenge, we develop an efficient stochastic gradient descent algorithm for the optimization task and prove its convergence. Applying UDML to search-based image annotation on a large real-world testbed, we demonstrate that the proposed algorithm is empirically promising for mining social images in real applications.
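
The abstract does not spell out the solver, but the usual recipe for attacking an SDP-constrained metric learning problem with stochastic gradient descent is to take a gradient step on one sampled loss term and then project the iterate back onto the positive semi-definite cone. Below is a minimal sketch under that assumption; the pairwise hinge loss is illustrative and not the thesis's exact UDML objective:

    import numpy as np

    def project_psd(M):
        """Project a symmetric matrix onto the PSD cone by
        clipping its negative eigenvalues at zero."""
        M = (M + M.T) / 2.0
        eigvals, eigvecs = np.linalg.eigh(M)
        return eigvecs @ np.diag(np.maximum(eigvals, 0.0)) @ eigvecs.T

    def sgd_metric_learning(pairs, labels, dim, lr=0.01, epochs=10, margin=2.0):
        """Projected SGD for a Mahalanobis metric M.
        pairs:  list of (x_i, x_j) feature-vector pairs
        labels: +1 for similar pairs, -1 for dissimilar pairs"""
        M = np.eye(dim)  # start from the plain Euclidean metric
        for _ in range(epochs):
            for (xi, xj), y in zip(pairs, labels):
                diff = xi - xj
                dist2 = diff @ M @ diff
                # Hinge loss max(0, 1 - y * (margin - dist2)): pull similar
                # pairs inside the margin, push dissimilar pairs outside it.
                if y * (margin - dist2) < 1.0:
                    M -= lr * y * np.outer(diff, diff)
                    M = project_psd(M)  # keep M a legitimate metric
        return M

The eigendecomposition inside the projection is what keeps every iterate a valid metric, and it is also the expensive step that motivates the low-rank variant mentioned in the next part of the abstract.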

Secondly, we investigate a novel scheme of Online Multi-modal Distance Metric Learning (OMDML), which aims to learn distance metrics from multi-modal data, that is, from multiple types of features, via an efficient and scalable online learning scheme. Traditional DML approaches typically adopt a single-modal framework, learning the distance metric either on one type of feature or on a combined space in which the different feature types are simply concatenated. OMDML instead explores a unified two-level online learning scheme that (i) learns to optimize a distance metric on each individual feature space, and (ii) learns to find the optimal combination of the multiple diverse feature types. To further reduce the expensive cost of DML on high-dimensional feature spaces, we propose a low-rank OMDML algorithm that significantly reduces the computational cost while retaining highly competitive or even better learning accuracy. Extensive experiments on multi-modal image retrieval yield encouraging results that validate the effectiveness of the proposed technique.
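
As one concrete reading of the two-level scheme: level one maintains a separate metric per feature type, and level two maintains a nonnegative weight per modality. A common choice for the second level in online learning is a Hedge-style multiplicative update, which is assumed here purely for illustration; the thesis's exact update rule may differ:

    import numpy as np

    def combined_distance(query, item, metrics, weights):
        """Level two: a weighted sum of per-modality Mahalanobis
        distances. metrics[k] is the PSD matrix learned online for
        modality k (level one); query/item are lists of per-modality
        feature vectors."""
        return sum(
            w * np.sqrt((q - x) @ M @ (q - x))
            for w, M, q, x in zip(weights, metrics, query, item)
        )

    def update_weights(weights, losses, beta=0.9):
        """Hedge-style update: modalities that suffered larger loss
        on the current example (losses assumed in [0, 1]) are
        multiplicatively discounted, then weights are renormalized."""
        weights = np.asarray(weights) * beta ** np.asarray(losses)
        return weights / weights.sum()

For the low-rank variant, each metric would presumably be kept in factored form M = L Lᵀ with a d-by-r factor L, so that a distance computation costs O(dr) rather than O(d²), which is consistent with the abstract's claim of significantly reduced computational cost on high-dimensional features.
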
Finally, we propose a novel framework of Online Multi-modal Deep Similarity Learning (OMDSL), which exploits emerging deep learning techniques to learn a flexible nonlinear similarity function from images with multi-modal feature representations. The preceding OMDML scheme learns a linear distance function on the input feature space, and this linearity assumption limits its capacity to measure similarity over the complex patterns found in real-world applications. To address this limitation, OMDSL explores a unified two-stage online learning scheme that consists of (i) learning a flexible nonlinear transformation function for each individual modality, and (ii) learning the optimal combination of the multiple diverse modalities simultaneously in a coherent process. We evaluate the proposed technique on a variety of image data sets for multi-modal image retrieval, and the encouraging results show that OMDSL significantly outperforms the previous techniques.
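
To make stage (i) concrete: replacing each per-modality linear metric with a differentiable nonlinear embedding is what lifts the linearity limitation, and the per-modality similarities are then combined much as in the two-level scheme above. The toy one-hidden-layer transform and the cosine similarity below are illustrative assumptions, not the architecture used in the thesis:

    import numpy as np

    class ModalityTransform:
        """Stage (i): a toy one-hidden-layer nonlinear embedding for a
        single modality, standing in for a deep network."""
        def __init__(self, in_dim, out_dim, rng):
            self.W1 = rng.standard_normal((in_dim, out_dim)) * 0.1
            self.W2 = rng.standard_normal((out_dim, out_dim)) * 0.1

        def __call__(self, x):
            hidden = np.tanh(x @ self.W1)  # the nonlinearity is the point
            return hidden @ self.W2

    def multimodal_similarity(xs, ys, transforms, weights):
        """Stage (ii): combine per-modality similarities, each computed
        in its learned embedding space, with nonnegative weights."""
        def cos(a, b):
            return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
        return sum(
            w * cos(f(x), f(y))
            for w, f, x, y in zip(weights, transforms, xs, ys)
        )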

Bibliographic Details
Main Author: Wu, Pengcheng
Other Authors: Hoi, Chu Hong
Format: Theses and Dissertations
Degree: Doctor of Philosophy (SCE)
Department: School of Computer Engineering, Centre for Computational Intelligence
Language: English
Published: 2014
Subjects: DRNTU::Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Online Access: http://hdl.handle.net/10356/60499
Institution: Nanyang Technological University
Physical Description: 163 p., application/pdf