Distance metric learning for multi-modal image retrieval and annotation

With the rapid growth of digital cameras and photo-sharing websites, content-based image retrieval (CBIR) and search-based image annotation have become important techniques for many real-world multimedia applications. They remain open challenges today, despite having been studied extensively for decades in several communities, including multimedia, signal processing, and computer vision. One key challenge of CBIR is to find an effective similarity search scheme that accurately retrieves a short list of the most similar images from a massive image collection. Conventional CBIR approaches usually adopt rigid similarity measures, such as the classical Euclidean distance or cosine similarity, which are often limited despite being widely used. In this thesis, we investigate Distance Metric Learning (DML) techniques to improve visual similarity search in multimedia information retrieval tasks. In particular, we propose three kinds of novel machine learning algorithms to tackle the challenges of content-based image retrieval and search-based image annotation.
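
For context on what "learning a distance metric" means below: DML methods in this family typically replace the fixed Euclidean distance with a Mahalanobis-style distance whose matrix is learned from data. The following is the standard textbook formulation, not a formula quoted from the thesis:

    % The Euclidean distance weights every feature dimension equally.
    % DML instead learns a positive semi-definite matrix M and measures
    \[
      d_M(x_i, x_j) \;=\; \sqrt{(x_i - x_j)^{\top} M \,(x_i - x_j)},
      \qquad M \succeq 0,
    \]
    % which recovers the Euclidean distance when M = I, and is a valid
    % (pseudo-)metric precisely because M is constrained to be PSD.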

Firstly, we present a novel Unified Distance Metric Learning (UDML) scheme for mining social images towards automated image annotation. To discover knowledge effectively from social images, which are typically associated with multimedia content (both visual images and textual tags), UDML not only exploits the visual and textual content of social images jointly, but also unifies inductive and transductive metric learning techniques in a single systematic framework. The UDML task is formulated as a convex optimization problem, specifically a Semi-Definite Program (SDP), which is in general difficult to solve. To overcome this challenge, we develop an efficient stochastic gradient descent algorithm for the optimization task and prove its convergence. Applying UDML to search-based image annotation on a large real-world testbed, we demonstrate that the proposed algorithm is empirically promising for mining social images in real applications.
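
The abstract does not spell out the solver, but the usual recipe for attacking an SDP-constrained metric learning problem with stochastic gradient descent is to take a gradient step on one sampled loss term and then project the iterate back onto the positive semi-definite cone. Below is a minimal sketch under that assumption; the pairwise hinge loss is illustrative and not the thesis's exact UDML objective:

    import numpy as np

    def project_psd(M):
        """Project a symmetric matrix onto the PSD cone by
        clipping its negative eigenvalues at zero."""
        M = (M + M.T) / 2.0
        eigvals, eigvecs = np.linalg.eigh(M)
        return eigvecs @ np.diag(np.maximum(eigvals, 0.0)) @ eigvecs.T

    def sgd_metric_learning(pairs, labels, dim, lr=0.01, epochs=10, margin=2.0):
        """Projected SGD for a Mahalanobis metric M.
        pairs:  list of (x_i, x_j) feature-vector pairs
        labels: +1 for similar pairs, -1 for dissimilar pairs"""
        M = np.eye(dim)  # start from the plain Euclidean metric
        for _ in range(epochs):
            for (xi, xj), y in zip(pairs, labels):
                diff = xi - xj
                dist2 = diff @ M @ diff
                # Hinge loss max(0, 1 - y * (margin - dist2)): pull similar
                # pairs inside the margin, push dissimilar pairs outside it.
                if y * (margin - dist2) < 1.0:
                    M -= lr * y * np.outer(diff, diff)
                    M = project_psd(M)  # keep M a legitimate metric
        return M

The eigendecomposition inside the projection is what keeps every iterate a valid metric, and it is also the expensive step that motivates the low-rank variant mentioned in the next part of the abstract.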

Secondly, we investigate a novel scheme of Online Multi-modal Distance Metric Learning (OMDML), which aims to learn distance metrics from multi-modal data, that is, from multiple types of features, via an efficient and scalable online learning scheme. Traditional DML approaches typically adopt a single-modal framework, learning the distance metric either on one type of feature or on a combined space in which the different feature types are simply concatenated. OMDML instead explores a unified two-level online learning scheme that (i) learns to optimize a distance metric on each individual feature space, and (ii) learns to find the optimal combination of the multiple diverse feature types. To further reduce the expensive cost of DML on high-dimensional feature spaces, we propose a low-rank OMDML algorithm that significantly reduces the computational cost while retaining highly competitive or even better learning accuracy. Extensive experiments on multi-modal image retrieval yield encouraging results that validate the effectiveness of the proposed technique.
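
As one concrete reading of the two-level scheme: level one maintains a separate metric per feature type, and level two maintains a nonnegative weight per modality. A common choice for the second level in online learning is a Hedge-style multiplicative update, which is assumed here purely for illustration; the thesis's exact update rule may differ:

    import numpy as np

    def combined_distance(query, item, metrics, weights):
        """Level two: a weighted sum of per-modality Mahalanobis
        distances. metrics[k] is the PSD matrix learned online for
        modality k (level one); query/item are lists of per-modality
        feature vectors."""
        return sum(
            w * np.sqrt((q - x) @ M @ (q - x))
            for w, M, q, x in zip(weights, metrics, query, item)
        )

    def update_weights(weights, losses, beta=0.9):
        """Hedge-style update: modalities that suffered larger loss
        on the current example (losses assumed in [0, 1]) are
        multiplicatively discounted, then weights are renormalized."""
        weights = np.asarray(weights) * beta ** np.asarray(losses)
        return weights / weights.sum()

For the low-rank variant, each metric would presumably be kept in factored form M = L Lᵀ with a d-by-r factor L, so that a distance computation costs O(dr) rather than O(d²), which is consistent with the abstract's claim of significantly reduced computational cost on high-dimensional features.
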
Finally, we propose a novel framework of Online Multi-modal Deep Similarity Learning (OMDSL), which exploits emerging deep learning techniques to learn a flexible nonlinear similarity function from images with multi-modal feature representations. The preceding OMDML scheme learns a linear distance function on the input feature space, and this linearity assumption limits its capacity to measure similarity over the complex patterns found in real-world applications. To address this limitation, OMDSL explores a unified two-stage online learning scheme that consists of (i) learning a flexible nonlinear transformation function for each individual modality, and (ii) learning the optimal combination of the multiple diverse modalities simultaneously in a coherent process. We evaluate the proposed technique on a variety of image data sets for multi-modal image retrieval, and the encouraging results show that OMDSL significantly outperforms the previous techniques.
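
To make stage (i) concrete: replacing each per-modality linear metric with a differentiable nonlinear embedding is what lifts the linearity limitation, and the per-modality similarities are then combined much as in the two-level scheme above. The toy one-hidden-layer transform and the cosine similarity below are illustrative assumptions, not the architecture used in the thesis:

    import numpy as np

    class ModalityTransform:
        """Stage (i): a toy one-hidden-layer nonlinear embedding for a
        single modality, standing in for a deep network."""
        def __init__(self, in_dim, out_dim, rng):
            self.W1 = rng.standard_normal((in_dim, out_dim)) * 0.1
            self.W2 = rng.standard_normal((out_dim, out_dim)) * 0.1

        def __call__(self, x):
            hidden = np.tanh(x @ self.W1)  # the nonlinearity is the point
            return hidden @ self.W2

    def multimodal_similarity(xs, ys, transforms, weights):
        """Stage (ii): combine per-modality similarities, each computed
        in its learned embedding space, with nonnegative weights."""
        def cos(a, b):
            return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
        return sum(
            w * cos(f(x), f(y))
            for w, f, x, y in zip(weights, transforms, xs, ys)
        )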

Bibliographic Details
Main Author: Wu, Pengcheng
Other Authors: Hoi, Chu Hong
Format: Theses and Dissertations
Degree: Doctor of Philosophy (SCE)
Department: School of Computer Engineering, Centre for Computational Intelligence
Language: English
Published: 2014
Subjects: DRNTU::Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Online Access: http://hdl.handle.net/10356/60499
Institution: Nanyang Technological University
Physical Description: 163 p., application/pdf