Web text-aided image classification

Bibliographic Details
Main Author: Wang, Dongzhe
Other Authors: Mao Kezhi
Format: Theses and Dissertations
Language: English
Published: 2019
Online Access: https://hdl.handle.net/10356/105855
http://hdl.handle.net/10220/47855
Institution: Nanyang Technological University
Description
Summary: Image classification is often solved as a machine-learning problem: a classifier is first learned from training data, and class labels are then assigned to unlabeled test data based on the classifier's outputs. To train an image classifier with good generalization capability, conventional methods often require a large number of human-labeled training images, which may not always be available. With the exponential growth of web data, exploiting multimodal online sources through standard search engines has become a trend in visual recognition, as it can effectively alleviate the shortage of training data. However, web data such as text is often difficult to use because it is unstructured and noisy. This thesis therefore focuses on how to represent and utilize web text data to aid image classification. Since target image data and web text data usually come from different domains with representations in different feature spaces, we first investigate the two modalities separately and then combine the bimodal information at the decision level. In particular, low-level text modeling approaches, including class tag occurrence and bag-of-words vectorization, and image modeling approaches such as dense SIFT are employed to learn separate classifiers, whose decision scores are aggregated adaptively. In addition, we believe that the correlation between the image modality and the web text modality is also very important. To explore this cross-modal correlation, we also investigate feature-level multimodal fusion models. Learning dense, real-valued text representations in a manner similar to learning image representations is the keystone of feature-level multimodal fusion. In this thesis, we propose a novel task-specific semantic matching network and a novel task-generic semantic convolutional network to learn semantic text features.
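As an illustration of adaptive decision-level fusion, the following is a minimal Python sketch and not the thesis's algorithm. It assumes each modality's classifier emits one raw score per class, and it uses a hypothetical reliability measure (one minus the normalized entropy of the softmax distribution) to weight the two modalities. The function names `softmax`, `reliability`, and `fuse` are illustrative only.

```python
import math


def softmax(scores):
    """Turn raw classifier scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reliability(probs):
    """Hypothetical reliability: 1 - normalized entropy.

    Close to 1 for a confident (peaked) distribution and close to 0 for
    a uniform one. Requires at least two classes.
    """
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return 1.0 - entropy / math.log(len(probs))


def fuse(image_scores, text_scores, eps=1e-12):
    """Combine two modalities with reliability-dependent weights."""
    p_img = softmax(image_scores)
    p_txt = softmax(text_scores)
    r_img = reliability(p_img)
    r_txt = reliability(p_txt)
    # Normalize the two reliabilities into weights that sum to 1.
    w_img = (r_img + eps) / (r_img + r_txt + 2 * eps)
    fused = [w_img * a + (1.0 - w_img) * b for a, b in zip(p_img, p_txt)]
    return fused, w_img


# Toy example: the image classifier is unsure, the text classifier is confident.
image_scores = [2.0, 1.9, 1.8]
text_scores = [0.1, 4.0, 0.2]
fused, w_img = fuse(image_scores, text_scores)
predicted = max(range(len(fused)), key=fused.__getitem__)
```

In this toy case the near-uniform image scores receive a small weight, so the fused decision follows the more confident text classifier.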
These text feature learning methods are motivated by the transferable mid-level image representations learned by convolutional neural networks (CNNs). Beyond the traditional supervised learning setting, we find that the web text-aided strategy also makes a difference in the weakly supervised setting, where only a little labeled data is available. Specifically, we investigate web text-aided one-shot learning, which identifies unlabeled data from novel classes based on a single observation using an adaptive attention mechanism. This thesis is organized as follows. Chapter 1 introduces the motivation behind web resource-aided image classification. Chapter 2 reviews related work in this field, including image representation learning, text representation learning, and multimodal fusion learning. Chapter 3 investigates decision-level data fusion for web-aided image classification and develops an adaptive combiner for the two separate unimodal classifiers. This adaptive fusion algorithm is inspired by the human multisensory integration mechanism, and its adaptability is achieved by a reliability-dependent weighting of the different sensory modalities. In Chapter 4, a novel text model, the semantic matching neural network (SMNN), is proposed, which quantifies semantic matching by cosine similarity between the embedded text input and task-specific semantic filters. It is capable of learning semantic features from the text associated with web images. Compared with text features obtained from other methods, the SMNN text features offer improved reliability and applicability. The SMNN text features and CNN visual features are then jointly learned in a shared representation, which aims to capture the correlations between the two modalities at the feature level.
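The cosine-similarity matching described for the SMNN can be sketched in a few lines of Python. This is an illustrative simplification under stated assumptions, not the SMNN architecture itself: the 3-dimensional embeddings and the two filter vectors are made-up toy values, and taking the maximum similarity over the words of a text (max pooling) is an assumed aggregation choice.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_matching_features(word_vectors, filters):
    """One feature per filter: the best cosine match over all words in the text."""
    return [
        max((cosine(w, f) for w in word_vectors), default=0.0)
        for f in filters
    ]


# Toy embeddings and filters (hypothetical values for illustration only).
embeddings = {
    "dog": [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}
filters = [
    [1.0, 0.0, 0.0],  # stands in for an "animal" semantic filter
    [0.0, 0.0, 1.0],  # stands in for a "vehicle" semantic filter
]

text = ["puppy", "dog"]
features = semantic_matching_features([embeddings[t] for t in text], filters)
```

The resulting vector is dense and real-valued, with one entry per filter, so it can be concatenated with or learned jointly alongside visual features.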
Improving upon the task-specific filters of the SMNN, Chapter 5 presents a novel semantic CNN (s-CNN) model for high-level text representation learning, which encodes semantic correlation using task-generic semantic filters. However, to achieve better applicability and generalization across tasks, the s-CNN model inevitably introduces surplus semantic filters, which may lead to semantic overlap and feature redundancy. To address this issue, s-CNN Clustered (s-CNNC) models, which use filter clusters instead of individual filters, are presented. Interacting with image CNN models, the s-CNNC models can further boost image classification within a multimodal framework that can be trained end-to-end. Chapter 6 develops an adaptive encoder-decoder attention network that uses web text to aid one-shot image classification. Without any ground-truth semantic clues, such as class tag information, our model is able to extract useful information from web-sourced data instead. To address the noisy nature of web text, an adaptive mechanism is introduced to determine when to attend to text-inferred visual features and when to rely on the original visual features. Finally, Chapter 7 summarizes my PhD work and discusses future prospects.
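The idea of choosing between text-inferred and original visual features can be illustrated with a gated convex combination. The sketch below is a hand-written stand-in, not the attention network of Chapter 6: in a trained model the gate would be learned, whereas here it is a hypothetical heuristic based on how well the two feature vectors agree. The `sharpness` and `threshold` parameters and all feature values are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def adaptive_mix(visual, text_inferred, sharpness=10.0, threshold=0.5):
    """Blend two feature vectors with a scalar gate in (0, 1).

    The gate approaches 1 when the text-inferred features agree with the
    visual features (trust the text) and approaches 0 when they disagree
    (fall back to the original visual features).
    """
    agreement = cosine(visual, text_inferred)
    gate = 1.0 / (1.0 + math.exp(-sharpness * (agreement - threshold)))
    mixed = [gate * t + (1.0 - gate) * v for v, t in zip(visual, text_inferred)]
    return mixed, gate


visual = [1.0, 0.0, 0.2]
clean_text = [0.9, 0.1, 0.3]    # text-inferred features consistent with the image
noisy_text = [-0.2, 1.0, -0.5]  # text-inferred features from unrelated web text

mixed_clean, gate_clean = adaptive_mix(visual, clean_text)
mixed_noisy, gate_noisy = adaptive_mix(visual, noisy_text)
```

With consistent text the gate is close to 1 and the mixed features lean on the text-inferred vector; with unrelated text the gate is close to 0 and the output stays near the original visual features.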