Semi-supervised learning for visual relation annotation

Bibliographic Details
Main Author: Tajrobehkar, Mitra
Other Authors: Zhang, Hanwang
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2022
Subjects:
Online Access:https://hdl.handle.net/10356/161913
Institution: Nanyang Technological University
Description
Summary: Owing to their ability to learn both low-level and high-level visual features, deep neural networks (DNNs) have become the basic structure of many computer vision (CV) applications such as object detection, semantic segmentation, relation detection, and annotation. While most research focuses on maximizing overall performance when training a machine learning model, until recently little attention has been paid to evaluating robustness, e.g., against visual content manipulation. A lack of robustness, especially with respect to consistency and discrimination, can stem from various causes, e.g., data distributions, inadequacies in the learning process, or model sensitivity to different regions of the feature space. Discrimination refers to a model's ability to distinguish between individual class samples, while consistency refers to its ability to keep predictions stable despite input variations. To address these challenges, the central focus of this thesis is representation learning for two of the most challenging CV tasks: Scene Graph Generation (SGG) and Visual Question Answering (VQA). For the first research direction, we propose a novel head network that tackles the problem of poor discrimination by learning semantic pairwise feature representations. The second research direction addresses model instability by generating better representations: through a consensus model, common feature representations reasoned from varied samples are learned to increase the robustness of the consensus. In summary, the major contributions of this thesis are as follows:

We propose a learning-to-align meta-architecture, called Align R-CNN, for dynamic object feature concatenation in visual reasoning tasks. Taking scene graph generation as an example, humans describe visual relationships between objects semantically (e.g., riding behind, sitting on). We propose a semantic transformation that parses an input image into <subject-relation-object> triplets and then extracts visually grounded semantic features by re-aligning the subject features with respect to its paired object and relation. We argue that previous works are highly limited by naive concatenation and, as a result, fail to discriminate between riding and feeding for the object pair person and horse. Moreover, naively concatenated pairwise features may collapse a less frequent but meaningful predicate (e.g., sitting on) into a more frequent but less informative one (e.g., on). Compared with existing relation representations that use scene graphs to connect objects, the proposed Align R-CNN has two key advantages: 1) it maintains good representations during training while removing irrelevant features from the objects; 2) its learning is dynamic, which enables the model to deal with different pairs. These advantages prevent Align R-CNN from over-fitting to the biased dataset. Note that the proposed framework can also be utilized by the community that seeks zero-shot predictions.
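As a rough illustration of this idea only (a minimal sketch under assumed names, dimensions, and gating form, not the thesis implementation), the snippet below shows a PyTorch pair head that gates the subject feature on its paired object and relation (union-region) features before concatenation, rather than concatenating the raw features naively.

import torch
import torch.nn as nn

class AlignedPairHead(nn.Module):
    """Gate the subject feature by its pair context before concatenation."""

    def __init__(self, dim, num_predicates):
        super().__init__()
        # Pair-conditioned gate: decides which subject dimensions matter for this pair.
        self.align_gate = nn.Sequential(nn.Linear(3 * dim, dim), nn.Sigmoid())
        self.classifier = nn.Linear(3 * dim, num_predicates)

    def forward(self, subj, obj, rel):
        # subj / obj / rel: (batch, dim) ROI features of the subject box,
        # the object box, and their union (relation) box.
        pair_ctx = torch.cat([subj, obj, rel], dim=-1)
        gate = self.align_gate(pair_ctx)        # pair-specific alignment weights
        aligned_subj = gate * subj              # suppress irrelevant subject dimensions
        fused = torch.cat([aligned_subj, obj, rel], dim=-1)
        return self.classifier(fused)           # predicate logits, e.g. riding vs. feeding

# Hypothetical usage:
#   head = AlignedPairHead(dim=1024, num_predicates=51)
#   logits = head(subj_feat, obj_feat, union_feat)

The point of the gate is that the same subject feature (e.g., a person) is re-weighted differently when paired with a horse than with food, which is what lets a pairwise representation separate riding from feeding instead of collapsing onto a frequent predicate such as on.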
We propose a framework that enhances model consistency by generating desirable, consistent feature representations. This line of research addresses the lack of VQA models whose consensus remains robust against linguistic variations in questions. For instance, although the reference question "How many cars in picture?" and its syntactic variant "How many automobiles are there?" are semantically identical, a model may predict an incorrect answer for the variant. Moreover, the model should be powerful enough to predict the right answer for "How many red cars seen in picture?" Inspired by unsupervised feature representation learning, we use contrastive learning, which learns better representations from both vision and language inputs. However, we argue that training the model with a naive contrastive learning framework, which uses random intra-class and random non-target sequences as positive and negative examples, is sub-optimal; it may not boost model performance on robust VQA benchmarks. The proposed method instead dedicates a principal head network to generating positive and negative samples for contrastive learning by adding adversarial perturbations. Specifically, it generates hard positive samples by adding large perturbations to both input images and questions so as to maximize the conditional likelihood. The proposed framework has two key advantages: 1) the generative model built on the embedding representation offers rich information that increases model stability; 2) by exploring the effects of single-modality and multi-modal attacks, the model mitigates the correlation between the bias and the learned features.
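For concreteness, the sketch below shows one plausible way to pair an adversarially generated hard positive with a contrastive objective. The function names, the single FGSM-style gradient step, and the InfoNCE-style loss are assumptions for illustration, and it operates on a fused vision-language embedding rather than on raw images and questions.

import torch
import torch.nn.functional as F

def hard_positive(embed, answer_logits_fn, answer, eps=1.0):
    """Perturb the fused embedding as far as eps allows while keeping the
    correct answer likely, i.e. a descent step on the answer loss."""
    embed = embed.detach().requires_grad_(True)
    loss = F.cross_entropy(answer_logits_fn(embed), answer)
    grad, = torch.autograd.grad(loss, embed)
    # Step against the loss gradient: a large change to the representation
    # that still maximizes the conditional likelihood of the right answer.
    return (embed - eps * grad.sign()).detach()

def contrastive_loss(anchor, positive, negatives, tau=0.07):
    """InfoNCE: pull the hard positive toward the anchor, push negatives away."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos = (anchor * positive).sum(dim=-1, keepdim=True) / tau   # (B, 1)
    neg = anchor @ negatives.t() / tau                          # (B, K)
    target = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(torch.cat([pos, neg], dim=1), target)

In the thesis's framing the perturbations are applied to the input images and questions themselves (single-modality or multi-modal attacks); hard negatives could be generated analogously, e.g. by ascending rather than descending the answer loss, though that detail is an assumption here.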