Towards unbiased visual language reasoning and consistent segmentation

Bibliographic Details
Main Author: Huang, Jianqiang
Other Authors: Hanwang Zhang
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2023
Online Access: https://hdl.handle.net/10356/169540
Institution: Nanyang Technological University
Description
Summary: In recent years, significant advances have been made in standard recognition tasks such as classification, detection, and segmentation. To push visual understanding further, a growing number of researchers have introduced text information for reasoning, in tasks such as image captioning, visual question answering (VQA), and visual grounding (VG). In a general vision-language framework, visual and language features are first extracted by backbone networks and then aggregated for the downstream task in an end-to-end manner. However, the intrinsic data bias has not been fully addressed, so the training process is usually biased, which in turn leads to biased inference. To tackle these challenges, we conducted two works that address co-occurrence bias and language bias. The main contributions are summarized below:

• We propose a general deconfounded module for visual grounding. We find that bias is the major bottleneck in visual reasoning. For example, the grounding process is often a trivial language-location association without visual reasoning, e.g., grounding any language query containing "sheep" to a region near the image center, because most queries about sheep have ground-truth locations at the image center. First, we frame the visual grounding pipeline as a causal graph, which shows the causal relations among the image, the query, the target location, and the underlying confounder. The causal graph shows how to break the grounding bottleneck: deconfounded visual grounding. Second, to tackle the challenge that the confounder is generally unobserved, we propose a confounder-agnostic approach called Referring Expression Deconfounder (RED) to remove the confounding bias. Third, we implement RED as a simple language attention, which can be applied to any grounding method. On popular benchmarks, RED improves various state-of-the-art grounding methods by a significant margin.
• We introduce the Visual Commonsense Region-based Convolutional Neural Network (VC R-CNN), which serves as an improved visual region encoder for high-level tasks such as captioning and VQA. Given a set of detected object regions in an image (e.g., by Faster R-CNN), the proxy training objective of VC R-CNN, like that of other unsupervised feature learning methods (e.g., word2vec), is to predict the contextual objects of a region. However, the two are fundamentally different: VC R-CNN predicts by causal intervention, P(Y | do(X)), while the others use the conventional likelihood, P(Y | X). This is the core reason why VC R-CNN can learn "sense-making" knowledge (e.g., a chair can be sat on) rather than merely "common" co-occurrences (e.g., a chair is likely to exist if a table is observed). We extensively apply VC R-CNN features in prevailing models for three popular tasks, image captioning, VQA, and VCR, and observe consistent performance boosts across all methods and tasks, achieving many new state-of-the-art results.

Segmentation, which aims to infer the object label of each pixel in an image or a video, is one of the fundamental tasks in computer vision. Weakly supervised segmentation is especially important and challenging because pixel-level annotation is expensive. The general weakly supervised segmentation paradigm suffers from a dilemma: global feature maps tend to capture structural information, while local feature maps favor boundary details. Existing methods try to couple them by optimizing a unified objective function. This fusion strategy can fit the training data well, but the respective contributions of the two are unclear. In fact, the inconsistency between local and global information can introduce conflicts during training, which results in unstable performance on the test set. In this thesis, we consider two sub-tasks: weakly supervised semantic segmentation (WSSS) and semi-supervised video object segmentation (VOS).
For both, we identify inconsistencies in existing learning paradigms and propose new approaches for consistent segmentation. The main contributions are summarized below:

• We propose a novel AD-CAM for weakly supervised semantic segmentation (WSSS). We explore how to couple the long-range attention in ViT with the local attention maps generated by a CNN, aiming for better pseudo masks to train semantic segmentation models. We find that the key issue in straightforward coupling, namely an increase in false positives, comes from the spurious dependencies learned in ViT. We address this by proposing a novel AD-CAM that integrates ViT attention and CAM activation conservatively through probabilistic diffusion. It first refines the initial attention map by introducing co-neighbor similarity, which uses neighboring information to generate more confident (less spurious) attention. Then, it performs attention-based CAM diffusion, diffusing the activation of a pixel to its neighbors in proportion to the corresponding attention. Extensive experiments and analyses on popular WSSS benchmarks validate the superiority of AD-CAM over other ViT- or CNN-based methods.

• We introduce Scale-Cooperative Matching (SCM) for video object segmentation (VOS), which reduces conflicting gradients across scales. We first identify the non-cooperative matching problem across scales. Specifically, coarser feature maps tend to capture structural information, while finer ones favor small local patterns. Therefore, their matching results can deviate from each other at the same feature pixel, which in turn causes conflicting gradients for both modules. To address this problem, we adopt the strategy of gradient surgery and propose the first scale-cooperative learning framework for the multi-scale training of VOS models.
It automatically eliminates the conflicting components of the gradients and thus ensures that the overall gradient benefits all scales throughout optimization. The proposed method outperforms previous state-of-the-art methods on two popular multi-object benchmarks, YouTube-VOS and DAVIS 2017, and one single-object benchmark, DAVIS 2016. In addition, qualitative examples further demonstrate that our learning framework preserves the integrity of objects while obtaining better boundaries.
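
The summary describes gradient surgery only at a high level, so the following is a minimal sketch of the general idea rather than the thesis's actual procedure. It assumes a PCGrad-style projection: when two per-scale gradients have a negative inner product, the conflicting component of each is projected out before they are summed. The function names (`dot`, `remove_conflicts`, `combine_gradients`) and the toy two-dimensional gradients are hypothetical, and the fixed projection order is a simplification (PCGrad itself samples the order at random).

```python
from typing import List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def remove_conflicts(grads: Sequence[Sequence[float]]) -> List[Vector]:
    """Project each gradient away from the other (original) gradients it conflicts with.

    Assumption: a PCGrad-style rule. Two gradients conflict when their
    inner product is negative; the conflicting component is subtracted.
    """
    projected: List[Vector] = []
    for i, g_i in enumerate(grads):
        g = list(g_i)
        for j, g_j in enumerate(grads):
            if i == j:
                continue
            inner = dot(g, g_j)
            norm_sq = dot(g_j, g_j)
            if inner < 0 and norm_sq > 0:
                scale = inner / norm_sq
                # Remove the part of g that points against g_j.
                g = [x - scale * y for x, y in zip(g, g_j)]
        projected.append(g)
    return projected


def combine_gradients(grads: Sequence[Sequence[float]]) -> Vector:
    """Sum the conflict-free gradients into one update direction."""
    return [sum(components) for components in zip(*remove_conflicts(grads))]


# Hypothetical gradients from a coarse-scale and a fine-scale matching loss.
coarse_grad = [1.0, 0.0]
fine_grad = [-1.0, 1.0]  # inner product with coarse_grad is -1.0, so they conflict

update = combine_gradients([coarse_grad, fine_grad])
print(update)  # [0.5, 1.5]
```

In this toy case the combined update has a non-negative inner product with both original gradients, which is the sense in which the overall gradient can benefit every scale. Gradients that do not conflict pass through unchanged and are simply summed.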