Referring expression segmentation: from conventional to generalized
In recent years, many remarkable achievements have been made in the field of deep machine learning in various data modalities, such as image processing and natural language comprehension. Based on the good performance of deep neural networks in single modalities, multi-modal tasks, which integrate d...
Saved in:
Main Author: | |
---|---|
Other Authors: | |
Format: | Thesis-Doctor of Philosophy |
Language: | English |
Published: |
Nanyang Technological University
2024
|
Subjects: | |
Online Access: | https://hdl.handle.net/10356/175477 |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Institution: | Nanyang Technological University |
Language: | English |
Summary: | In recent years, many remarkable achievements have been made in the field of deep machine learning in various data modalities, such as image processing and natural language comprehension. Based on the good performance of deep neural networks in single modalities, multi-modal tasks, which integrate data from different modal domains, are becoming emerging research topics. Among the complex integrated tasks, one particularly challenging and important task is Referring Expression Segmentation (RES), which aims to generate a segmentation mask for a target object in a given image as described by a given natural language query expression, involving both computer vision and natural language processing. This thesis addresses the problem of RES from multiple angles to investigate the topic of this complex multi-modal task.
Firstly, we propose an efficient, instance-specific framework that optimizes the traditional CNN-RNN pipeline. Traditional RES methods usually either use an FCN-like network that directly generates the segmentation mask from the image or first extract all instances using a standalone network and then select the target from targets. We combine the strengths of both kinds of methods and propose a novel framework that can analyze the relationship among instances while maintaining the efficiency of the FCN-like network.
Secondly, we employ an attention-based network to model long-range dependencies in both image and language modalities. In CNN networks, the large receptive field is achieved by stacking multiple small-kernel convolutional layers, which is indirect and lacks efficiency when exchanging long-distance features. From this point, we utilize the Transformer-based network that can model long-range dependencies in a more efficient way. Next, based on this work, we find that the generic attention mechanism used in the classic Transformer is designed for processing single-modal data. We further enhance the mechanism of generic attention with feature-fusing capabilities, achieving denser feature fusion.
Lastly, to accommodate multi-object and no-object expressions, we introduce a novel task called Generalized Referring Expression Segmentation (GRES). To facilitate research in this field, we also construct a large-scale dataset for GRES and design a baseline method, namely ReLA. The proposed method implicitly divides the image into regions and explicitly analyzes the relationship among them, achieving state-of-the-art performance on both RES and GRES datasets.
Our proposed approach advances the state-of-the-art in referring segmentation, and further generalizes the conventional RES to Generalized RES, providing new insights, methods and topics for further research in this field. |
---|