Unified information fusion network for multi-modal RGB-D and RGB-T salient object detection
Format: Article
Language: English
Published: 2021
Online Access: https://hdl.handle.net/10356/150772
Institution: Nanyang Technological University
Summary: The use of complementary information, namely depth or thermal information, has shown its benefits for salient object detection (SOD) in recent years. However, the RGB-D and RGB-T SOD problems are currently solved only independently, and most existing methods directly extract and fuse raw features from backbones. Such methods are easily restricted by low-quality modality data and redundant cross-modal features. In this work, a unified end-to-end framework is designed to handle the RGB-D and RGB-T SOD tasks simultaneously. Specifically, to effectively tackle multi-modal features, we propose a novel multi-stage and multi-scale fusion network (MMNet), which consists of a cross-modal multi-stage fusion module (CMFM) and a bi-directional multi-scale decoder (BMD). Similar to the visual color stage doctrine in the human visual system (HVS), the proposed CMFM aims to explore important feature representations in the feature response stage and integrate them into cross-modal features in the adversarial combination stage. Moreover, the proposed BMD learns the combination of multi-level cross-modal fused features to capture both local and global information of salient objects, and can further boost multi-modal SOD performance. The proposed unified cross-modality feature analysis framework, based on two-stage and multi-scale information fusion, can be used for diverse multi-modal SOD tasks. Comprehensive experiments (∼92K image pairs) demonstrate that the proposed method consistently outperforms 21 other state-of-the-art methods on nine benchmark datasets. This validates that the proposed method works well on diverse multi-modal SOD tasks with good generalization and robustness, and provides a strong multi-modal SOD benchmark.
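For readers who want a concrete picture of the design described in the summary, the sketch below outlines the overall structure in PyTorch: a two-stream encoder with stage-wise cross-modal fusion of RGB and depth/thermal features, followed by a bi-directional multi-scale decoder. All module names (`CrossModalFusion`, `BiDirectionalDecoder`, `MultiModalSODSketch`), channel widths, and the toy strided-conv encoder are assumptions made for illustration; this is not the authors' implementation of CMFM, BMD, or MMNet.

```python
# Illustrative sketch only: module names, channel widths, and the toy encoder
# below are assumptions for exposition, not the authors' CMFM/BMD/MMNet code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalFusion(nn.Module):
    """Fuse RGB and auxiliary (depth or thermal) features at one encoder stage."""

    def __init__(self, channels):
        super().__init__()
        # Channel gating stands in for the "feature response" stage:
        # estimate per-channel importance of the concatenated modalities.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, channels, 1),
            nn.Sigmoid(),
        )
        self.merge = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, rgb, aux):
        cat = torch.cat([rgb, aux], dim=1)
        weight = self.gate(cat)                   # (B, C, 1, 1) importance
        return self.merge(cat) * weight + rgb     # gated fusion + RGB residual


class BiDirectionalDecoder(nn.Module):
    """Combine multi-level fused features with top-down and bottom-up passes."""

    def __init__(self, channels, num_levels=4):
        super().__init__()
        self.td_conv = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in range(num_levels - 1)
        )
        self.bu_conv = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in range(num_levels - 1)
        )
        self.predict = nn.Conv2d(channels, 1, 1)

    def forward(self, feats):  # feats ordered fine (large) -> coarse (small)
        # Top-down pass: propagate global context from coarse to fine levels.
        td = list(feats)
        for i in range(len(td) - 2, -1, -1):
            up = F.interpolate(td[i + 1], size=td[i].shape[-2:],
                               mode="bilinear", align_corners=False)
            td[i] = self.td_conv[i](td[i] + up)
        # Bottom-up pass: feed refined local detail back to coarser levels.
        bu = list(td)
        for i in range(1, len(bu)):
            down = F.adaptive_avg_pool2d(bu[i - 1], bu[i].shape[-2:])
            bu[i] = self.bu_conv[i - 1](bu[i] + down)
        # Predict at the finest level, enriched by the coarsest refined feature.
        ctx = F.interpolate(bu[-1], size=td[0].shape[-2:],
                            mode="bilinear", align_corners=False)
        return self.predict(td[0] + ctx)


class MultiModalSODSketch(nn.Module):
    """Two-stream saliency network; strided convs stand in for real backbones."""

    def __init__(self, channels=64, num_levels=4):
        super().__init__()
        self.rgb_stages = nn.ModuleList(
            nn.Conv2d(3 if i == 0 else channels, channels, 3, stride=2, padding=1)
            for i in range(num_levels)
        )
        self.aux_stages = nn.ModuleList(
            nn.Conv2d(1 if i == 0 else channels, channels, 3, stride=2, padding=1)
            for i in range(num_levels)
        )
        self.fusion = nn.ModuleList(
            CrossModalFusion(channels) for _ in range(num_levels)
        )
        self.decoder = BiDirectionalDecoder(channels, num_levels)

    def forward(self, rgb, aux):  # aux: single-channel depth or thermal map
        feats, x, y = [], rgb, aux
        for enc_r, enc_a, fuse in zip(self.rgb_stages, self.aux_stages, self.fusion):
            x, y = F.relu(enc_r(x)), F.relu(enc_a(y))
            feats.append(fuse(x, y))              # stage-wise cross-modal fusion
        saliency = self.decoder(feats)
        return F.interpolate(saliency, size=rgb.shape[-2:],
                             mode="bilinear", align_corners=False)


if __name__ == "__main__":
    model = MultiModalSODSketch()
    rgb = torch.randn(1, 3, 256, 256)
    depth = torch.randn(1, 1, 256, 256)           # or a thermal map for RGB-T
    print(model(rgb, depth).shape)                # torch.Size([1, 1, 256, 256])
```

Because the auxiliary stream only assumes a single-channel input, the same skeleton accepts either a depth map (RGB-D) or a thermal map (RGB-T), which mirrors the paper's point that one unified framework can serve both multi-modal SOD tasks.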