Learning multi-modal scale-aware attentions for efficient and robust road segmentation

Bibliographic Details
Main Author: Zhou, Yunjiao
Other Authors: Xie, Lihua
Format: Thesis-Master by Coursework
Language: English
Published: Nanyang Technological University, 2022
Online Access: https://hdl.handle.net/10356/159277
Institution: Nanyang Technological University
Description
Abstract: Multi-modal fusion has proven beneficial for road segmentation in autonomous driving, where depth is commonly used as complementary data to RGB images to provide robust 3D geometry information. Existing methods adopt an encoder-decoder structure that fuses the two modalities by encoding and concatenating high-level and low-level features. However, this widens semantic gaps not only between modalities but also across scales, which is detrimental to road segmentation. To overcome this challenge and obtain robust features, we propose a Multi-modal Scale-aware Attention Network (MSAN) that fuses RGB and depth data effectively via a novel transformer-based cross-attention module, the Multi-modal Scale-aware Transformer (MST), which fuses RGB-D features across multiple scales at the encoder stage. To better consolidate features at different scales, we further propose a Scale-aware Attention Module (SAM) that captures channel-wise attention for cross-scale fusion. The two attention-based modules exploit the complementarity of the modalities and the differing importance of scales to narrow these gaps for road segmentation. Extensive experiments demonstrate that our method achieves competitive segmentation performance at a low computational cost.
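
The record contains only the abstract, but it describes the two attention mechanisms concretely enough to sketch them. Below is a minimal, hypothetical PyTorch sketch of the two ideas: cross-attention in which one modality's features attend to the other's (the role the abstract assigns to MST), and channel-wise attention over concatenated multi-scale features (the role assigned to SAM). All class names, dimensions, and design details here are illustrative assumptions, not the thesis's actual implementation.

```python
import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    """Hypothetical sketch: features of one modality (e.g. RGB) attend to
    the other (e.g. depth) with standard multi-head cross-attention, in the
    spirit of the MST module described in the abstract."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feats: torch.Tensor,
                context_feats: torch.Tensor) -> torch.Tensor:
        # query_feats, context_feats: (batch, tokens, dim), e.g. flattened
        # feature maps of one scale from the RGB and depth encoder branches.
        fused, _ = self.attn(query_feats, context_feats, context_feats)
        return self.norm(query_feats + fused)  # residual connection


class ScaleAttention(nn.Module):
    """Hypothetical sketch of channel-wise attention over multi-scale
    features (the role the abstract assigns to SAM): a squeeze-and-
    excitation-style gate reweights channels of the concatenated scales."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # squeeze H, W
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                  # weights in [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, H, W) concatenation of resized scale features.
        return x * self.gate(x)


# Toy usage: fuse one scale of RGB/depth tokens, then gate concatenated scales.
rgb = torch.randn(2, 196, 64)        # (batch, 14*14 tokens, dim)
depth = torch.randn(2, 196, 64)
fused = CrossModalAttention(dim=64)(rgb, depth)      # -> (2, 196, 64)
scales = torch.randn(2, 128, 56, 56)                 # concatenated scales
weighted = ScaleAttention(channels=128)(scales)      # -> (2, 128, 56, 56)
```

The channel gate above follows a squeeze-and-excitation pattern, a common way to realize channel-wise attention across concatenated scales; the thesis's actual SAM may compute and apply its weights differently.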