Unsupervised modality adaptation with text-to-image diffusion models for semantic segmentation

Despite their success, unsupervised domain adaptation methods for semantic segmentation primarily focus on adaptation between image domains and do not utilize other abundant visual modalities like depth, infrared and event. This limitation hinders their performance and restricts their application in...

Bibliographic Details
Main Authors: XIA, Ruihao, LIANG, Yu, JIANG, Peng-Tao, ZHANG, Hao, LI, Bo, TANG, Yang, ZHOU, Pan
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2024
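Subjects: Semantic Segmentation; Modality adaptation; Text-to-Image diffusion models; Domain adaptation; Computer Sciences; Graphics and Human Computer Interfaces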
Online Access: https://ink.library.smu.edu.sg/sis_research/9729
https://ink.library.smu.edu.sg/context/sis_research/article/10729/viewcontent/2410.21708v1.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-10729
record_format dspace
spelling sg-smu-ink.sis_research-10729 2024-12-16T06:55:30Z
2024-12-01T08:00:00Z text application/pdf
info:doi/https://nips.cc/virtual/2024/poster/96606
http://creativecommons.org/licenses/by-nc-nd/4.0/
Research Collection School Of Computing and Information Systems
eng
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Semantic Segmentation
Modality adaptation
Text-to-Image diffusion models
Domain adaptation
Computer Sciences
Graphics and Human Computer Interfaces
description Despite their success, unsupervised domain adaptation methods for semantic segmentation primarily focus on adaptation between image domains and do not utilize other abundant visual modalities such as depth, infrared, and event. This limitation hinders their performance and restricts their application in real-world multimodal scenarios. To address this issue, we propose Modality Adaptation with text-to-image Diffusion Models (MADM) for the semantic segmentation task, which utilizes text-to-image diffusion models pre-trained on extensive image-text pairs to enhance the model’s cross-modality capabilities. Specifically, MADM comprises two complementary components that tackle the major challenges. First, due to the large modality gap, using data from one modality to generate pseudo-labels for another modality suffers from a significant drop in accuracy. To address this, MADM designs diffusion-based pseudo-label generation, which adds latent noise to stabilize pseudo-labels and enhance label accuracy. Second, to overcome the limitations of low-resolution latent features in diffusion models, MADM introduces the label palette and latent regression, which convert one-hot encoded labels into RGB form via a palette and regress them in the latent space, so that the pre-trained decoder can be used for up-sampling to obtain fine-grained features. Extensive experimental results demonstrate that MADM achieves state-of-the-art adaptation performance across various modality tasks, including image to depth, infrared, and event modalities.
format text
author XIA, Ruihao
LIANG, Yu
JIANG, Peng-Tao
ZHANG, Hao
LI, Bo
TANG, Yang
ZHOU, Pan
author_sort XIA, Ruihao
title Unsupervised modality adaptation with text-to-image diffusion models for semantic segmentation
publisher Institutional Knowledge at Singapore Management University
publishDate 2024
url https://ink.library.smu.edu.sg/sis_research/9729
https://ink.library.smu.edu.sg/context/sis_research/article/10729/viewcontent/2410.21708v1.pdf
_version_ 1819113121214627840
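
The label palette step mentioned in the description (converting one-hot encoded labels into RGB form) can be illustrated with a minimal sketch. This is not the authors' implementation: the three-class palette, the colours, and the function names `labels_to_rgb` and `rgb_to_labels` are hypothetical, and the latent-space regression and pre-trained decoder are omitted entirely. The sketch only shows how a palette lookup maps labels to colours and how a nearest-colour search could map a slightly perturbed RGB prediction back to class indices.

```python
import numpy as np

# Hypothetical 3-class palette (one RGB colour per class). The colours are
# illustrative assumptions, not values taken from the paper.
PALETTE = np.array(
    [[128, 64, 128], [220, 20, 60], [0, 0, 142]], dtype=np.float32
)


def labels_to_rgb(one_hot):
    """Convert (H, W, C) one-hot labels into an (H, W, 3) RGB image."""
    # A one-hot vector times the palette selects that class's colour.
    return one_hot @ PALETTE


def rgb_to_labels(rgb):
    """Map each pixel to the index of its nearest palette colour."""
    # Distances from every pixel to every palette colour: shape (H, W, C).
    dists = np.linalg.norm(rgb[..., None, :] - PALETTE, axis=-1)
    return dists.argmin(axis=-1)


# A tiny 2x2 class map, encoded as one-hot labels.
class_map = np.array([[0, 1], [2, 1]])
one_hot = np.eye(len(PALETTE), dtype=np.float32)[class_map]

rgb = labels_to_rgb(one_hot)

# Simulate an imperfect regression output by shifting every channel slightly.
noisy = rgb + 10.0
recovered = rgb_to_labels(noisy)
```

Because the palette colours in this sketch are far apart, the nearest-colour lookup tolerates small regression errors; a real palette with many classes would need colours chosen with that separation in mind.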