MosaicFusion: diffusion models as data augmenters for large vocabulary instance segmentation
We present MosaicFusion, a simple yet effective diffusion-based data augmentation approach for large vocabulary instance segmentation. Our method is training-free and does not rely on any label supervision. Two key designs enable us to employ an off-the-shelf text-to-image diffusion model as a useful dataset generator for object instances and mask annotations. First, we divide an image canvas into several regions and perform a single round of the diffusion process to generate multiple instances simultaneously, conditioning on different text prompts. Second, we obtain corresponding instance masks by aggregating cross-attention maps associated with object prompts across layers and diffusion time steps, followed by simple thresholding and edge-aware refinement. Without bells and whistles, our MosaicFusion can produce a significant amount of synthetic labeled data for both rare and novel categories. Experimental results on the challenging LVIS long-tailed and open-vocabulary benchmarks demonstrate that MosaicFusion can significantly improve the performance of existing instance segmentation models, especially for rare and novel categories. Code: https://github.com/Jiahao000/MosaicFusion.
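The mask-extraction step described in the abstract (aggregate cross-attention maps for an object prompt across layers and diffusion time steps, then threshold) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, map shapes, and threshold value are assumptions, and the edge-aware refinement step is omitted.

```python
import numpy as np

def attention_to_mask(attn_maps, threshold=0.4):
    """Illustrative sketch (hypothetical helper, not from MosaicFusion's code):
    average cross-attention maps collected across layers/time steps for one
    object prompt, min-max normalize to [0, 1], and threshold into a binary
    instance mask."""
    stacked = np.stack(attn_maps, axis=0)  # (n_maps, H, W)
    avg = stacked.mean(axis=0)             # aggregate over layers/time steps
    norm = (avg - avg.min()) / (avg.max() - avg.min() + 1e-8)
    return (norm >= threshold).astype(np.uint8)

# Toy example: two 4x4 "attention maps" that both highlight a 2x2 patch.
m = np.zeros((4, 4))
m[1:3, 1:3] = 1.0
mask = attention_to_mask([m, 0.5 * m])  # binary 4x4 mask over the patch
```

In the paper's pipeline, one such mask would be produced per region of the mosaic canvas, since each region is conditioned on its own text prompt.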
Saved in:
Main Authors: Xie, Jiahao; Li, Wei; Li, Xiangtai; Liu, Ziwei; Ong, Yew Soon; Loy, Chen Change
Other Authors: College of Computing and Data Science
Format: Article
Language: English
Published: 2025
Subjects: Computer and Information Science; Text-to-image diffusion models; Long tail
Online Access: https://hdl.handle.net/10356/182213
Institution: Nanyang Technological University
id: sg-ntu-dr.10356-182213
record_format: dspace
Authors: Xie, Jiahao; Li, Wei; Li, Xiangtai; Liu, Ziwei; Ong, Yew Soon; Loy, Chen Change
Affiliations: College of Computing and Data Science; S-Lab
Subjects: Computer and Information Science; Text-to-image diffusion models; Long tail
Funding: Agency for Science, Technology and Research (A*STAR); Ministry of Education (MOE); Nanyang Technological University. This study is supported under the RIE2020 Industry Alignment Fund - Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s). The project is also supported by NTU NAP and Singapore MOE AcRF Tier 2 (MOE-T2EP20120-0001, MOE-T2EP20221-0012).
Record created: 2025-01-15T00:53:45Z
Date of issue: 2024 (Journal Article)
Citation: Xie, J., Li, W., Li, X., Liu, Z., Ong, Y. S. & Loy, C. C. (2024). MosaicFusion: diffusion models as data augmenters for large vocabulary instance segmentation. International Journal of Computer Vision. https://dx.doi.org/10.1007/s11263-024-02223-3
ISSN: 0920-5691
Handle: https://hdl.handle.net/10356/182213
DOI: 10.1007/s11263-024-02223-3
Scopus: 2-s2.0-85205847558
Grants: MOE-T2EP20120-0001; MOE-T2EP20221-0012; IAF-ICP; NTU NAP
Journal: International Journal of Computer Vision
Rights: © 2024 The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature. All rights reserved.
institution: Nanyang Technological University
building: NTU Library
continent: Asia
country: Singapore
content_provider: NTU Library
collection: DR-NTU
language: English
topic: Computer and Information Science; Text-to-image diffusion models; Long tail
author2: College of Computing and Data Science
format: Article
author: Xie, Jiahao; Li, Wei; Li, Xiangtai; Liu, Ziwei; Ong, Yew Soon; Loy, Chen Change
author_sort: Xie, Jiahao
title: MosaicFusion: diffusion models as data augmenters for large vocabulary instance segmentation
publishDate: 2025
url: https://hdl.handle.net/10356/182213
_version_: 1821833188201201664