MosaicFusion: diffusion models as data augmenters for large vocabulary instance segmentation

We present MosaicFusion, a simple yet effective diffusion-based data augmentation approach for large vocabulary instance segmentation. Our method is training-free and does not rely on any label supervision. Two key designs enable us to employ an off-the-shelf text-to-image diffusion model as a useful dataset generator for object instances and mask annotations. First, we divide an image canvas into several regions and perform a single round of diffusion process to generate multiple instances simultaneously, conditioning on different text prompts. Second, we obtain corresponding instance masks by aggregating cross-attention maps associated with object prompts across layers and diffusion time steps, followed by simple thresholding and edge-aware refinement processing. Without bells and whistles, our MosaicFusion can produce a significant amount of synthetic labeled data for both rare and novel categories. Experimental results on the challenging LVIS long-tailed and open-vocabulary benchmarks demonstrate that MosaicFusion can significantly improve the performance of existing instance segmentation models, especially for rare and novel categories. Code: https://github.com/Jiahao000/MosaicFusion.
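The mask-extraction step described in the abstract (aggregating cross-attention maps across layers and time steps, then thresholding) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function name, input format, and threshold value are assumptions, and the edge-aware refinement step is omitted.

```python
import numpy as np

def masks_from_cross_attention(attn_maps, threshold=0.5):
    """Aggregate cross-attention maps for one object prompt into a
    binary instance mask (illustrative sketch).

    attn_maps: list of 2-D arrays of shape (H, W), one per
               (layer, diffusion time step), each holding the
               attention weights for the object token.
    """
    # Average the maps across layers and diffusion time steps.
    agg = np.mean(np.stack(attn_maps, axis=0), axis=0)
    # Normalise to [0, 1] so a single threshold is meaningful.
    agg = (agg - agg.min()) / (agg.max() - agg.min() + 1e-8)
    # Simple thresholding; MosaicFusion additionally applies
    # edge-aware refinement, which is not shown here.
    return agg > threshold
```

In the actual method, one such mask is produced per region of the mosaic canvas, each tied to the text prompt that conditioned that region.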

Bibliographic Details
Main Authors: Xie, Jiahao, Li, Wei, Li, Xiangtai, Liu, Ziwei, Ong, Yew Soon, Loy, Chen Change
Other Authors: College of Computing and Data Science
Format: Article
Language: English
Published: 2025
Subjects: Computer and Information Science; Text-to-image diffusion models; Long tail
Online Access:https://hdl.handle.net/10356/182213
Institution: Nanyang Technological University
Record ID: sg-ntu-dr.10356-182213
Record Format: dspace
Research Units: College of Computing and Data Science; S-Lab
Funding Agencies: Agency for Science, Technology and Research (A*STAR); Ministry of Education (MOE); Nanyang Technological University
Funding Note: This study is supported under the RIE2020 Industry Alignment Fund - Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s). The project is also supported by NTU NAP and Singapore MOE AcRF Tier 2 (MOE-T2EP20120-0001, MOE-T2EP20221-0012).
Date Deposited: 2025-01-15
Citation: Xie, J., Li, W., Li, X., Liu, Z., Ong, Y. S. & Loy, C. C. (2024). MosaicFusion: diffusion models as data augmenters for large vocabulary instance segmentation. International Journal of Computer Vision. https://dx.doi.org/10.1007/s11263-024-02223-3
Journal: International Journal of Computer Vision
ISSN: 0920-5691
DOI: 10.1007/s11263-024-02223-3
Scopus ID: 2-s2.0-85205847558
Grants: MOE-T2EP20120-0001; MOE-T2EP20221-0012; IAF-ICP; NTU NAP
Rights: © 2024 The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature. All rights reserved.
Collection: DR-NTU (NTU Library, Nanyang Technological University, Singapore)