Coherent visual story generation using diffusion models
Main Author:
Other Authors:
Format: Final Year Project
Language: English
Published: Nanyang Technological University, 2024
Subjects:
Online Access: https://hdl.handle.net/10356/175145
Institution: Nanyang Technological University
Summary: In recent years, the advent of diffusion models has unlocked new possibilities in generative tasks, particularly text-to-image generation. State-of-the-art models can create exquisite images that satisfy users’ requirements and are rich in detail. In the last few years, several works have explored the potential of diffusion models for story visualization and achieved impressive results. These methods design dedicated network structures and train on closed-set image-text pairs to enforce consistency across image sequences, effectively generating coherent characters and scenes. However, because of the closed-set setting, the fine-tuned models can only generate stories within a specific domain of predefined characters; generalizing to another domain requires extensive retraining on a newly curated image-text story dataset. This greatly limits existing visual story generation approaches and prevents their use in real-world, open-set applications. In this research project, we explore generalizable approaches for generating coherent visual story images from text descriptions using diffusion models. Unlike previous visual story generation work on closed-set datasets, this project focuses on the open-set scenario to mimic real-world challenges. Our key idea is to efficiently extract new visual concepts from only a small number of customized images and then use the learned concepts to generate story image sequences; coherence is ensured because the sequence contains consistent main characters. By incorporating state-of-the-art customization techniques for diffusion models, we bridge the gap between visual and linguistic elements, generating coherent visual stories from diverse story text descriptions. To better support this task, we further contribute an OpenStory dataset for benchmarking purposes. Qualitative and quantitative experiments demonstrate the effectiveness of the proposed approach.
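The following is a minimal sketch of the pipeline described in the summary: learn a character concept from a few customized images, then reuse the learned concept token across every story prompt so the generated sequence keeps a consistent main character. It is not the project's actual code; it assumes the Hugging Face diffusers library with a textual-inversion style of customization, and the model ID, the concept path `./learned_concepts/hero`, and the placeholder token `<hero>` are hypothetical. The project may use a different base model or customization technique.

```python
# Sketch only: concept customization + story generation with diffusers.
import torch
from diffusers import StableDiffusionPipeline

# Load a base text-to-image diffusion model (model ID is an assumption).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load embeddings learned from a handful of customized character images.
# The path and the placeholder token "<hero>" are hypothetical.
pipe.load_textual_inversion("./learned_concepts/hero", token="<hero>")

# A toy story: every prompt references the learned concept so the
# generated image sequence contains the same main character.
story_prompts = [
    "<hero> waking up in a small village at dawn",
    "<hero> crossing a stone bridge toward a distant castle",
    "<hero> celebrating with villagers under fireworks at night",
]

generator = torch.Generator("cuda").manual_seed(0)  # fixed seed for reproducibility
frames = [
    pipe(prompt, num_inference_steps=30, generator=generator).images[0]
    for prompt in story_prompts
]
for i, frame in enumerate(frames):
    frame.save(f"story_frame_{i}.png")
```

Because the concept is learned once and then referenced by a token, new characters or domains only require a few customization images rather than retraining on a full closed-set story dataset, which is the open-set benefit the summary describes.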