Coherent visual story generation using diffusion models

Bibliographic Details
Main Author: Jiang, Jiaxi
Other Authors: Liu, Ziwei
Format: Final Year Project
Language: English
Published: Nanyang Technological University, 2024
Online Access:https://hdl.handle.net/10356/175145
Institution: Nanyang Technological University
Description
Summary: In recent years, the advent of diffusion models has unlocked new possibilities in generative tasks, particularly in text-to-image generation. State-of-the-art models can create exquisite images that both satisfy users’ requirements and are rich in detail. Several recent works have explored the potential of diffusion models for story visualization and achieved impressive results. These methods design specific network architectures and train on closed-set image-text pairs to enforce consistency across image sequences, effectively generating coherent characters and scenes. However, due to the closed-set setting, the fine-tuned models can only generate stories within a specific domain of predefined characters; generalizing to another domain requires extensive retraining on a newly curated image-text story dataset. This greatly limits the potential of existing visual story generation approaches and prevents their use in real-world, open-set applications. In this research project, we explore generalizable approaches for generating coherent visual story images from text descriptions using diffusion models. Unlike previous visual story generation work on closed-set datasets, this project focuses on the open-set scenario to mimic real-world challenges. Our key idea is to efficiently extract new visual concepts from only a small number of customized images and then use the learned concepts to generate story image sequences. In this way, the coherence of the story image sequence is ensured by keeping the main characters consistent across frames. By incorporating state-of-the-art customization techniques for diffusion models, we effectively bridge the gap between visual and linguistic elements, generating coherent visual stories from diverse story text descriptions. To better support this task, we further contribute an OpenStory dataset for benchmarking purposes. Qualitative and quantitative experiments demonstrate the effectiveness of the proposed approach.
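
The record does not specify which customization technique the project uses. As a purely illustrative sketch, the snippet below shows one way the described pipeline could look with Hugging Face diffusers, assuming a new visual concept has already been learned from a few customized images (e.g. via textual inversion) and saved as learned_embeds.bin under the placeholder token <hero>; the model ID, file name, token, and prompts are all hypothetical. The point of the sketch is that reusing the same learned concept token in every story prompt is what keeps the main character consistent across the generated image sequence.

    # Illustrative sketch only; not the project's actual implementation.
    import torch
    from diffusers import StableDiffusionPipeline

    # Load a base text-to-image diffusion model (assumed checkpoint).
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    # Inject the learned concept embedding so "<hero>" refers to the customized character.
    pipe.load_textual_inversion("learned_embeds.bin", token="<hero>")

    # A short open-set story: every prompt reuses the same concept token,
    # which ties the frames together through a consistent main character.
    story_prompts = [
        "<hero> waking up in a small cottage at sunrise",
        "<hero> walking through a misty forest",
        "<hero> discovering a hidden castle on a hill",
    ]

    generator = torch.Generator(device="cuda").manual_seed(42)  # fixed seed for reproducibility
    for i, prompt in enumerate(story_prompts):
        image = pipe(prompt, num_inference_steps=30, generator=generator).images[0]
        image.save(f"story_frame_{i}.png")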