Text-to-drawing translation with limited data
Main Author: | |
---|---|
Other Authors: | |
Format: | Final Year Project |
Language: | English |
Published: | Nanyang Technological University, 2023 |
Subjects: | |
Online Access: | https://hdl.handle.net/10356/166685 |
Institution: | Nanyang Technological University |
Summary: | Text-to-image translation has seen significant development with the assistance of enormous datasets and novel technologies. OpenAI's CLIP (Contrastive Language-Image Pretraining) is a large pre-trained neural network that encodes text and images in a shared embedding space, making it possible to correlate visual features with semantic words. Popular text-to-image models such as DALL-E 2, VQGAN-CLIP and Stable Diffusion all draw on CLIP in some way. While the field is dominated by autoregressive (AR) and diffusion models, traditional generative adversarial networks (GANs) can produce high-quality images while requiring much less training data. In this project, with the help of CLIP, we explore the potential of StyleGAN3 for text-to-image translation on a custom dataset of 20k text-image pairs. We investigate three techniques that use CLIP: image re-ranking, a CLIP loss, and CLIP embeddings as latent codes. Across all three settings we find no positive correlation between the input texts and the generated images. We conclude that, although StyleGAN is powerful on its own, a strong text encoder is equally important for building a good text-to-image model. |
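The record does not include the project's code, but the CLIP re-ranking and CLIP-loss ideas named in the summary can be sketched with the standard openai/clip package (imported as `clip`). Everything below is an illustrative assumption rather than the author's implementation: the function name, the ViT-B/32 checkpoint choice, and the premise that candidate images come from a separately trained StyleGAN3 generator.

```python
# Minimal sketch of CLIP-based image re-ranking, one of the three techniques
# mentioned in the summary. Candidate images are assumed to be PIL images
# sampled from a trained StyleGAN3 generator (not shown here).
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def rerank_by_clip(candidate_images, prompt, top_k=4):
    """Rank candidate PIL images by CLIP similarity to the text prompt."""
    with torch.no_grad():
        # Encode and L2-normalise the prompt.
        text_feat = model.encode_text(clip.tokenize([prompt]).to(device))
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

        # Encode and L2-normalise the candidate images as one batch.
        image_batch = torch.stack([preprocess(im) for im in candidate_images]).to(device)
        image_feat = model.encode_image(image_batch)
        image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)

        # Cosine similarity between each candidate and the prompt.
        sims = (image_feat @ text_feat.T).squeeze(1)

    order = sims.argsort(descending=True)[:top_k]
    return [candidate_images[i] for i in order], sims[order]
```

In the re-ranking setting, the generator is typically sampled several times per prompt and only the highest-scoring images are kept; a CLIP loss for training is essentially the negated (or one-minus) similarity computed the same way on the generated image.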