Text-to-drawing translation with limited data
Text-to-image translation has seen significant development with the assistance of enormous datasets and novel technologies. OpenAI's CLIP (Contrastive Language-Image Pretraining) is a large pre-trained neural network that encodes text and images in the same embedding space, providing the ability to correlate visual features with semantic words. Popular text-to-image models such as DALL-E 2, VQGAN-CLIP, and Stable Diffusion all draw on CLIP's power in some way. While the field is dominated by autoregressive (AR) and diffusion models, traditional generative adversarial networks (GANs) can produce high-quality images and require far less training data. In this project, with the help of CLIP, we explore the potential of StyleGAN3 for text-to-image translation on a custom dataset of 20k text-image pairs. We demonstrate three ways of using CLIP: image re-ranking, a CLIP loss, and the CLIP embedding as the latent code. Across the three settings we find no positive correlation between the texts and the generated images. We conclude that although StyleGAN is powerful on its own, a strong text encoder is equally important for a good text-to-image model.
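The first of the three techniques, CLIP-based image re-ranking, can be sketched as follows. This is a minimal illustration, not the project's code: the CLIP text and image encoders are replaced by stand-in random vectors so the example is self-contained, and all names (`rerank`, `EMBED_DIM`) are illustrative assumptions.

```python
# Sketch of CLIP-style image re-ranking: generate several candidate images,
# score each against the text prompt by cosine similarity in the shared
# embedding space, and keep the best-scoring candidate.
# Stand-in embeddings replace real CLIP encoder outputs (assumption).
import numpy as np

EMBED_DIM = 512                      # CLIP ViT-B/32 uses a 512-d space
rng = np.random.default_rng(0)

def normalize(v):
    """Project onto the unit sphere, as CLIP does before scoring."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def rerank(text_emb, image_embs, keep=1):
    """Rank candidate images by cosine similarity to the text embedding
    and return the indices of the best `keep` candidates plus all scores."""
    sims = normalize(image_embs) @ normalize(text_emb)
    order = np.argsort(-sims)        # highest similarity first
    return order[:keep], sims

# Toy data: one text embedding and four candidate image embeddings;
# candidate 2 is constructed to align closely with the text.
text = rng.normal(size=EMBED_DIM)
images = rng.normal(size=(4, EMBED_DIM))
images[2] = text + 0.1 * rng.normal(size=EMBED_DIM)

best, sims = rerank(text, images)
print(best[0])                       # prints 2, the near-duplicate candidate
```

In the real setting, `text` would come from CLIP's text encoder and each row of `images` from CLIP's image encoder applied to a StyleGAN3 sample; the re-ranking step itself is unchanged.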
Saved in:

| Main Author: | Deng, Ziyang |
|---|---|
| Other Authors: | Chen Change Loy |
| Format: | Final Year Project |
| Language: | English |
| Published: | Nanyang Technological University, 2023 |
| Subjects: | Engineering::Computer science and engineering |
| Online Access: | https://hdl.handle.net/10356/166685 |
| Institution: | Nanyang Technological University |
| id | sg-ntu-dr.10356-166685 |
|---|---|
| record_format | dspace |
| last_modified | 2023-05-12T15:36:54Z |
| title | Text-to-drawing translation with limited data |
| author | Deng, Ziyang |
| author2 | Chen Change Loy (ccloy@ntu.edu.sg), School of Computer Science and Engineering |
| topic | Engineering::Computer science and engineering |
| description | Text-to-image translation has seen significant development with the assistance of enormous datasets and novel technologies. OpenAI's CLIP (Contrastive Language-Image Pretraining) is a large pre-trained neural network that encodes text and images in the same embedding space, providing the ability to correlate visual features with semantic words. Popular text-to-image models such as DALL-E 2, VQGAN-CLIP, and Stable Diffusion all draw on CLIP's power in some way. While the field is dominated by autoregressive (AR) and diffusion models, traditional generative adversarial networks (GANs) can produce high-quality images and require far less training data. In this project, with the help of CLIP, we explore the potential of StyleGAN3 for text-to-image translation on a custom dataset of 20k text-image pairs. We demonstrate three ways of using CLIP: image re-ranking, a CLIP loss, and the CLIP embedding as the latent code. Across the three settings we find no positive correlation between the texts and the generated images. We conclude that although StyleGAN is powerful on its own, a strong text encoder is equally important for a good text-to-image model. |
| degree | Bachelor of Engineering (Computer Engineering) |
| citation | Deng, Z. (2023). Text-to-drawing translation with limited data. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/166685 |
| project code | SCSE22-0309 |
| format | Final Year Project (FYP); application/pdf |
| publisher | Nanyang Technological University |
| publishDate | 2023 |
| date_accessioned | 2023-05-09T05:12:51Z |
| institution | Nanyang Technological University |
| building | NTU Library |
| continent | Asia |
| country | Singapore |
| content_provider | NTU Library |
| collection | DR-NTU |
| language | English |
| url | https://hdl.handle.net/10356/166685 |
| _version_ | 1770564192188760064 |