Text-to-drawing translation with limited data

Text-to-image translation has seen significant development with the assistance of enormous datasets and novel technologies. OpenAI's CLIP (Contrastive Language-Image Pre-training) is a comprehensive pre-trained neural network that encodes text and images in the same embedding space, thus providing the ability to correlate visual features with semantic words. Popular text-to-image models such as DALL-E 2, VQGAN-CLIP and Stable Diffusion all utilize CLIP's power in some way. While the field is dominated mainly by autoregressive (AR) and diffusion models, traditional generative adversarial networks (GANs) can produce high-quality images while requiring much less training data. In this project, with the help of CLIP, we explore the potential of StyleGAN3 for text-to-image translation on a custom dataset of 20k text-image pairs. We demonstrate three techniques with CLIP: image re-ranking, CLIP loss, and CLIP embedding as latent. We experiment with the three settings and find no positive correlation between the texts and the generated images. We conclude that although StyleGAN is powerful on its own, a strong text encoder is equally important for a good text-to-image model.
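The full thesis is not included in this record, but the two CLIP-based scoring techniques named in the abstract — image re-ranking and the CLIP loss — can be sketched from the abstract's description of CLIP's shared embedding space. The sketch below uses toy 3-d vectors standing in for real CLIP embeddings, and the function names (`clip_loss`, `rerank`) are illustrative, not the project's actual API.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def clip_loss(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """A common form of CLIP guidance loss: 1 - cosine similarity.

    Lower values mean the image embedding sits closer to the text
    embedding in the shared space (assumed form, not the thesis's exact loss).
    """
    return 1.0 - cosine_similarity(image_emb, text_emb)

def rerank(text_emb: np.ndarray, image_embs: list) -> list:
    """Indices of candidate images, best text match first."""
    return sorted(range(len(image_embs)),
                  key=lambda i: cosine_similarity(text_emb, image_embs[i]),
                  reverse=True)

# Toy embeddings standing in for CLIP's text and image encoders.
text = np.array([1.0, 0.0, 0.0])
candidates = [np.array([0.0, 1.0, 0.0]),   # orthogonal: poor match
              np.array([0.9, 0.1, 0.0]),   # nearly parallel: good match
              np.array([0.5, 0.5, 0.0])]   # partial match

order = rerank(text, candidates)
print(order)  # best-first: the nearly-parallel candidate ranks first
```

In the re-ranking setting, a GAN would generate several candidates per prompt and keep only the top-ranked one; in the loss setting, `clip_loss` would instead be backpropagated through the generator during training.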

Bibliographic Details
Main Author: Deng, Ziyang
Other Authors: Chen Change Loy (School of Computer Science and Engineering, ccloy@ntu.edu.sg)
Format: Final Year Project (FYP)
Degree: Bachelor of Engineering (Computer Engineering)
Language: English
Published: Nanyang Technological University, 2023
Subjects: Engineering::Computer science and engineering
Project Code: SCSE22-0309
Citation: Deng, Z. (2023). Text-to-drawing translation with limited data. Final Year Project (FYP), Nanyang Technological University, Singapore.
Online Access: https://hdl.handle.net/10356/166685
Institution: Nanyang Technological University