Learning deep networks for image classification
Visual Question Answering stands at the intersection of computer vision and natural language processing, bridging the semantic gap between visual information and textual queries. The dominant approach to this complex task, end-to-end models, does not separate visual processing from reasoning, which limits both interpretability and generalization. Modular program learning has emerged as a promising alternative, although it is difficult to implement because the modules and the programs must be learned simultaneously. This project introduces VQA-GPT, a framework that uses code-generation models and the Python interpreter to compose vision-and-language modules and answer textual queries. This zero-shot method outperforms traditional end-to-end models on a range of complex visual tasks.
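As a rough illustration of the framework described in the abstract, the sketch below shows how a code-generation model and the Python interpreter might compose vision-and-language modules. The module names (`find`, `simple_query`), the `generate_program` helper, and the expectation that the generated program defines `execute(image)` are assumptions made for this example, not the project's actual API.

```python
# Illustrative sketch only: module names and the program format are assumptions,
# not the actual VQA-GPT interface.

def find(image, object_name):
    # Placeholder vision module: would return image patches matching
    # object_name (e.g., via an open-vocabulary detector).
    raise NotImplementedError

def simple_query(image, question):
    # Placeholder vision-language module: would answer a basic question
    # about an image or patch (e.g., via a captioning/VQA model).
    raise NotImplementedError

def generate_program(question):
    # Placeholder code-generation step: a code LLM would be prompted with the
    # available modules and asked to write a program answering `question`.
    # A fixed example program is returned here purely for illustration.
    return (
        "def execute(image):\n"
        "    dogs = find(image, 'dog')\n"
        "    return simple_query(dogs[0], 'What color is it?')\n"
    )

def answer(image, question):
    # 1. The code-generation model writes a question-specific program.
    source = generate_program(question)
    # 2. The Python interpreter executes it, composing the modules above.
    scope = {"find": find, "simple_query": simple_query}
    exec(source, scope)
    # 3. The generated program is assumed to define execute(image).
    return scope["execute"](image)
```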
Saved in:
Main Author: | Zhou, Yixuan |
---|---|
Other Authors: | Hanwang Zhang |
Format: | Final Year Project |
Language: | English |
Published: | Nanyang Technological University, 2024 |
Subjects: | Computer and Information Science |
Online Access: | https://hdl.handle.net/10356/175074 |
Institution: | Nanyang Technological University |
Language: | English |
id | sg-ntu-dr.10356-175074
record_format | dspace
spelling | sg-ntu-dr.10356-175074 2024-04-19T15:46:03Z Learning deep networks for image classification Zhou, Yixuan Hanwang Zhang School of Computer Science and Engineering hanwangzhang@ntu.edu.sg Computer and Information Science Visual Question Answering stands at the intersection of computer vision and natural language processing, bridging the semantic gap between visual information and textual queries. The dominant approach to this complex task, end-to-end models, does not separate visual processing from reasoning, which limits both interpretability and generalization. Modular program learning has emerged as a promising alternative, although it is difficult to implement because the modules and the programs must be learned simultaneously. This project introduces VQA-GPT, a framework that uses code-generation models and the Python interpreter to compose vision-and-language modules and answer textual queries. This zero-shot method outperforms traditional end-to-end models on a range of complex visual tasks. Bachelor's degree 2024-04-19T04:01:42Z 2024-04-19T04:01:42Z 2024 Final Year Project (FYP) Zhou, Y. (2024). Learning deep networks for image classification. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/175074 https://hdl.handle.net/10356/175074 en SCSE23-0210 application/pdf Nanyang Technological University
institution | Nanyang Technological University
building | NTU Library
continent | Asia
country | Singapore Singapore
content_provider | NTU Library
collection | DR-NTU
language | English
topic | Computer and Information Science
spellingShingle | Computer and Information Science Zhou, Yixuan Learning deep networks for image classification
description | Visual Question Answering stands at the intersection of computer vision and natural language processing, bridging the semantic gap between visual information and textual queries. The dominant approach to this complex task, end-to-end models, does not separate visual processing from reasoning, which limits both interpretability and generalization. Modular program learning has emerged as a promising alternative, although it is difficult to implement because the modules and the programs must be learned simultaneously. This project introduces VQA-GPT, a framework that uses code-generation models and the Python interpreter to compose vision-and-language modules and answer textual queries. This zero-shot method outperforms traditional end-to-end models on a range of complex visual tasks.
author2 | Hanwang Zhang
author_facet | Hanwang Zhang Zhou, Yixuan
format | Final Year Project
author | Zhou, Yixuan
author_sort | Zhou, Yixuan
title | Learning deep networks for image classification
title_short | Learning deep networks for image classification
title_full | Learning deep networks for image classification
title_fullStr | Learning deep networks for image classification
title_full_unstemmed | Learning deep networks for image classification
title_sort | learning deep networks for image classification
publisher | Nanyang Technological University
publishDate | 2024
url | https://hdl.handle.net/10356/175074
_version_ | 1800916358813188096