Visual dialog system
In this era of Artificial Intelligence, Large Language Models (LLMs) have emerged as powerful tools, driving a revolution in natural language understanding and generation tasks across different domains. Innovations such as OpenAI’s Generative Pretrained Transformer (GPT) series have proven their outstanding ability to comprehend and generate coherent text. The continuous evolution of Artificial Intelligence has led to advances beyond linguistic abilities, enabling the integration of multimodal functionalities.
Multimodal Large Language Models (MLLMs) represent a remarkable development, extending the capabilities of LLMs to visual and auditory information. With this modality integration, models can process and comprehend more diverse input channels, such as images, audio, and video. Among these modalities, images stand out as the most widely utilised means of communication, attracting a large volume of research and development in visual language models. With their proficiency in visual comprehension and reasoning, MLLMs can serve as significant aids in practical applications, including image captioning and visual question answering. To make these powerful MLLMs accessible to general users, the development of a user-friendly Visual Dialog System becomes pivotal. Serving as a bridge between users and MLLMs, such a system can facilitate seamless multi-round conversations involving images and text. Additionally, a proper instructional prompting scheme is important for giving the MLLM the contextual information it needs to sustain a smooth multi-round conversation.
This project aims to develop a visual dialog system employing an MLLM with an appropriate prompting scheme and a web User Interface (UI) that integrates textual and visual elements cohesively, allowing interactive conversations between state-of-the-art MLLMs and users. The initial step involves providing the model with conversation history to ensure a smooth multi-round conversation. Subsequently, a UI is created with interactive components that allow users to input images and queries and to receive responses from the MLLM. In this project, a prompt combining the new question with a summarisation of the two previous answers increased user satisfaction by nearly 50% compared to no contextual prompting, highlighting its potential as a promising, cost-efficient means of providing context at inference time for visual dialog systems. The outcome of this project potentially lays the groundwork for further domain-specific applications, including education, content creation, and virtual assistants, where visual dialogs play a crucial role in helping humans harness the power of AI.
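The abstract describes a contextual prompting scheme that combines the new question with a summarisation of the two previous answers. A minimal sketch of that idea follows; the function and helper names here are illustrative assumptions, and a real system would likely replace the toy summariser with another model call, not the project's actual implementation.

```python
def build_contextual_prompt(question, history, summarize, max_context=2):
    """Prepend a summary of the last `max_context` answers to the new
    question -- the contextual prompting scheme the abstract reports
    raised user satisfaction by nearly 50% over no contextual prompting.

    `history` is a list of (question, answer) pairs from earlier turns.
    """
    recent_answers = [answer for _, answer in history[-max_context:]]
    if not recent_answers:
        return question  # first turn: no context to summarise yet
    context = summarize(recent_answers)
    return (f"Summary of the previous answers: {context}\n"
            f"New question: {question}")


def naive_summarize(answers):
    """Stand-in summariser that keeps only the first sentence of each
    answer; in practice a summarisation model would do this step."""
    return " ".join(a.split(".")[0].strip() + "." for a in answers)
```

Because only the two most recent answers are ever summarised, the extra prompt length stays roughly constant as the dialog grows, which is presumably why the abstract calls the scheme cost-efficient at inference time.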
Saved in:
Main Author: | Luong, Hien Nga |
---|---|
Other Authors: | Hanwang Zhang |
Format: | Final Year Project |
Language: | English |
Published: | Nanyang Technological University, 2024 |
Subjects: | Computer and Information Science; Visual dialog |
Online Access: | https://hdl.handle.net/10356/175080 |
Tags: | No Tags |
Institution: | Nanyang Technological University |
Language: | English |
id | sg-ntu-dr.10356-175080 |
---|---|
record_format | dspace |
spelling | sg-ntu-dr.10356-175080 2024-04-19T15:45:28Z Visual dialog system Luong, Hien Nga Hanwang Zhang School of Computer Science and Engineering hanwangzhang@ntu.edu.sg Computer and Information Science Visual dialog In this era of Artificial Intelligence, Large Language Models (LLMs) have emerged as powerful tools, driving a revolution in natural language understanding and generation tasks across different domains. Innovations such as OpenAI’s Generative Pretrained Transformer (GPT) series have proven their outstanding ability to comprehend and generate coherent text. The continuous evolution of Artificial Intelligence has led to advances beyond linguistic abilities, enabling the integration of multimodal functionalities. Multimodal Large Language Models (MLLMs) represent a remarkable development, extending the capabilities of LLMs to visual and auditory information. With this modality integration, models can process and comprehend more diverse input channels, such as images, audio, and video. Among these modalities, images stand out as the most widely utilised means of communication, attracting a large volume of research and development in visual language models. With their proficiency in visual comprehension and reasoning, MLLMs can serve as significant aids in practical applications, including image captioning and visual question answering. To make these powerful MLLMs accessible to general users, the development of a user-friendly Visual Dialog System becomes pivotal. Serving as a bridge between users and MLLMs, such a system can facilitate seamless multi-round conversations involving images and text. Additionally, a proper instructional prompting scheme is important for giving the MLLM the contextual information it needs to sustain a smooth multi-round conversation. This project aims to develop a visual dialog system employing an MLLM with an appropriate prompting scheme and a web User Interface (UI) that integrates textual and visual elements cohesively, allowing interactive conversations between state-of-the-art MLLMs and users. The initial step involves providing the model with conversation history to ensure a smooth multi-round conversation. Subsequently, a UI is created with interactive components that allow users to input images and queries and to receive responses from the MLLM. In this project, a prompt combining the new question with a summarisation of the two previous answers increased user satisfaction by nearly 50% compared to no contextual prompting, highlighting its potential as a promising, cost-efficient means of providing context at inference time for visual dialog systems. The outcome of this project potentially lays the groundwork for further domain-specific applications, including education, content creation, and virtual assistants, where visual dialogs play a crucial role in helping humans harness the power of AI. Bachelor's degree 2024-04-19T04:22:51Z 2024-04-19T04:22:51Z 2024 Final Year Project (FYP) Luong, H. N. (2024). Visual dialog system. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/175080 https://hdl.handle.net/10356/175080 en SCSE23-0215 application/pdf Nanyang Technological University |
institution | Nanyang Technological University |
building | NTU Library |
continent | Asia |
country | Singapore |
content_provider | NTU Library |
collection | DR-NTU |
language | English |
topic | Computer and Information Science; Visual dialog |
spellingShingle | Computer and Information Science Visual dialog Luong, Hien Nga Visual dialog system |
description | In this era of Artificial Intelligence, Large Language Models (LLMs) have emerged as powerful tools, driving a revolution in natural language understanding and generation tasks across different domains. Innovations such as OpenAI’s Generative Pretrained Transformer (GPT) series have proven their outstanding ability to comprehend and generate coherent text. The continuous evolution of Artificial Intelligence has led to advances beyond linguistic abilities, enabling the integration of multimodal functionalities. Multimodal Large Language Models (MLLMs) represent a remarkable development, extending the capabilities of LLMs to visual and auditory information. With this modality integration, models can process and comprehend more diverse input channels, such as images, audio, and video. Among these modalities, images stand out as the most widely utilised means of communication, attracting a large volume of research and development in visual language models. With their proficiency in visual comprehension and reasoning, MLLMs can serve as significant aids in practical applications, including image captioning and visual question answering. To make these powerful MLLMs accessible to general users, the development of a user-friendly Visual Dialog System becomes pivotal. Serving as a bridge between users and MLLMs, such a system can facilitate seamless multi-round conversations involving images and text. Additionally, a proper instructional prompting scheme is important for giving the MLLM the contextual information it needs to sustain a smooth multi-round conversation. This project aims to develop a visual dialog system employing an MLLM with an appropriate prompting scheme and a web User Interface (UI) that integrates textual and visual elements cohesively, allowing interactive conversations between state-of-the-art MLLMs and users. The initial step involves providing the model with conversation history to ensure a smooth multi-round conversation. Subsequently, a UI is created with interactive components that allow users to input images and queries and to receive responses from the MLLM. In this project, a prompt combining the new question with a summarisation of the two previous answers increased user satisfaction by nearly 50% compared to no contextual prompting, highlighting its potential as a promising, cost-efficient means of providing context at inference time for visual dialog systems. The outcome of this project potentially lays the groundwork for further domain-specific applications, including education, content creation, and virtual assistants, where visual dialogs play a crucial role in helping humans harness the power of AI. |
author2 | Hanwang Zhang |
author_facet | Hanwang Zhang; Luong, Hien Nga |
format | Final Year Project |
author | Luong, Hien Nga |
author_sort | Luong, Hien Nga |
title | Visual dialog system |
title_short | Visual dialog system |
title_full | Visual dialog system |
title_fullStr | Visual dialog system |
title_full_unstemmed | Visual dialog system |
title_sort | visual dialog system |
publisher | Nanyang Technological University |
publishDate | 2024 |
url | https://hdl.handle.net/10356/175080 |
_version_ | 1806059740813328384 |