Visual dialog system

In this era of Artificial Intelligence, Large Language Models (LLMs) have emerged as powerful tools, driving a revolution in natural language understanding and generation tasks across different domains. Innovations such as OpenAI's Generative Pretrained Transformer (GPT) series have demonstrated an outstanding ability to comprehend and generate coherent text. The continued evolution of Artificial Intelligence has advanced beyond linguistic abilities, enabling the integration of multimodal functionality. Multimodal Large Language Models (MLLMs) mark a remarkable development by extending the capabilities of LLMs to visual and auditory information. With this modality integration, models can process and comprehend more diverse input channels, such as images, audio, and video. Among these modalities, images stand out as the most widely used means of communication, attracting a large volume of research and development on visual language models. With strong proficiency in visual comprehension and reasoning, MLLMs can serve as significant aids in practical applications, including image captioning and visual question answering. To make these powerful MLLMs accessible to general users, the development of a user-friendly Visual Dialog System becomes pivotal. Serving as a bridge between users and MLLMs, such a system can facilitate seamless multi-round conversations involving images and text. Additionally, a proper instructional prompting scheme is important to give the MLLM the contextual information it needs for smooth multi-round conversation.

This project develops a visual dialog system that employs an MLLM with an appropriate prompting scheme and a web User Interface (UI) integrating textual and visual elements cohesively, allowing interactive conversations between state-of-the-art MLLMs and users. The initial step provides the model with historical information to ensure a smooth multi-round conversation. Subsequently, a UI is created with interactive components that let users submit images and queries and receive responses from the MLLM. In this project, a prompt combining the new question with a summarisation of the two previous answers increased user satisfaction by nearly 50% compared with no contextual prompting, highlighting its potential as a promising, cost-efficient way to supply context at inference time for a visual dialog system. The outcome of this project potentially lays the groundwork for further domain-specific applications, including education, content creation, and virtual assistants, where visual dialog plays a crucial role in helping humans harness the power of AI.
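The contextual prompting scheme described in the abstract can be sketched as follows. This is a minimal illustration, not code from the project: the function names, the dialog-history structure, and the word-truncation stand-in for summarisation are all assumptions (the project would plausibly use the MLLM itself to produce the summaries).

```python
# Hypothetical sketch of the contextual prompting scheme: the new
# question is combined with summaries of the two previous answers
# before being sent to the MLLM. Names and structure are illustrative.

def summarise(answer: str, max_words: int = 30) -> str:
    """Stand-in summariser: truncate to the first max_words words.
    A real system would likely ask the MLLM for a summary instead."""
    return " ".join(answer.split()[:max_words])

def build_contextual_prompt(history: list, new_question: str) -> str:
    """Combine the new question with summaries of the last two answers."""
    recent_answers = [turn["answer"] for turn in history][-2:]
    context_lines = [f"Previous answer summary: {summarise(a)}"
                     for a in recent_answers]
    return "\n".join(context_lines + [f"Question: {new_question}"])

# Example multi-round dialog history (hypothetical data).
history = [
    {"question": "What is in the image?",
     "answer": "A dog playing in a park."},
    {"question": "What breed is it?",
     "answer": "It looks like a golden retriever."},
]
prompt = build_contextual_prompt(history, "Is it on a leash?")
print(prompt)
```

Supplying only compact summaries of recent answers, rather than the full transcript, is what makes this scheme cost-efficient at inference time: the context grows by a bounded amount per round instead of linearly with the whole conversation.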

Full description

Bibliographic Details
Main Author: Luong, Hien Nga
Other Authors: Hanwang Zhang
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2024
Subjects: Computer and Information Science; Visual dialog
Online Access:https://hdl.handle.net/10356/175080
Institution: Nanyang Technological University
School: School of Computer Science and Engineering
Supervisor: Hanwang Zhang (hanwangzhang@ntu.edu.sg)
Degree: Bachelor's degree
Project code: SCSE23-0215
Collection: DR-NTU, NTU Library
Citation: Luong, H. N. (2024). Visual dialog system. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/175080