Visual dialog system

In this era of Artificial Intelligence, Large Language Models (LLMs) have emerged as powerful tools, driving a revolution in natural language understanding and generation tasks across different domains. Innovations such as OpenAI's Generative Pretrained Transformer (GPT) series have demonstrated an outstanding ability to comprehend and generate coherent text. The continued evolution of Artificial Intelligence has advanced beyond linguistic abilities, enabling the integration of multimodal functionality. Multimodal Large Language Models (MLLMs) mark a remarkable development by extending the capabilities of LLMs to visual and auditory information. With this modality integration, models can process and comprehend more diverse input channels, such as images, audio, and video. Among these modalities, images stand out as the most widely used means of communication, attracting a large volume of research and development on visual language models. With strong proficiency in visual comprehension and reasoning, MLLMs can serve as significant aids in practical applications, including image captioning and visual question answering. To make these powerful MLLMs accessible to general users, the development of a user-friendly Visual Dialog System becomes pivotal. Serving as a bridge between users and MLLMs, such a system can facilitate seamless multi-round conversations involving images and text. Additionally, a proper instructional prompting scheme is important to give the MLLM the contextual information it needs for smooth multi-round conversation.

This project develops a visual dialog system that employs an MLLM with an appropriate prompting scheme and a web User Interface (UI) integrating textual and visual elements cohesively, allowing interactive conversations between state-of-the-art MLLMs and users. The initial step provides the model with historical information to ensure a smooth multi-round conversation. Subsequently, a UI is created with interactive components that let users submit images and queries and receive responses from the MLLM. In this project, a prompt combining the new question with a summarisation of the two previous answers increased user satisfaction by nearly 50% compared with no contextual prompting, highlighting its potential as a promising, cost-efficient way to supply context at inference time for a visual dialog system. The outcome of this project potentially lays the groundwork for further domain-specific applications, including education, content creation, and virtual assistants, where visual dialog plays a crucial role in helping humans harness the power of AI.
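The contextual prompting scheme described in the abstract can be sketched as follows. This is a minimal illustration, not code from the project: the function names, the dialog-history structure, and the word-truncation stand-in for summarisation are all assumptions (the project would plausibly use the MLLM itself to produce the summaries).

```python
# Hypothetical sketch of the contextual prompting scheme: the new
# question is combined with summaries of the two previous answers
# before being sent to the MLLM. Names and structure are illustrative.

def summarise(answer: str, max_words: int = 30) -> str:
    """Stand-in summariser: truncate to the first max_words words.
    A real system would likely ask the MLLM for a summary instead."""
    return " ".join(answer.split()[:max_words])

def build_contextual_prompt(history: list, new_question: str) -> str:
    """Combine the new question with summaries of the last two answers."""
    recent_answers = [turn["answer"] for turn in history][-2:]
    context_lines = [f"Previous answer summary: {summarise(a)}"
                     for a in recent_answers]
    return "\n".join(context_lines + [f"Question: {new_question}"])

# Example multi-round dialog history (hypothetical data).
history = [
    {"question": "What is in the image?",
     "answer": "A dog playing in a park."},
    {"question": "What breed is it?",
     "answer": "It looks like a golden retriever."},
]
prompt = build_contextual_prompt(history, "Is it on a leash?")
print(prompt)
```

Supplying only compact summaries of recent answers, rather than the full transcript, is what makes this scheme cost-efficient at inference time: the context grows by a bounded amount per round instead of linearly with the whole conversation.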

Full description

Bibliographic Details
Main Author: Luong, Hien Nga
Other Authors: Hanwang Zhang
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2024
Subjects: Computer and Information Science; Visual dialog
Online Access:https://hdl.handle.net/10356/175080
Institution: Nanyang Technological University
School: School of Computer Science and Engineering
Supervisor: Hanwang Zhang (hanwangzhang@ntu.edu.sg)
Degree: Bachelor's degree
Project code: SCSE23-0215
Collection: DR-NTU, NTU Library
Citation: Luong, H. N. (2024). Visual dialog system. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/175080