Build autonomous agents with multimodal knowledge
Autonomous embodied agents live on an Internet of multimedia websites. Can they hop around multimodal websites to complete complex user tasks? Existing benchmarks fail to assess them in a realistic, evolving environment for their embodiment across websites. To answer this question, we present MMInA, a multihop and multimodal benchmark to evaluate embodied agents on compositional Internet tasks, with several appealing properties: 1) Evolving real-world multimodal websites. Our benchmark uniquely operates on evolving real-world websites, ensuring a high degree of realism and applicability to natural user tasks. Our data includes 1,047 human-written tasks covering various domains such as shopping and travel, with each task requiring the agent to autonomously extract multimodal information from web pages as observations; 2) Multihop web browsing. Our dataset features naturally compositional tasks that require information from or actions on multiple websites to solve, assessing long-range reasoning capabilities on web tasks; 3) Holistic evaluation. We propose a novel protocol for evaluating an agent's progress in completing multihop tasks. We experiment with both standalone (multimodal) language models and heuristic-based web agents. Extensive experiments demonstrate that while long-chain multihop web tasks are easy for humans, they remain challenging for state-of-the-art web agents. We find that agents are more likely to fail on the early hops of tasks with more hops, which results in lower task success rates. To address this issue, we propose a simple memory augmentation approach that replays past action trajectories for reflection. Our method significantly improves both the singlehop and multihop web browsing abilities of agents. Our code and data will be publicly available.
Saved in:
Main Author: | Tian, Shu Lin |
---|---|
Other Authors: | Liu Ziwei, Wen Bihan |
Format: | Final Year Project |
Language: | English |
Published: | Nanyang Technological University, 2024 |
Subjects: | Computer and Information Science |
Online Access: | https://hdl.handle.net/10356/177298 |
Institution: | Nanyang Technological University |
id | sg-ntu-dr.10356-177298 |
---|---|
record_format | dspace |
spelling | sg-ntu-dr.10356-177298 2024-05-31T15:44:14Z. Build autonomous agents with multimodal knowledge. Tian, Shu Lin; Liu Ziwei; Wen Bihan. School of Electrical and Electronic Engineering (bihan.wen@ntu.edu.sg, ziwei.liu@ntu.edu.sg). Computer and Information Science. Autonomous embodied agents live on an Internet of multimedia websites. Can they hop around multimodal websites to complete complex user tasks? Existing benchmarks fail to assess them in a realistic, evolving environment for their embodiment across websites. To answer this question, we present MMInA, a multihop and multimodal benchmark to evaluate embodied agents on compositional Internet tasks, with several appealing properties: 1) Evolving real-world multimodal websites. Our benchmark uniquely operates on evolving real-world websites, ensuring a high degree of realism and applicability to natural user tasks. Our data includes 1,047 human-written tasks covering various domains such as shopping and travel, with each task requiring the agent to autonomously extract multimodal information from web pages as observations; 2) Multihop web browsing. Our dataset features naturally compositional tasks that require information from or actions on multiple websites to solve, assessing long-range reasoning capabilities on web tasks; 3) Holistic evaluation. We propose a novel protocol for evaluating an agent's progress in completing multihop tasks. We experiment with both standalone (multimodal) language models and heuristic-based web agents. Extensive experiments demonstrate that while long-chain multihop web tasks are easy for humans, they remain challenging for state-of-the-art web agents. We find that agents are more likely to fail on the early hops of tasks with more hops, which results in lower task success rates. To address this issue, we propose a simple memory augmentation approach that replays past action trajectories for reflection. Our method significantly improves both the singlehop and multihop web browsing abilities of agents. Our code and data will be publicly available. Bachelor's degree. 2024-05-27T06:28:08Z. 2024. Final Year Project (FYP). Tian, S. L. (2024). Build autonomous agents with multimodal knowledge. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/177298. en. application/pdf. Nanyang Technological University. |
institution | Nanyang Technological University |
building | NTU Library |
continent | Asia |
country | Singapore |
content_provider | NTU Library |
collection | DR-NTU |
language | English |
topic | Computer and Information Science |
spellingShingle | Computer and Information Science Tian, Shu Lin Build autonomous agents with multimodal knowledge |
description | Autonomous embodied agents live on an Internet of multimedia websites. Can they hop around multimodal websites to complete complex user tasks? Existing benchmarks fail to assess them in a realistic, evolving environment for their embodiment across websites. To answer this question, we present MMInA, a multihop and multimodal benchmark to evaluate embodied agents on compositional Internet tasks, with several appealing properties: 1) Evolving real-world multimodal websites. Our benchmark uniquely operates on evolving real-world websites, ensuring a high degree of realism and applicability to natural user tasks. Our data includes 1,047 human-written tasks covering various domains such as shopping and travel, with each task requiring the agent to autonomously extract multimodal information from web pages as observations; 2) Multihop web browsing. Our dataset features naturally compositional tasks that require information from or actions on multiple websites to solve, assessing long-range reasoning capabilities on web tasks; 3) Holistic evaluation. We propose a novel protocol for evaluating an agent's progress in completing multihop tasks. We experiment with both standalone (multimodal) language models and heuristic-based web agents. Extensive experiments demonstrate that while long-chain multihop web tasks are easy for humans, they remain challenging for state-of-the-art web agents. We find that agents are more likely to fail on the early hops of tasks with more hops, which results in lower task success rates. To address this issue, we propose a simple memory augmentation approach that replays past action trajectories for reflection. Our method significantly improves both the singlehop and multihop web browsing abilities of agents. Our code and data will be publicly available. |
author2 | Liu Ziwei |
author_facet | Liu Ziwei Tian, Shu Lin |
format | Final Year Project |
author | Tian, Shu Lin |
author_sort | Tian, Shu Lin |
title | Build autonomous agents with multimodal knowledge |
title_short | Build autonomous agents with multimodal knowledge |
title_full | Build autonomous agents with multimodal knowledge |
title_fullStr | Build autonomous agents with multimodal knowledge |
title_full_unstemmed | Build autonomous agents with multimodal knowledge |
title_sort | build autonomous agents with multimodal knowledge |
publisher | Nanyang Technological University |
publishDate | 2024 |
url | https://hdl.handle.net/10356/177298 |
_version_ | 1800916447311953920 |