Embodied Object Hunt


Bibliographic Details
Main Author: Kam, Rainer I-Wen
Other Authors: Cham Tat Jen
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2024
Online Access:https://hdl.handle.net/10356/175084
Institution: Nanyang Technological University
Description
Summary: This study investigates the use of multimodal encoders in the Embodied Object Hunt task. The approach is motivated by recent developments in joint multimodal encoders, such as CLIP, that extract common features from images and text. This ability suits tasks combining imagery and text, such as the Embodied Object Hunt, which uses visual observations and textual input prompts. The study also explores supplementing agent learning with intrinsic curiosity rewards, which encourage agents to explore their environment and thereby facilitate learning. It compares agents trained with CLIP embeddings and intrinsic curiosity against agents trained without them, and analyzes the key differences in their training results. The results can be used to assess the effectiveness and feasibility of different approaches to training embodied agents, serving as an exploratory study on which future improvements can be built.
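The summary mentions two ingredients: CLIP-style joint embeddings that place images and text in a shared feature space, and an intrinsic curiosity bonus based on a forward model's prediction error. The sketch below illustrates how such signals are typically combined into a reward; it is a minimal illustration using random NumPy placeholder vectors in place of real CLIP encoder outputs and a real learned forward model, and the embedding dimension (512) and scaling coefficients are hypothetical, not taken from this study.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 512  # hypothetical embedding dimension, chosen for illustration


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Placeholders standing in for CLIP image/text encoder outputs.
image_emb = rng.normal(size=DIM)   # embedding of the agent's observation
text_emb = rng.normal(size=DIM)    # embedding of the target object prompt

# Task-alignment signal: how similar the current view is to the prompt.
clip_score = cosine_similarity(image_emb, text_emb)

# Curiosity bonus (ICM-style): squared error of a forward model that
# predicts the next observation's features; larger error = more novelty.
predicted_next = rng.normal(size=DIM)
actual_next = rng.normal(size=DIM)
curiosity_bonus = 0.5 * float(np.sum((predicted_next - actual_next) ** 2))

# Combined shaped reward (coefficients are illustrative assumptions).
total_reward = clip_score + 0.01 * curiosity_bonus
```

In a real agent, `image_emb`/`text_emb` would come from a frozen CLIP encoder and `predicted_next` from a trained forward-dynamics network; the curiosity term shrinks as the model learns to predict familiar states, pushing exploration toward novel ones.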