Zero-shot object detection and referring expression comprehension using vision-language models


Bibliographic Details
Main Author: A Manicka, Praveen
Other Authors: Ang Wei Tech
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2024
Subjects:
Online Access:https://hdl.handle.net/10356/177827
Institution: Nanyang Technological University
Description
Summary: This project focused on constructing a comprehensive perception pipeline integrating Natural Language Processing (NLP), zero-shot object detection, and Referring Expression Comprehension (ReC) within a ROS (Robot Operating System) framework. The aim was to enable robotic assistive devices to accurately interpret natural language commands and ground language to physical objects in the real world. To achieve this, we compared various combinations of zero-shot object detectors and ReC models: OWL-ViT and Grounding DINO for zero-shot object detection, and ReCLIP and GPT-4 for ReC. Our evaluation assessed the models' capabilities in counting, spatial reasoning, understanding superlatives, handling multiple instances, self-referential comprehension, and identifying household objects. The findings showed that GPT-4 outperformed ReCLIP for ReC, and the combination of Grounding DINO and GPT-4 proved to be the best zero-shot object detector and ReC pair.
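The grounding step described in the abstract (mapping a referring expression to one of several detected objects) can be sketched with a toy example. This is not the project's code: the real pipeline uses OWL-ViT or Grounding DINO for detection and ReCLIP or GPT-4 for comprehension, whereas the sketch below substitutes simple hand-written spatial rules; the `Detection` structure and `resolve_expression` function are hypothetical names introduced for illustration.

```python
# Toy illustration of referring-expression grounding: given detector
# output (label, bounding box, confidence), pick the box that matches a
# simple expression such as "the leftmost cup" or "the largest cup".
# In the actual project this role is played by ReCLIP or GPT-4.
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    box: tuple   # (x_min, y_min, x_max, y_max) in pixels
    score: float

def resolve_expression(expression: str, detections: list) -> Detection:
    """Resolve a simple referring expression to one detection."""
    words = expression.lower().split()
    # Keep only detections whose class label appears in the expression.
    candidates = [d for d in detections if d.label in words]
    if not candidates:
        raise ValueError(f"no detection matches {expression!r}")
    # Handle a few spatial/superlative cues from the evaluation categories.
    if "leftmost" in words:
        return min(candidates, key=lambda d: d.box[0])
    if "rightmost" in words:
        return max(candidates, key=lambda d: d.box[2])
    if "largest" in words:
        return max(candidates,
                   key=lambda d: (d.box[2] - d.box[0]) * (d.box[3] - d.box[1]))
    # Otherwise fall back to the highest-confidence detection.
    return max(candidates, key=lambda d: d.score)

dets = [
    Detection("cup", (10, 40, 60, 100), 0.91),
    Detection("cup", (200, 35, 320, 180), 0.88),
    Detection("bottle", (120, 20, 160, 140), 0.95),
]
print(resolve_expression("the leftmost cup", dets).box)  # (10, 40, 60, 100)
print(resolve_expression("the largest cup", dets).box)   # (200, 35, 320, 180)
```

The hand-written rules above illustrate why the project's evaluation targets categories like spatial reasoning and superlatives: a vision-language ReC model must resolve exactly these cues, but from the image itself rather than from keyword matching.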