Intelligent robot grasp planning with multimodal large language model

Autonomous robot grasping in multi-object scenarios poses significant challenges, requiring precise grasp candidate detection, determination of object-grasp affiliations, and reasoning about inter-object relationships to minimize collisions and collapses. This research presents a novel approach to address these challenges by developing a dedicated grasp detection model called GraspAnything, and integrating it with a multimodal large language model (MLLM) for open-vocabulary object detection and grasp sequence planning. The GraspAnything model, based on Meta's SegmentAnything (SAM) model, receives bounding boxes as prompts and simultaneously outputs the object's mask and all possible grasp poses for a parallel-jaw gripper. A grasp decoder module is added to the SAM model to enable grasp detection functionality. The MLLM, based on Apple's Ferret model, understands images and location referencing by converting input boxes or masks into region feature tokens. With 7B parameters, the MLLM possesses the intelligence to reason about spatial relationships between objects and directly output bounding box coordinates for precise open-vocabulary object detection. Both models are fine-tuned on a custom dataset to fit the proposed task. The results demonstrate that the GraspAnything model detects grasps precisely for each object, with high accuracy in grasp affiliations due to the internal connection between objects and grasps during model inference. Combined with the MLLM, the system enables open-vocabulary grasping with autonomous sequence planning. The implications of this research extend to various industrial applications, such as object picking and sorting, where intelligent robot grasping can significantly enhance efficiency and automation. The developed models and approaches contribute to the advancement of autonomous robot grasping in complex, multi-object environments.
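The abstract describes a two-stage pipeline: the Ferret-based MLLM performs open-vocabulary detection and decides a collision-aware picking order, and the SAM-based GraspAnything model turns each detected bounding box into a segmentation mask plus candidate grasp poses. The following is a minimal Python sketch of that data flow only. The thesis does not publish an API, so every name below (PlannerMLLM, GraspDetector, grasps_from_box, plan_order, and so on) is a hypothetical interface invented for illustration, not the author's implementation.

# Hypothetical sketch of the two-model pipeline described in the abstract.
# None of the class or method names come from the thesis itself.
from __future__ import annotations
from dataclasses import dataclass
from typing import Protocol, Sequence

Box = tuple[float, float, float, float]  # assumed (x0, y0, x1, y1) in image coordinates

@dataclass
class GraspPose:
    """A 2D grasp rectangle for a parallel-jaw gripper (assumed representation)."""
    x: float        # grasp centre, image coordinates
    y: float
    theta: float    # gripper rotation angle
    width: float    # jaw opening

class PlannerMLLM(Protocol):
    """Assumed interface for the Ferret-based planner role."""
    def detect(self, image, query: str) -> Sequence[Box]:
        """Open-vocabulary detection: return bounding boxes matching the query."""
        ...
    def plan_order(self, image, boxes: Sequence[Box]) -> Sequence[int]:
        """Reason about inter-object relations and return a safe picking order."""
        ...

class GraspDetector(Protocol):
    """Assumed interface for the SAM-based GraspAnything role."""
    def grasps_from_box(self, image, box: Box) -> tuple[object, Sequence[GraspPose]]:
        """Box prompt in; (segmentation mask, candidate grasp poses) out."""
        ...

def plan_grasps(image, query: str, mllm: PlannerMLLM, grasper: GraspDetector) -> list[GraspPose]:
    """Return one grasp per requested object, in a collision-aware order."""
    boxes = mllm.detect(image, query)       # open-vocabulary object detection
    order = mllm.plan_order(image, boxes)   # e.g. remove occluding objects first
    plan: list[GraspPose] = []
    for idx in order:
        _mask, candidates = grasper.grasps_from_box(image, boxes[idx])
        if candidates:                      # keep the top-ranked grasp per object
            plan.append(candidates[0])
    return plan

Per the abstract, the planner role is filled by a fine-tuned 7B-parameter Ferret model and the grasp detector by SAM with an added grasp decoder; the Protocol classes above capture only the shape of the exchange between the two stages, which is why grasp affiliation is resolved inside the detector (each grasp is produced from its object's box prompt) rather than matched afterwards.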

Bibliographic Details
Main Author: Liu, Songting
Other Authors: Lin Zhiping (School of Electrical and Electronic Engineering)
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2024
Subjects: Computer and Information Science; Robot grasping; Deep learning
Online Access:https://hdl.handle.net/10356/176474
Institution: Nanyang Technological University
School: School of Electrical and Electronic Engineering
Contact: EZPLin@ntu.edu.sg
Degree: Bachelor's degree
Project: B3114-231
Format: application/pdf
Deposited: 2024-05-17
Collection: DR-NTU (NTU Library)
Citation: Liu, S. (2024). Intelligent robot grasp planning with multimodal large language model. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/176474