Intelligent robot grasp planning with multimodal large language model
Main Author: Liu, Songting
Other Authors: Lin Zhiping (School of Electrical and Electronic Engineering)
Format: Final Year Project (FYP)
Language: English
Published: Nanyang Technological University, 2024
Subjects: Computer and Information Science; Robot grasping; Deep learning
Degree: Bachelor's degree
Online Access: https://hdl.handle.net/10356/176474
Institution: Nanyang Technological University
Citation: Liu, S. (2024). Intelligent robot grasp planning with multimodal large language model. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/176474
Description:
Autonomous robot grasping in multi-object scenarios poses significant challenges, requiring precise grasp candidate detection, determination of object-grasp affiliations, and reasoning about inter-object relationships to minimize collisions and collapses. This research presents a novel approach to address these challenges by developing a dedicated grasp detection model called GraspAnything, and integrating it with a multimodal large language model (MLLM) for open-vocabulary object detection and grasp sequence planning.
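The abstract outlines a two-stage pipeline: the MLLM handles open-vocabulary object detection and grasp sequence planning, while a dedicated model handles per-object grasp detection. The sketch below illustrates that data flow only; all names (`Grasp`, `plan_and_grasp`, `planner.plan`, `grasp_model.detect`) and the five-parameter grasp encoding are hypothetical illustrations, not the thesis's actual API.

```python
from dataclasses import dataclass

@dataclass
class Grasp:
    """A planar grasp for a parallel-jaw gripper (hypothetical encoding)."""
    x: float      # grasp centre, image coordinates
    y: float
    theta: float  # gripper rotation about the camera axis, radians
    width: float  # jaw opening
    score: float  # detection confidence

def plan_and_grasp(image, instruction, planner, grasp_model):
    """Two-stage flow: the MLLM plans what to pick and in what order,
    then the grasp model detects grasps for each planned object."""
    # 1. Open-vocabulary detection + sequence planning: the MLLM returns
    #    object boxes ordered to minimise collisions and collapses.
    ordered_boxes = planner.plan(image, instruction)
    # 2. Per-object grasp detection: each box prompt yields grasps that
    #    belong to that object by construction.
    return [grasp_model.detect(image, box) for box in ordered_boxes]
```

Ordering the boxes before grasp detection is what lets the system remove occluding or supporting objects in a safe sequence, which is the collision-avoidance reasoning the abstract attributes to the MLLM.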
The GraspAnything model, built on Meta's Segment Anything (SAM) model, takes bounding boxes as prompts and simultaneously outputs the object's mask and all feasible grasp poses for a parallel-jaw gripper. A grasp decoder module is added to the SAM architecture to provide the grasp detection capability. The MLLM, based on Apple's Ferret model, understands images and location references by converting input boxes or masks into region feature tokens. With 7B parameters, it can reason about spatial relationships between objects and directly output bounding box coordinates for precise open-vocabulary object detection.
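As a structural illustration of the added grasp decoder, here is a minimal PyTorch sketch of a head that maps a fused image/prompt embedding to a fixed set of grasp candidates. The embedding size, candidate count, and MLP shape are assumptions made for illustration; the thesis's actual decoder design is not specified in this record.

```python
import torch
import torch.nn as nn

class GraspDecoderHead(nn.Module):
    """Maps a fused image/prompt embedding to N grasp candidates,
    each parameterised as (x, y, theta, width, score)."""
    def __init__(self, embed_dim: int = 256, num_grasps: int = 16):
        super().__init__()
        self.num_grasps = num_grasps
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, num_grasps * 5),
        )

    def forward(self, fused_embedding: torch.Tensor) -> torch.Tensor:
        # fused_embedding: (batch, embed_dim), e.g. pooled decoder tokens
        # after SAM's decoder has attended to the box prompt.
        out = self.mlp(fused_embedding)
        return out.view(-1, self.num_grasps, 5)

# Example: one pooled 256-d embedding in, 16 candidate grasps out.
head = GraspDecoderHead()
grasps = head(torch.randn(1, 256))  # shape: (1, 16, 5)
```

Because the grasp candidates are decoded from the same prompt-conditioned embedding as the mask, each grasp is tied to its object at inference time, which matches the affiliation behaviour described below.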
Both models are fine-tuned on a custom dataset for the proposed task. The results demonstrate that the GraspAnything model detects grasps precisely for each object, with high accuracy in grasp affiliation because objects and grasps are internally linked during model inference. Combined with the MLLM, the system enables open-vocabulary grasping with autonomous sequence planning.
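A hedged sketch of what such fine-tuning could look like: freeze the pretrained backbone and train only the added grasp head on the custom dataset. The frozen-backbone strategy, the smooth-L1 regression loss, and the data format are all assumptions, not details given in the abstract.

```python
import torch

def finetune_grasp_decoder(backbone, grasp_head, loader, epochs=10):
    """Train only the grasp head; keep the pretrained SAM features frozen."""
    for p in backbone.parameters():
        p.requires_grad = False  # assumption: backbone stays frozen
    opt = torch.optim.AdamW(grasp_head.parameters(), lr=1e-4)
    loss_fn = torch.nn.SmoothL1Loss()  # assumed regression loss
    for _ in range(epochs):
        for image, box_prompt, target_grasps in loader:
            emb = backbone(image, box_prompt)    # fused embedding (B, 256)
            pred = grasp_head(emb)               # (B, N, 5) grasp candidates
            loss = loss_fn(pred, target_grasps)  # assumes pre-matched targets
            opt.zero_grad()
            loss.backward()
            opt.step()
```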
The implications of this research extend to various industrial applications, such as object picking and sorting, where intelligent robot grasping can significantly enhance efficiency and automation. The developed models and approaches contribute to the advancement of autonomous robot grasping in complex, multi-object environments.