Intelligent robot grasp planning with multimodal large language model
Format: Final Year Project
Language: English
Published: Nanyang Technological University, 2024
Online Access: https://hdl.handle.net/10356/176474
Institution: Nanyang Technological University
Summary: Autonomous robot grasping in multi-object scenarios poses significant challenges: it requires precise grasp-candidate detection, determination of object-grasp affiliations, and reasoning about inter-object relationships to minimize collisions and collapses. This research addresses these challenges by developing a dedicated grasp detection model, GraspAnything, and integrating it with a multimodal large language model (MLLM) for open-vocabulary object detection and grasp sequence planning.
The GraspAnything model, built on Meta's Segment Anything Model (SAM), takes bounding boxes as prompts and simultaneously outputs the object's mask and candidate grasp poses for a parallel-jaw gripper; a grasp decoder module added to SAM provides the grasp detection capability. The MLLM, based on Apple's Ferret model, handles image understanding and location referencing by converting input boxes or masks into region feature tokens. With 7B parameters, it can reason about spatial relationships between objects and directly output bounding-box coordinates for precise open-vocabulary object detection.
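The record itself contains no code, so the following is only a minimal PyTorch sketch of the architecture described above: a SAM-style promptable model extended with a grasp decoder head. The module layout, embedding sizes, and the five-parameter planar grasp representation (center, opening width, jaw height, rotation) are assumptions for illustration, not the actual GraspAnything implementation.

```python
import torch
import torch.nn as nn

class GraspDecoder(nn.Module):
    """Hypothetical head predicting parallel-jaw grasps from a pooled
    object embedding; the real decoder would attend to image features."""
    def __init__(self, embed_dim: int = 256, num_grasps: int = 8):
        super().__init__()
        self.num_grasps = num_grasps
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, embed_dim), nn.ReLU(),
            # 5 values per grasp: center x/y, opening width, jaw height, angle
            nn.Linear(embed_dim, num_grasps * 5),
        )
        self.score_head = nn.Linear(embed_dim, num_grasps)  # grasp confidence

    def forward(self, obj_embedding: torch.Tensor):
        grasps = self.mlp(obj_embedding).view(-1, self.num_grasps, 5)
        return grasps, self.score_head(obj_embedding)

class GraspAnything(nn.Module):
    """SAM-like model: box prompt in, object mask + grasp candidates out.
    The encoders/decoders here are toy stand-ins for SAM's components."""
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.image_encoder = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)
        self.prompt_encoder = nn.Linear(4, embed_dim)    # (x1, y1, x2, y2) box
        self.mask_decoder = nn.Conv2d(embed_dim, 1, kernel_size=1)
        self.grasp_decoder = GraspDecoder(embed_dim)     # the added module

    def forward(self, image: torch.Tensor, box: torch.Tensor):
        feats = self.image_encoder(image)                # (B, C, H/16, W/16)
        prompt = self.prompt_encoder(box)                # (B, C)
        fused = feats + prompt[:, :, None, None]         # simplified fusion
        mask_logits = self.mask_decoder(fused)           # per-object mask
        obj_embedding = fused.mean(dim=(2, 3))           # pooled object token
        grasps, scores = self.grasp_decoder(obj_embedding)
        return mask_logits, grasps, scores

model = GraspAnything()
mask, grasps, scores = model(torch.randn(1, 3, 224, 224),
                             torch.tensor([[40.0, 60.0, 180.0, 200.0]]))
```

Because the mask and the grasps are decoded from the same box prompt, each predicted grasp is tied to its object by construction, which is what yields the affiliation accuracy noted below.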
Both models are fine-tuned on a custom dataset for the proposed task. The results show that GraspAnything detects grasps precisely for each object, with high accuracy in object-grasp affiliation because objects and grasps are linked internally during inference. Combined with the MLLM, the system enables open-vocabulary grasping with autonomous sequence planning.
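As a rough illustration of how the two models could compose at inference time, here is a hedged Python sketch of the pipeline: the MLLM grounds an open-vocabulary instruction into named boxes in a collision-aware pick order, and GraspAnything turns each box prompt into a grasp. Both functions are dummy stand-ins, not the system's API; the real loop would call the fine-tuned 7B Ferret-style model and the SAM-based detector.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

@dataclass
class Grasp:
    x: float          # grasp center
    y: float
    width: float      # gripper opening
    height: float     # jaw height
    angle: float      # in-plane rotation, radians
    score: float

def mllm_plan(image, instruction: str) -> List[Tuple[str, Box]]:
    """Stand-in for the fine-tuned MLLM: returns object names and boxes,
    ordered so unobstructed objects are picked first. Dummy output here."""
    return [("mug", (40.0, 60.0, 180.0, 200.0)),
            ("book", (10.0, 20.0, 120.0, 90.0))]

def detect_grasp(image, box: Box) -> Grasp:
    """Stand-in for GraspAnything: box prompt -> best grasp for that object."""
    x1, y1, x2, y2 = box
    return Grasp((x1 + x2) / 2, (y1 + y2) / 2, 40.0, 20.0, 0.0, 0.9)

def grasp_sequence(image, instruction: str) -> List[Tuple[str, Grasp]]:
    # Each grasp is decoded from its object's own box prompt, so the
    # object-grasp affiliation is known by construction.
    return [(name, detect_grasp(image, box))
            for name, box in mllm_plan(image, instruction)]

for name, grasp in grasp_sequence(None, "clear the cluttered table"):
    print(name, grasp)
```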
The implications of this research extend to industrial applications such as object picking and sorting, where intelligent robot grasping can significantly enhance efficiency and automation. The developed models and approaches advance autonomous robot grasping in complex, multi-object environments.