Intelligent robot grasp planning with multimodal large language model
Main Author: Liu, Songting
Other Authors: Lin Zhiping (School of Electrical and Electronic Engineering)
Format: Final Year Project (FYP)
Language: English
Published: Nanyang Technological University, 2024
Subjects: Computer and Information Science; Robot grasping; Deep learning
Degree: Bachelor's degree
Online Access: https://hdl.handle.net/10356/176474
Institution: Nanyang Technological University
Citation: Liu, S. (2024). Intelligent robot grasp planning with multimodal large language model. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/176474
Description:
Autonomous robot grasping in multi-object scenarios poses significant challenges, requiring precise grasp candidate detection, determination of object-grasp affiliations, and reasoning about inter-object relationships to minimize collisions and collapses. This research presents a novel approach to address these challenges by developing a dedicated grasp detection model called GraspAnything, and integrating it with a multimodal large language model (MLLM) for open-vocabulary object detection and grasp sequence planning.
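The abstract outlines a two-stage pipeline: the MLLM handles open-vocabulary object detection and grasp sequence planning, while a dedicated model handles per-object grasp detection. The sketch below illustrates that data flow only; all names (`Grasp`, `plan_and_grasp`, `planner.plan`, `grasp_model.detect`) and the five-parameter grasp encoding are hypothetical illustrations, not the thesis's actual API.

```python
from dataclasses import dataclass

@dataclass
class Grasp:
    """A planar grasp for a parallel-jaw gripper (hypothetical encoding)."""
    x: float      # grasp centre, image coordinates
    y: float
    theta: float  # gripper rotation about the camera axis, radians
    width: float  # jaw opening
    score: float  # detection confidence

def plan_and_grasp(image, instruction, planner, grasp_model):
    """Two-stage flow: the MLLM plans what to pick and in what order,
    then the grasp model detects grasps for each planned object."""
    # 1. Open-vocabulary detection + sequence planning: the MLLM returns
    #    object boxes ordered to minimise collisions and collapses.
    ordered_boxes = planner.plan(image, instruction)
    # 2. Per-object grasp detection: each box prompt yields grasps that
    #    belong to that object by construction.
    return [grasp_model.detect(image, box) for box in ordered_boxes]
```

Ordering the boxes before grasp detection is what lets the system remove occluding or supporting objects in a safe sequence, which is the collision-avoidance reasoning the abstract attributes to the MLLM.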
The GraspAnything model, built on Meta's Segment Anything (SAM) model, takes bounding boxes as prompts and simultaneously outputs the object's mask and all feasible grasp poses for a parallel-jaw gripper. A grasp decoder module is added to the SAM architecture to provide the grasp detection capability. The MLLM, based on Apple's Ferret model, understands images and location references by converting input boxes or masks into region feature tokens. With 7B parameters, it can reason about spatial relationships between objects and directly output bounding box coordinates for precise open-vocabulary object detection.
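As a structural illustration of the added grasp decoder, here is a minimal PyTorch sketch of a head that maps a fused image/prompt embedding to a fixed set of grasp candidates. The embedding size, candidate count, and MLP shape are assumptions made for illustration; the thesis's actual decoder design is not specified in this record.

```python
import torch
import torch.nn as nn

class GraspDecoderHead(nn.Module):
    """Maps a fused image/prompt embedding to N grasp candidates,
    each parameterised as (x, y, theta, width, score)."""
    def __init__(self, embed_dim: int = 256, num_grasps: int = 16):
        super().__init__()
        self.num_grasps = num_grasps
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, num_grasps * 5),
        )

    def forward(self, fused_embedding: torch.Tensor) -> torch.Tensor:
        # fused_embedding: (batch, embed_dim), e.g. pooled decoder tokens
        # after SAM's decoder has attended to the box prompt.
        out = self.mlp(fused_embedding)
        return out.view(-1, self.num_grasps, 5)

# Example: one pooled 256-d embedding in, 16 candidate grasps out.
head = GraspDecoderHead()
grasps = head(torch.randn(1, 256))  # shape: (1, 16, 5)
```

Because the grasp candidates are decoded from the same prompt-conditioned embedding as the mask, each grasp is tied to its object at inference time, which matches the affiliation behaviour described below.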
Both models are fine-tuned on a custom dataset for the proposed task. The results demonstrate that the GraspAnything model detects grasps precisely for each object, with high accuracy in grasp affiliation because objects and grasps are internally linked during model inference. Combined with the MLLM, the system enables open-vocabulary grasping with autonomous sequence planning.
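A hedged sketch of what such fine-tuning could look like: freeze the pretrained backbone and train only the added grasp head on the custom dataset. The frozen-backbone strategy, the smooth-L1 regression loss, and the data format are all assumptions, not details given in the abstract.

```python
import torch

def finetune_grasp_decoder(backbone, grasp_head, loader, epochs=10):
    """Train only the grasp head; keep the pretrained SAM features frozen."""
    for p in backbone.parameters():
        p.requires_grad = False  # assumption: backbone stays frozen
    opt = torch.optim.AdamW(grasp_head.parameters(), lr=1e-4)
    loss_fn = torch.nn.SmoothL1Loss()  # assumed regression loss
    for _ in range(epochs):
        for image, box_prompt, target_grasps in loader:
            emb = backbone(image, box_prompt)    # fused embedding (B, 256)
            pred = grasp_head(emb)               # (B, N, 5) grasp candidates
            loss = loss_fn(pred, target_grasps)  # assumes pre-matched targets
            opt.zero_grad()
            loss.backward()
            opt.step()
```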
The implications of this research extend to various industrial applications, such as object picking and sorting, where intelligent robot grasping can significantly enhance efficiency and automation. The developed models and approaches contribute to the advancement of autonomous robot grasping in complex, multi-object environments.