GraspCoT: Integrating Physical Property Reasoning for 6-DoF Grasping under Flexible Language Instructions
- URL: http://arxiv.org/abs/2503.16013v1
- Date: Thu, 20 Mar 2025 10:32:38 GMT
- Title: GraspCoT: Integrating Physical Property Reasoning for 6-DoF Grasping under Flexible Language Instructions
- Authors: Xiaomeng Chu, Jiajun Deng, Guoliang You, Wei Liu, Xingchen Li, Jianmin Ji, Yanyong Zhang,
- Abstract summary: We propose a 6-DoF grasp detection framework that integrates a Chain-of-Thought (CoT) reasoning mechanism oriented to physical properties. We present IntentGrasp, a large-scale benchmark that fills the gap in public datasets for multi-object grasp detection under diverse and indirect verbal commands.
- Score: 24.947855662285015
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Flexible instruction-guided 6-DoF grasping is a significant yet challenging task for real-world robotic systems. Existing methods use the contextual understanding capabilities of large language models (LLMs) to establish mappings between expressions and targets, allowing robots to comprehend users' intentions in the instructions. However, the LLM's knowledge of objects' physical properties remains underexplored despite its close relevance to grasping. In this work, we propose GraspCoT, a 6-DoF grasp detection framework that integrates a Chain-of-Thought (CoT) reasoning mechanism oriented to physical properties, guided by auxiliary question-answering (QA) tasks. In particular, we design a set of QA templates to enable hierarchical reasoning over three stages: target parsing, physical property analysis, and grasp action selection. Moreover, GraspCoT presents a unified multimodal LLM architecture that encodes multi-view observations of 3D scenes into 3D-aware visual tokens and then jointly embeds these visual tokens with CoT-derived textual tokens within LLMs to generate grasp pose predictions. Furthermore, we present IntentGrasp, a large-scale benchmark that fills the gap in public datasets for multi-object grasp detection under diverse and indirect verbal commands. Extensive experiments on IntentGrasp demonstrate the superiority of our method, and additional validation in real-world robotic applications confirms its practicality. Code and data will be released.
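As a rough sketch of how the three-stage QA-guided reasoning described in the abstract could be organized, the snippet below chains target parsing, physical property analysis, and grasp action selection through an LLM. The template strings, the `ask_llm` callable, and the grasp-pose fields are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class GraspPose:
    """A 6-DoF grasp: 3D position plus orientation (illustrative fields only)."""
    x: float
    y: float
    z: float
    roll: float
    pitch: float
    yaw: float

# Hypothetical QA templates for the three reasoning stages named in the abstract:
# target parsing -> physical property analysis -> grasp action selection.
STAGE_TEMPLATES = [
    "Instruction: {instruction}\nWhich objects in the scene does the user refer to?",
    "For each target object, what are its material, weight, fragility, and surface friction?",
    "Given those physical properties, which approach direction and gripper width are appropriate?",
]

def grasp_cot(instruction: str, scene_tokens: str,
              ask_llm: Callable[[str], str]) -> List[str]:
    """Run the hierarchical QA chain; each answer is appended to the context
    so later stages condition on earlier reasoning."""
    context = f"Scene: {scene_tokens}"
    answers: List[str] = []
    for template in STAGE_TEMPLATES:
        prompt = context + "\n" + template.format(instruction=instruction)
        answer = ask_llm(prompt)      # any text-in/text-out LLM client
        answers.append(answer)
        context += "\n" + answer      # chain-of-thought carried into the next stage
    return answers                    # a downstream head would decode a GraspPose from these

if __name__ == "__main__":
    # Toy stub standing in for the LLM so the sketch runs end to end.
    canned = iter(["the ceramic mug", "ceramic, light, fragile",
                   "top-down approach, gentle closing width"])
    print(grasp_cot("I need something to drink from", "<3D-aware visual tokens>",
                    lambda _prompt: next(canned)))
```

Feeding each stage's answer back into the context is what makes the reasoning hierarchical; in the actual framework, the CoT-derived textual tokens are jointly embedded with 3D-aware visual tokens inside the LLM rather than concatenated as plain strings.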
Related papers
- IAAO: Interactive Affordance Learning for Articulated Objects in 3D Environments [56.85804719947]
We present IAAO, a framework that builds an explicit 3D model for intelligent agents to gain understanding of articulated objects in their environment through interaction.
We first build hierarchical features and label fields for each object state using 3D Gaussian Splatting (3DGS) by distilling mask features and view-consistent labels from multi-view images.
We then perform object- and part-level queries on the 3D Gaussian primitives to identify static and articulated elements, estimating global transformations and local articulation parameters along with affordances.
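As a rough illustration of the outputs such object- and part-level queries would yield, the sketch below defines a hypothetical record bundling a global transformation, local articulation parameters, and affordance labels for one element. All names and fields are assumptions for illustration, not IAAO's actual data structures.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple

class JointType(Enum):
    FIXED = "fixed"          # static element
    REVOLUTE = "revolute"    # e.g. a door hinge
    PRISMATIC = "prismatic"  # e.g. a drawer slide

@dataclass
class ArticulatedElement:
    """One element recovered from part-level queries on the 3D Gaussian primitives."""
    gaussian_ids: List[int]              # which primitives belong to this part
    joint_type: JointType
    global_transform: List[List[float]]  # 4x4 pose of the part in the scene
    joint_axis: Tuple[float, float, float] = (0.0, 0.0, 1.0)
    joint_limits: Tuple[float, float] = (0.0, 1.57)       # articulation range (rad or m)
    affordances: List[str] = field(default_factory=list)  # e.g. ["pull", "open"]

# Example: a cabinet door estimated as a revolute part with a "pull open" affordance.
door = ArticulatedElement(
    gaussian_ids=[101, 102, 103],
    joint_type=JointType.REVOLUTE,
    global_transform=[[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]],
    affordances=["pull", "open"],
)
print(door.joint_type.value, door.affordances)
```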
arXiv Detail & Related papers (2025-04-09T12:36:48Z) - Unlocking the Capabilities of Vision-Language Models for Generalizable and Explainable Deepfake Detection [18.125287697902813]
Current vision-language models (VLMs) have demonstrated remarkable capabilities in understanding multimodal data, but their potential remains underexplored for deepfake detection. We present a novel paradigm that unlocks VLMs' capabilities through three components.
arXiv Detail & Related papers (2025-03-19T03:20:03Z) - Cognitive Disentanglement for Referring Multi-Object Tracking [28.325814292139686]
Referring Multi-Object Tracking (RMOT) involves localizing and tracking specific objects in video sequences based on language references. Existing RMOT approaches often treat language descriptions as holistic embeddings and struggle to effectively integrate the rich semantic information contained in language expressions with visual features. We propose a Cognitive Disentanglement for Referring Multi-Object Tracking framework that adapts the "what" and "where" pathways of the human visual processing system to RMOT tasks.
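A toy sketch of the "what"/"where" split described above: the referring expression is divided into appearance cues and spatial/motion cues, which separate pathways could then fuse with visual features. The keyword list is purely illustrative and not the paper's actual decomposition.

```python
# Split a referring expression into a "what" stream (appearance/category cues)
# and a "where" stream (spatial/motion cues). The cue list is a toy assumption.
WHERE_CUES = {"left", "right", "front", "behind", "near", "far", "moving", "toward"}

def disentangle(reference: str) -> dict:
    tokens = reference.lower().replace(",", " ").split()
    where = [t for t in tokens if t in WHERE_CUES]
    what = [t for t in tokens if t not in WHERE_CUES]
    return {"what": what, "where": where}

print(disentangle("the red car moving toward the camera on the left"))
```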
arXiv Detail & Related papers (2025-03-14T15:21:54Z) - REF-VLM: Triplet-Based Referring Paradigm for Unified Visual Decoding [36.376220619032225]
REF-VLM is an end-to-end framework for unified training of various visual decoding tasks.
We construct a large-scale multi-task dataset containing over 100 million multimodal dialogue samples.
REF-VLM outperforms other MLLMs across a variety of standard benchmarks.
arXiv Detail & Related papers (2025-03-10T14:59:14Z) - 3D-AffordanceLLM: Harnessing Large Language Models for Open-Vocabulary Affordance Detection in 3D Worlds [81.14476072159049]
3D affordance detection is a challenging problem with broad applications across various robotic tasks.
We reformulate the traditional affordance detection paradigm into an Instruction Reasoning Affordance Segmentation (IRAS) task.
We propose 3D-ADLLM, a framework designed for reasoning affordance detection in open 3D scenes.
arXiv Detail & Related papers (2025-02-27T12:29:44Z) - From Objects to Events: Unlocking Complex Visual Understanding in Object Detectors via LLM-guided Symbolic Reasoning [71.41062111470414]
The proposed plug-and-play framework interfaces with any open-vocabulary detector.
At its core, our approach combines a symbolic regression mechanism that explores relationship patterns among detected entities with LLM-guided symbolic reasoning.
We compared our training-free framework against specialized event recognition systems across diverse application domains.
arXiv Detail & Related papers (2025-02-09T10:30:54Z) - Large Language Models Understand Layout [6.732578061359833]
Large language models (LLMs) demonstrate extraordinary abilities in a wide range of natural language processing (NLP) tasks.
We show that, beyond text understanding capability, LLMs are capable of processing text layouts denoted by spatial markers.
We show that layout understanding ability is beneficial for building efficient visual question-answering (VQA) systems.
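To picture the claim, the toy prompt below uses whitespace as the spatial marker, so that column alignment carries the label-value associations an LLM must recover when answering a layout-dependent question. The prompt format is an assumption for illustration, not the paper's protocol.

```python
# A toy example of text whose meaning depends on its 2-D layout: a small form
# rendered with spaces and newlines, so column alignment (the spatial markers)
# carries the association between labels and values.
layout_text = (
    "Name:        Ada Lovelace      Date:   1843-07-01\n"
    "Item         Qty     Price\n"
    "Notebook      2       4.50\n"
    "Pen           5       1.20\n"
)

question = "What is the price of a pen?"

# Any chat-style LLM client could be dropped in here; the point is only that the
# prompt preserves the whitespace layout rather than flattening it to one line.
prompt = f"{layout_text}\nQuestion: {question}\nAnswer:"
print(prompt)
```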
arXiv Detail & Related papers (2024-07-08T09:03:12Z) - ClawMachine: Learning to Fetch Visual Tokens for Referential Comprehension [71.03445074045092]
We propose ClawMachine, offering a new methodology that explicitly notates each entity using token collectives, i.e., groups of visual tokens. Our method unifies the prompt and answer of visual referential tasks without using additional syntax. ClawMachine achieves superior performance on scene-level and referential understanding tasks with higher efficiency.
arXiv Detail & Related papers (2024-06-17T08:39:16Z) - PointLLM: Empowering Large Language Models to Understand Point Clouds [63.39876878899682]
PointLLM understands colored point clouds of objects, guided by human instructions.
It generates contextually appropriate responses, illustrating its grasp of point clouds and common sense.
arXiv Detail & Related papers (2023-08-31T17:59:46Z) - Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z) - ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction [56.790794611002106]
Large language models (LLMs) have demonstrated remarkable results in various natural language processing (NLP) tasks with in-context learning.
We propose a simple but effective in-context learning framework called ICL-D3IE.
Specifically, we extract the most difficult and distinct segments from hard training documents as hard demonstrations.
arXiv Detail & Related papers (2023-03-09T06:24:50Z)
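A minimal sketch, under assumed names, of assembling an in-context prompt from "hard demonstrations" in the spirit of the description above; the difficulty-score ranking and all field names are illustrative, not the ICL-D3IE method itself.

```python
from typing import List, Tuple

def select_hard_demonstrations(
    segments: List[Tuple[str, str, float]],  # (segment_text, extracted_fields, difficulty)
    k: int = 3,
) -> List[Tuple[str, str]]:
    """Keep the k segments the model currently finds hardest (highest difficulty score)."""
    ranked = sorted(segments, key=lambda s: s[2], reverse=True)
    return [(text, fields) for text, fields, _ in ranked[:k]]

def build_icl_prompt(demos: List[Tuple[str, str]], query_document: str) -> str:
    """Concatenate hard demonstrations ahead of the query document."""
    parts = []
    for text, fields in demos:
        parts.append(f"Document:\n{text}\nExtracted fields:\n{fields}\n")
    parts.append(f"Document:\n{query_document}\nExtracted fields:\n")
    return "\n".join(parts)

# Toy usage with made-up difficulty scores.
demos = select_hard_demonstrations([
    ("Invoice #881 ... Total 42.00", "total=42.00", 0.9),
    ("Receipt, blurry scan ...", "total=?", 0.7),
    ("PO 12 ... Amount 10.00", "amount=10.00", 0.2),
])
print(build_icl_prompt(demos, "Invoice #902 ... Total 13.37"))
```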