GLOVER: Generalizable Open-Vocabulary Affordance Reasoning for Task-Oriented Grasping
- URL: http://arxiv.org/abs/2411.12286v2
- Date: Thu, 01 May 2025 09:13:09 GMT
- Title: GLOVER: Generalizable Open-Vocabulary Affordance Reasoning for Task-Oriented Grasping
- Authors: Teli Ma, Zifan Wang, Jiaming Zhou, Mengmeng Wang, Junwei Liang
- Abstract summary: We propose a unified Generalizable Open-Vocabulary Affordance Reasoning framework, which fine-tunes Large Language Models (LLMs) to predict the visual affordance of graspable object parts within RGB feature space. GLOVER inherits world knowledge and common-sense reasoning from LLMs, facilitating more fine-grained object understanding and sophisticated tool-use reasoning. In evaluations across 30 table-top real-world scenes, GLOVER achieves success rates of 86.4% in part identification and 76.3% in grasping, with speeds approximately 29 times faster in affordance reasoning and 40 times faster in grasping pose estimation than the previous state-of-the-art.
- Score: 23.677556075872793
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Inferring affordable (i.e., graspable) parts of arbitrary objects based on human specifications is essential for robots advancing toward open-vocabulary manipulation. Current grasp planners, however, are hindered by limited vision-language comprehension and time-consuming 3D radiance modeling, restricting real-time, open-vocabulary interactions with objects. To address these limitations, we propose GLOVER, a unified Generalizable Open-Vocabulary Affordance Reasoning framework, which fine-tunes Large Language Models (LLMs) to predict the visual affordance of graspable object parts within RGB feature space. We compile a dataset of over 10,000 images from human-object interactions, annotated with unified visual and linguistic affordance labels, to enable multi-modal fine-tuning. GLOVER inherits world knowledge and common-sense reasoning from LLMs, facilitating more fine-grained object understanding and sophisticated tool-use reasoning. To enable effective real-world deployment, we present Affordance-Aware Grasping Estimation (AGE), a non-parametric grasp planner that aligns the gripper pose with a superquadric surface derived from affordance data. In evaluations across 30 table-top real-world scenes, GLOVER achieves success rates of 86.0% in part identification and 76.3% in grasping, with speeds approximately 29 times faster in affordance reasoning and 40 times faster in grasping pose estimation than the previous state-of-the-art. We also validate generalization across embodiments, showing effectiveness on humanoid robots with dexterous hands.
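The abstract describes AGE only at a high level (fit a superquadric to the affordance region, then align the gripper with its surface), so the following is a minimal sketch of that idea under stated assumptions rather than the paper's implementation: affordance-weighted 3D points are given, the superquadric is fit in its principal-axis frame by least squares on the standard inside-outside function, and the grasp frame is placed at the centre with the approach axis along the shortest fitted axis. The function names and the grasp rule are hypothetical.

```python
"""Illustrative sketch only: affordance-weighted superquadric fit and a toy grasp rule."""
import numpy as np
from scipy.optimize import least_squares


def fit_superquadric(points, weights):
    """Fit superquadric scales (a1, a2, a3) and exponents (e1, e2) to weighted 3D points.

    Assumption: the superquadric is aligned with the weighted principal axes,
    so only the five shape parameters are optimised.
    """
    centre = np.average(points, axis=0, weights=weights)
    centred = points - centre
    cov = (centred * weights[:, None]).T @ centred / weights.sum()
    _, axes = np.linalg.eigh(cov)              # columns = principal axes
    local = centred @ axes                     # points in the superquadric frame

    def residual(params):
        a1, a2, a3, e1, e2 = params
        f = (np.abs(local[:, 0] / a1) ** (2 / e2)
             + np.abs(local[:, 1] / a2) ** (2 / e2)) ** (e2 / e1) \
            + np.abs(local[:, 2] / a3) ** (2 / e1)
        return np.sqrt(weights) * (f - 1.0)    # zero when a point lies on the surface

    ext = np.maximum(np.abs(local).max(axis=0), 1e-2)
    x0 = np.concatenate([ext, [1.0, 1.0]])
    bounds = ([1e-3, 1e-3, 1e-3, 0.1, 0.1], [np.inf, np.inf, np.inf, 2.0, 2.0])
    sol = least_squares(residual, x0, bounds=bounds)
    return centre, axes, sol.x


def grasp_from_superquadric(centre, axes, params):
    """Toy rule: approach along the shortest fitted axis, close across the next one."""
    order = np.argsort(params[:3])
    approach, closing = axes[:, order[0]], axes[:, order[1]]
    rotation = np.stack([closing, np.cross(approach, closing), approach], axis=1)
    return centre, rotation                    # gripper position and 3x3 orientation
```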
Related papers
- Information-Theoretic Graph Fusion with Vision-Language-Action Model for Policy Reasoning and Dual Robotic Control [22.74768543283102]
Graph-Fused Vision-Language-Action (GF-VLA) is a framework that enables dual-arm robotic systems to perform task-level reasoning and execution. GF-VLA first extracts Shannon-information-based cues to identify the hands and objects with the highest task relevance. A cross-hand selection policy then infers the optimal assignment without explicit geometric reasoning.
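The summary does not say which information measure GF-VLA uses; as a loose, generic illustration of a Shannon-information-based cue, one can rank detected hands and objects by their surprisal under an assumed relevance prior. The prior and the ranking rule below are placeholders, not the paper's method.

```python
# Generic illustration of a Shannon-information cue, not GF-VLA's actual extractor.
import numpy as np

def rank_by_surprisal(candidates, prior):
    """Rank detected labels by surprisal -log p under an assumed relevance prior."""
    scores = {c: -np.log(prior.get(c, 1e-6)) for c in candidates}
    return sorted(scores, key=scores.get, reverse=True)

# Rarer (more informative) detections rank first, e.g. the tool over the table clutter.
print(rank_by_surprisal(["cup", "screwdriver"], {"cup": 0.30, "screwdriver": 0.05}))
```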
arXiv Detail & Related papers (2025-08-07T12:48:09Z)
- RAGNet: Large-scale Reasoning-based Affordance Segmentation Benchmark towards General Grasping [101.22617426879079]
We build a large-scale grasping-oriented affordance segmentation benchmark with human-like instructions, named RAGNet. The images cover diverse embodied data domains, such as wild, robot, ego-centric, and even simulation data. We propose a comprehensive affordance-based grasping framework, named AffordanceNet, which consists of a VLM pre-trained on our massive affordance data and a grasping network that conditions on an affordance map to grasp the target.
arXiv Detail & Related papers (2025-07-31T17:17:05Z)
- Affogato: Learning Open-Vocabulary Affordance Grounding with Automated Data Generation at Scale [41.693908591580175]
We develop vision-language models that leverage pretrained part-aware vision backbones and a text-conditional heatmap decoder. Our models achieve promising performance on the existing 2D and 3D benchmarks, and notably, exhibit effectiveness in open-vocabulary cross-domain generalization.
arXiv Detail & Related papers (2025-06-13T17:57:18Z)
- ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning [68.4209681278336]
Open-vocabulary 3D visual grounding and reasoning aim to localize objects in a scene based on implicit language descriptions.
Current methods struggle because they rely heavily on fine-tuning with 3D annotations and mask proposals.
We propose ReasonGrounder, an LVLM-guided framework that uses hierarchical 3D feature Gaussian fields for adaptive grouping.
arXiv Detail & Related papers (2025-03-30T03:40:35Z)
- 3D-AffordanceLLM: Harnessing Large Language Models for Open-Vocabulary Affordance Detection in 3D Worlds [81.14476072159049]
3D affordance detection is a challenging problem with broad applications in various robotic tasks.
We reformulate the traditional affordance detection paradigm into an Instruction Reasoning Affordance Segmentation (IRAS) task.
We propose 3D-ADLLM, a framework designed for reasoning-based affordance detection in open 3D scenes.
arXiv Detail & Related papers (2025-02-27T12:29:44Z)
- Seeing with Partial Certainty: Conformal Prediction for Robotic Scene Recognition in Built Environments [6.295098866364597]
We introduce Seeing with Partial Certainty (SwPC) - a framework designed to measure and align uncertainty in VLM-based place recognition.
SwPC is built on the theory of conformal prediction to provide statistical guarantees on place recognition while minimizing requests for human help.
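SwPC's own calibration procedure is not given in this entry; the sketch below shows only the generic split conformal prediction recipe it builds on: compute nonconformity scores on a held-out calibration set, threshold them at a finite-sample-corrected quantile, and return label sets that cover the true place with probability at least 1 - alpha. The variable names are illustrative.

```python
# Generic split conformal prediction sketch (not SwPC's exact procedure).
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Calibrate on held-out data: nonconformity = 1 - probability of the true label."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)   # finite-sample correction
    return np.quantile(scores, level, method="higher")

def prediction_set(test_probs, qhat):
    """All place labels whose nonconformity stays below the calibrated threshold."""
    return np.where(1.0 - test_probs <= qhat)[0]

# A robot might only ask a human for help when the returned set has more than one label,
# mirroring the "minimizing requests for human help" goal described above.
```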
arXiv Detail & Related papers (2025-01-09T03:50:00Z)
- Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous Sensors via Language Grounding [85.63710017456792]
FuSe is a novel approach that enables finetuning visuomotor generalist policies on heterogeneous sensor modalities.
We show that FuSe enables performing challenging tasks that require reasoning jointly over modalities such as vision, touch, and sound.
Experiments in the real world show that FuSe is able to increase success rates by over 20% compared to all considered baselines.
arXiv Detail & Related papers (2025-01-08T18:57:33Z)
- Structured Spatial Reasoning with Open Vocabulary Object Detectors [2.089191490381739]
Reasoning about spatial relationships between objects is essential for many real-world robotic tasks.
We introduce a structured probabilistic approach that integrates rich 3D geometric features with state-of-the-art open-vocabulary object detectors.
The approach is evaluated and compared against the zero-shot performance of state-of-the-art Vision and Language Models (VLMs) on spatial reasoning tasks.
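The entry does not specify the probabilistic model or the 3D features used; the snippet below is only a schematic example of the kind of geometric factor such a model could combine with open-vocabulary detections: a soft "left-of" score computed from two detected 3D box centres. The logistic form and the scale parameter are assumptions.

```python
# Schematic geometric factor, not the paper's model.
import numpy as np

def left_of_score(center_a, center_b, axis=0, scale=0.05):
    """Soft probability that A is left of B along a chosen axis (centres in metres)."""
    offset = center_b[axis] - center_a[axis]
    return 1.0 / (1.0 + np.exp(-offset / scale))

# Detected 3D boxes supply the centres; scores like this become factors in the model.
print(left_of_score(np.array([0.10, 0.0, 0.0]), np.array([0.30, 0.0, 0.0])))
```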
arXiv Detail & Related papers (2024-10-09T19:37:01Z)
- Towards Open-World Grasping with Large Vision-Language Models [5.317624228510749]
An open-world grasping system should be able to combine high-level contextual reasoning with low-level physical-geometric reasoning.
We propose OWG, an open-world grasping pipeline that combines vision-language models with segmentation and grasp synthesis models.
We conduct evaluation in cluttered indoor scene datasets to showcase OWG's robustness in grounding from open-ended language.
arXiv Detail & Related papers (2024-06-26T19:42:08Z)
- OpenObj: Open-Vocabulary Object-Level Neural Radiance Fields with Fine-Grained Understanding [21.64446104872021]
We introduce OpenObj, an innovative approach to build open-vocabulary object-level Neural Radiance Fields with fine-grained understanding.
In essence, OpenObj establishes a robust framework for efficient and watertight scene modeling and comprehension at the object level.
The results on multiple datasets demonstrate that OpenObj achieves superior performance in zero-shot semantic and retrieval tasks.
arXiv Detail & Related papers (2024-06-12T08:59:33Z)
- Is CLIP the main roadblock for fine-grained open-world perception? [7.190567053576658]
Recent studies have highlighted limitations in fine-grained recognition capabilities in open-vocabulary settings.
We show that the lack of fine-grained understanding is caused by the poor separability of object characteristics in the CLIP latent space.
Our experiments show that simple CLIP latent-space re-projections help separate fine-grained concepts.
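The paper's specific re-projection is not described in this summary; one common minimal instance of a linear latent-space re-projection is ZCA whitening of the embedding cloud, sketched below under that assumption (whether this matches the paper's transformation is not claimed).

```python
# One possible linear re-projection (ZCA whitening); assumed, not taken from the paper.
import numpy as np

def whiten(embeddings, eps=1e-5):
    """Whiten (n, d) CLIP-style embeddings and re-normalise them to the unit sphere."""
    centred = embeddings - embeddings.mean(axis=0, keepdims=True)
    cov = centred.T @ centred / len(centred)
    vals, vecs = np.linalg.eigh(cov)
    w = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T   # ZCA whitening matrix
    out = centred @ w
    return out / np.linalg.norm(out, axis=1, keepdims=True)
```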
arXiv Detail & Related papers (2024-04-04T15:47:30Z)
- GOV-NeSF: Generalizable Open-Vocabulary Neural Semantic Fields [50.68719394443926]
Generalizable Open-Vocabulary Neural Semantic Fields (GOV-NeSF) is a novel approach offering a generalizable implicit representation of 3D scenes with open-vocabulary semantics.
GOV-NeSF exhibits state-of-the-art performance in both 2D and 3D open-vocabulary semantic segmentation.
arXiv Detail & Related papers (2024-04-01T05:19:50Z)
- Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z)
- YOLO-World: Real-Time Open-Vocabulary Object Detection [87.08732047660058]
We introduce YOLO-World, an innovative approach that enhances YOLO with open-vocabulary detection capabilities.
Our method excels in detecting a wide range of objects in a zero-shot manner with high efficiency.
YOLO-World achieves 35.4 AP with 52.0 FPS on V100, which outperforms many state-of-the-art methods in terms of both accuracy and speed.
arXiv Detail & Related papers (2024-01-30T18:59:38Z)
- Detecting Any Human-Object Interaction Relationship: Universal HOI Detector with Spatial Prompt Learning on Foundation Models [55.20626448358655]
This study explores universal interaction recognition in an open-world setting through the use of Vision-Language (VL) foundation models and large language models (LLMs).
Our design includes an HO Prompt-guided Decoder (HOPD), which facilitates the association of high-level relation representations in the foundation model with various HO pairs within the image.
For open-category interaction recognition, our method supports either of two input types: interaction phrase or interpretive sentence.
arXiv Detail & Related papers (2023-11-07T08:27:32Z)
- VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [38.503337052122234]
Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation.
We aim to synthesize robot trajectories for a variety of manipulation tasks given an open-set of instructions and an open-set of objects.
We demonstrate how the proposed framework can benefit from online experiences by efficiently learning a dynamics model for scenes that involve contact-rich interactions.
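VoxPoser's actual code is not reproduced here; the toy below only illustrates the compositional idea in the summary: several 3D value maps over a voxel grid (for example, attraction toward a target plus a penalty around an obstacle) are combined by weighted summation, and gripper waypoints are obtained by greedy descent on the composite map. The composition rule and planner are simplifications.

```python
# Toy composition of 3D value maps and greedy waypoint extraction; not VoxPoser's code.
import numpy as np

def compose_value_maps(maps, weights):
    """Weighted sum of same-shaped 3D value maps (lower values = more desirable)."""
    return sum(w * m for w, m in zip(weights, maps))

def greedy_waypoints(value_map, start, steps=50):
    """Walk voxel-by-voxel toward lower values of the composite map."""
    pos = np.array(start)
    path = [tuple(pos)]
    moves = np.array([[1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, -1, 0], [0, 0, 1], [0, 0, -1]])
    for _ in range(steps):
        best, best_val = pos, value_map[tuple(pos)]
        for d in moves:
            nxt = pos + d
            if np.all(nxt >= 0) and np.all(nxt < value_map.shape) and value_map[tuple(nxt)] < best_val:
                best, best_val = nxt, value_map[tuple(nxt)]
        if np.array_equal(best, pos):
            break                               # reached a local minimum of the composite map
        pos = best
        path.append(tuple(pos))
    return path
```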
arXiv Detail & Related papers (2023-07-12T07:40:48Z)
- A Threefold Review on Deep Semantic Segmentation: Efficiency-oriented, Temporal and Depth-aware design [77.34726150561087]
We conduct a survey on the most relevant and recent advances in Deep Semantic Segmentation in the context of vision for autonomous vehicles.
Our main objective is to provide a comprehensive discussion on the main methods, advantages, limitations, results and challenges faced from each perspective.
arXiv Detail & Related papers (2023-03-08T01:29:55Z)
- INVIGORATE: Interactive Visual Grounding and Grasping in Clutter [56.00554240240515]
INVIGORATE is a robot system that interacts with humans through natural language and grasps a specified object in clutter.
We train separate neural networks for object detection, for visual grounding, for question generation, and for OBR detection and grasping.
We build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules.
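The entry does not give INVIGORATE's observation or transition models, so the snippet below is only a generic Bayesian belief update over which detected object is the referred target after a yes/no answer to a clarifying question; the two likelihood values are placeholders standing in for the learned modules.

```python
# Generic Bayesian belief update over candidate target objects; likelihoods are placeholders.
import numpy as np

def update_belief(belief, asked_idx, answer_yes, p_yes_if_target=0.9, p_yes_if_other=0.1):
    """Update a categorical belief after asking 'do you mean object asked_idx?'."""
    like = np.full_like(belief, p_yes_if_other if answer_yes else 1.0 - p_yes_if_other)
    like[asked_idx] = p_yes_if_target if answer_yes else 1.0 - p_yes_if_target
    post = belief * like
    return post / post.sum()

# Three candidates, uniform prior, user answers "no" to candidate 0:
print(update_belief(np.ones(3) / 3, asked_idx=0, answer_yes=False))
```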
arXiv Detail & Related papers (2021-08-25T07:35:21Z)