Obstruction reasoning for robotic grasping
- URL: http://arxiv.org/abs/2511.23186v1
- Date: Fri, 28 Nov 2025 13:53:12 GMT
- Title: Obstruction reasoning for robotic grasping
- Authors: Runyu Jiao, Matteo Bortolon, Francesco Giuliari, Alice Fasoli, Sergio Povoli, Guofeng Mei, Yiming Wang, Fabio Poiesi
- Abstract summary: We present UNOGrasp, a learning-based vision-language model capable of performing obstruction reasoning. We devise a novel multi-step reasoning process based on obstruction paths originating from the target object. We construct UNOBench, a large-scale dataset for both training and benchmarking, based on MetaGraspNetV2.
- Score: 18.39507400925748
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Successful robotic grasping in cluttered environments requires a model not only to visually ground a target object but also to reason about obstructions that must be cleared beforehand. While current vision-language embodied reasoning models show emergent spatial understanding, they remain limited in obstruction reasoning and accessibility planning. To bridge this gap, we present UNOGrasp, a learning-based vision-language model capable of performing visually-grounded obstruction reasoning to infer the sequence of actions needed to unobstruct the path and grasp the target object. We devise a novel multi-step reasoning process based on obstruction paths originating from the target object, and we anchor each reasoning step with obstruction-aware visual cues to incentivize reasoning capability. UNOGrasp combines supervised and reinforcement finetuning through verifiable reasoning rewards. Moreover, we construct UNOBench, a large-scale dataset for both training and benchmarking, based on MetaGraspNetV2, with over 100k obstruction paths annotated by humans with obstruction ratios, contact points, and natural-language instructions. Extensive experiments and real-robot evaluations show that UNOGrasp significantly improves obstruction reasoning and grasp success across both synthetic and real-world environments, outperforming generalist and proprietary alternatives. Project website: https://tev-fbk.github.io/UnoGrasp/.
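To make the annotation structure described above concrete, here is a minimal sketch of how a UNOBench-style obstruction-path record (obstruction ratios, contact points, natural-language instructions) could be represented and turned into a clearing order before the final grasp. The field names, classes, and the simple ratio-based ordering heuristic are illustrative assumptions, not the paper's released schema or method.

```python
# Hypothetical sketch of a UNOBench-style annotation and a naive
# unobstruct-then-grasp ordering; names and fields are illustrative only.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ObstructionStep:
    """One object lying on an obstruction path that starts at the target."""
    object_id: str
    obstruction_ratio: float                   # assumed: fraction of the target's access blocked
    contact_point: Tuple[float, float, float]  # 3D point where the gripper should act
    instruction: str                           # natural-language instruction, e.g. "move the mug aside"


@dataclass
class ObstructionPath:
    """An obstruction path originating from the target object."""
    target_id: str
    steps: List[ObstructionStep]


def plan_actions(path: ObstructionPath) -> List[str]:
    """Return a clearing order: most obstructive objects first, target grasped last."""
    ordered = sorted(path.steps, key=lambda s: s.obstruction_ratio, reverse=True)
    actions = [f"remove {s.object_id}: {s.instruction}" for s in ordered]
    actions.append(f"grasp {path.target_id}")
    return actions


if __name__ == "__main__":
    demo = ObstructionPath(
        target_id="screwdriver",
        steps=[
            ObstructionStep("box", 0.7, (0.31, 0.12, 0.05), "lift the box off the screwdriver"),
            ObstructionStep("cable", 0.2, (0.28, 0.10, 0.02), "push the cable aside"),
        ],
    )
    for action in plan_actions(demo):
        print(action)
```

Note that UNOGrasp itself learns the unobstruction sequence through supervised and reinforcement finetuning with verifiable reasoning rewards rather than a hand-crafted ordering rule; the heuristic above only illustrates what the annotated path data supports.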
Related papers
- Self-Refining Vision Language Model for Robotic Failure Detection and Reasoning [16.274791437311602]
We introduce ARMOR: Adaptive Round-based Multi-task mOdel for Robotic failure detection and reasoning. We formulate detection and reasoning as a multi-task self-refinement process. We show that ARMOR achieves state-of-the-art performance, improving over previous approaches by up to 30% in failure detection rate.
arXiv Detail & Related papers (2026-02-12T20:55:36Z) - Structured Reasoning for Large Language Models [59.215789462977206]
We propose Structured Reasoning (SCR), a framework that decouples reasoning trajectories into explicit, evaluable, and trainable components. SCR substantially improves reasoning efficiency and self-verification. Compared with existing reasoning paradigms, it reduces output token length by up to 50%.
arXiv Detail & Related papers (2026-01-12T04:04:01Z) - Visual Reasoning Tracer: Object-Level Grounded Reasoning Benchmark [112.46338388724116]
We introduce the Visual Reasoning Tracer (VRT) task, which requires models to not only localize the target object but also explicitly predict the intermediate objects that form the reasoning path. To advance research in this area, we contribute: (1) VRT-Bench, a human-annotated benchmark for evaluating visual reasoning; (2) a new metric for assessing the quality of reasoning traces; and (3) VRT-80k, a large-scale dataset for reasoning model training.
arXiv Detail & Related papers (2025-12-04T18:55:34Z) - Think Visually, Reason Textually: Vision-Language Synergy in ARC [94.15522924153264]
ARC-AGI is a rigorous testbed for conceptual rule induction and transfer to novel tasks. Most existing methods treat ARC-AGI as a purely textual reasoning task, overlooking the fact that humans rely heavily on visual abstraction. We introduce two synergistic strategies: Vision-Language Synergy Reasoning (VLSR) and Modality-Switch Self-Correction (MSSC). Our findings suggest that unifying visual abstraction with linguistic reasoning is a crucial step toward achieving generalizable, human-like intelligence.
arXiv Detail & Related papers (2025-11-19T18:59:04Z) - VCoT-Grasp: Grasp Foundation Models with Visual Chain-of-Thought Reasoning for Language-driven Grasp Generation [67.98487725287835]
VCoT-Grasp is an end-to-end grasp foundation model that incorporates visual chain-of-thought reasoning to enhance visual understanding for grasp generation. For training, we refine and introduce a large-scale dataset, VCoT-GraspSet, comprising 167K synthetic images with over 1.36M grasps. Our method significantly improves grasp success rates and generalizes effectively to unseen objects, backgrounds, and distractors.
arXiv Detail & Related papers (2025-10-07T11:50:26Z) - Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models [67.87579664988199]
TON is a two-stage training strategy for vision-language models (VLMs). It introduces a think-or-not format that serves as a cold start for selective reasoning. TON can reduce the completion length by up to 90% compared to vanilla GRPO.
arXiv Detail & Related papers (2025-05-22T16:13:29Z) - DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping [18.410329897882658]
A general-purpose robot must be capable of grasping diverse objects in arbitrary scenarios. We present DexGraspVLA, a hierarchical framework for robust generalization in language-guided general dexterous grasping.
arXiv Detail & Related papers (2025-02-28T09:57:20Z) - A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning [73.77088902676306]
We take a closer look at the self-verification abilities of large language models (LLMs) in the context of logical reasoning.
Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods.
arXiv Detail & Related papers (2023-11-14T07:13:10Z) - INVIGORATE: Interactive Visual Grounding and Grasping in Clutter [56.00554240240515]
INVIGORATE is a robot system that interacts with humans through natural language and grasps a specified object in clutter.
We train separate neural networks for object detection, visual grounding, question generation, and OBR detection and grasping.
We build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules.
arXiv Detail & Related papers (2021-08-25T07:35:21Z)
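As a rough illustration of the kind of module integration the INVIGORATE entry above describes, the sketch below wires hypothetical detection, grounding, and question-generation modules into a simple belief-driven loop that asks a clarifying question when grounding is uncertain and grasps otherwise. The interfaces, the confidence threshold, and the greedy loop are assumptions for illustration; they are not the paper's actual POMDP formulation.

```python
# Illustrative belief-driven interaction loop in the spirit of INVIGORATE;
# the module interfaces and threshold are hypothetical, not the paper's code.
from typing import Callable, Dict, List


def interactive_grasp_loop(
    detect: Callable[[], List[str]],                        # object detector -> candidate object ids
    ground: Callable[[str, List[str]], Dict[str, float]],   # grounding scores per candidate
    ask: Callable[[List[str]], str],                        # question generator -> user's answer
    grasp: Callable[[str], bool],                           # grasp executor -> success flag
    instruction: str,
    confidence_threshold: float = 0.8,
    max_questions: int = 3,
) -> bool:
    """Ask clarifying questions until one candidate is confident enough, then grasp it."""
    candidates = detect()
    for _ in range(max_questions + 1):
        beliefs = ground(instruction, candidates)
        best = max(beliefs, key=beliefs.get)
        if beliefs[best] >= confidence_threshold:
            return grasp(best)
        # Uncertain grounding: ask about the top candidates and fold the
        # answer back into the instruction before re-scoring.
        top = sorted(beliefs, key=beliefs.get, reverse=True)[:2]
        answer = ask(top)
        instruction = f"{instruction}; clarification: {answer}"
    # Question budget exhausted: fall back to grasping the current best guess.
    return grasp(best)
```

In the actual system, the learned modules are integrated through POMDP planning over observations and actions rather than a fixed confidence threshold; the loop above only conveys the ask-or-grasp structure.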