VISO-Grasp: Vision-Language Informed Spatial Object-centric 6-DoF Active View Planning and Grasping in Clutter and Invisibility
- URL: http://arxiv.org/abs/2503.12609v2
- Date: Wed, 06 Aug 2025 10:19:50 GMT
- Title: VISO-Grasp: Vision-Language Informed Spatial Object-centric 6-DoF Active View Planning and Grasping in Clutter and Invisibility
- Authors: Yitian Shi, Di Wen, Guanqi Chen, Edgar Welte, Sheng Liu, Kunyu Peng, Rainer Stiefelhagen, Rania Rayyes
- Abstract summary: VISO-Grasp is a vision-language-informed system designed to address visibility constraints for grasping in severely occluded environments. We introduce a multi-view uncertainty-driven grasp fusion mechanism that refines grasp confidence and directional uncertainty in real-time. VISO-Grasp achieves a success rate of $87.5\%$ in target-oriented grasping with the fewest grasp attempts, outperforming baselines.
- Score: 31.50489359729733
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose VISO-Grasp, a novel vision-language-informed system designed to systematically address visibility constraints for grasping in severely occluded environments. By leveraging Foundation Models (FMs) for spatial reasoning and active view planning, our framework constructs and updates an instance-centric representation of spatial relationships, enhancing grasp success under challenging occlusions. Furthermore, this representation facilitates active Next-Best-View (NBV) planning and optimizes sequential grasping strategies when direct grasping is infeasible. Additionally, we introduce a multi-view uncertainty-driven grasp fusion mechanism that refines grasp confidence and directional uncertainty in real-time, ensuring robust and stable grasp execution. Extensive real-world experiments demonstrate that VISO-Grasp achieves a success rate of $87.5\%$ in target-oriented grasping with the fewest grasp attempts, outperforming baselines. To the best of our knowledge, VISO-Grasp is the first unified framework integrating FMs into target-aware active view planning and 6-DoF grasping in environments with severe occlusions and entire invisibility constraints. Code is available at: https://github.com/YitianShi/vMF-Contact
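For intuition, the multi-view uncertainty-driven grasp fusion described in the abstract can be read as precision-weighted averaging of per-view grasp estimates. The sketch below is a minimal, hypothetical illustration (not the authors' released implementation; see the linked vMF-Contact repository for that), assuming each view contributes a scalar grasp confidence and a von Mises-Fisher (vMF) style directional estimate, i.e. a unit approach direction with concentration kappa; all names and the exact weighting rule are illustrative assumptions.

```python
# Hypothetical sketch of multi-view, uncertainty-weighted grasp fusion.
# Assumption: each view yields (approach direction mu, vMF concentration kappa,
# scalar grasp confidence); names and weighting rule are illustrative only.
import numpy as np

def fuse_directions(mus, kappas):
    """Fuse per-view unit approach directions weighted by their vMF
    concentrations (higher kappa = lower directional uncertainty)."""
    weighted = np.sum(np.asarray(kappas)[:, None] * np.asarray(mus), axis=0)
    kappa_fused = float(np.linalg.norm(weighted))  # resultant length ~ fused concentration
    mu_fused = weighted / (kappa_fused + 1e-9)     # fused mean direction (unit vector)
    return mu_fused, kappa_fused

def fuse_confidence(confidences, kappas):
    """Combine per-view grasp confidences, down-weighting directionally uncertain views."""
    weights = np.asarray(kappas) / (np.sum(kappas) + 1e-9)
    return float(np.dot(weights, confidences))

# Two observations of the same grasp candidate from different viewpoints.
mus = [np.array([0.0, 0.0, -1.0]), np.array([0.1, 0.0, -1.0])]
mus = [m / np.linalg.norm(m) for m in mus]
kappas = [8.0, 3.0]   # the first view is more certain about the approach direction
confs = [0.9, 0.6]

mu, kappa = fuse_directions(mus, kappas)
print("fused direction:", mu, "concentration:", kappa,
      "confidence:", fuse_confidence(confs, kappas))
```

Summing kappa-scaled mean directions corresponds to multiplying vMF densities, so the resultant length naturally acts as a fused concentration: agreeing views sharpen the directional estimate, while conflicting views weaken it.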
Related papers
- VLMPlanner: Integrating Visual Language Models with Motion Planning [18.633637485218802]
VLMPlanner is a hybrid framework that combines a learning-based real-time planner with a vision-language model (VLM) capable of reasoning over raw images. We develop the Context-Adaptive Inference Gate mechanism that enables the VLM to mimic human driving behavior.
arXiv Detail & Related papers (2025-07-27T16:15:21Z)
- DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge [56.3802428957899]
We propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling. DreamVLA introduces a dynamic-region-guided world knowledge prediction, integrated with the spatial and semantic cues, which provide compact yet comprehensive representations for action planning. Experiments on both real-world and simulation environments demonstrate that DreamVLA achieves 76.7% success rate on real robot tasks.
arXiv Detail & Related papers (2025-07-06T16:14:29Z)
- World-aware Planning Narratives Enhance Large Vision-Language Model Planner [48.97399087613431]
Large Vision-Language Models (LVLMs) show promise for embodied planning tasks but struggle with complex scenarios. We propose World-Aware Planning Narrative Enhancement (WAP), a framework that infuses LVLMs with comprehensive environmental understanding.
arXiv Detail & Related papers (2025-06-26T13:20:55Z)
- Unlocking Smarter Device Control: Foresighted Planning with a World Model-Driven Code Execution Approach [83.21177515180564]
We propose a framework that prioritizes natural language understanding and structured reasoning to enhance the agent's global understanding of the environment. Our method outperforms previous approaches, particularly achieving a 44.4% relative improvement in task success rate.
arXiv Detail & Related papers (2025-05-22T09:08:47Z)
- Zero-Shot Iterative Formalization and Planning in Partially Observable Environments [11.066479432278301]
We propose PDDLego+, a framework to formalize, plan, grow, and refine PDDL representations in a zero-shot manner. We show that PDDLego+ improves goal reaching success and exhibits robustness against problem complexity.
arXiv Detail & Related papers (2025-05-19T13:58:15Z)
- Top-Down Compression: Revisit Efficient Vision Token Projection for Visual Instruction Tuning [70.57180215148125]
Visual instruction tuning aims to enable large language models to comprehend the visual world. Existing methods often grapple with the intractable trade-off between accuracy and efficiency. We present LLaVA-Meteor, a novel approach that strategically compresses visual tokens without compromising core information.
arXiv Detail & Related papers (2025-05-17T10:22:29Z)
- Seeing is Believing: Belief-Space Planning with Foundation Models as Uncertainty Estimators [34.28879194786174]
Generalizable robotic mobile manipulation in open-world environments poses significant challenges due to long horizons, complex goals, and partial observability.
A promising approach to address these challenges involves planning with a library of parameterized skills, where a task planner sequences these skills to achieve goals specified in structured languages.
This paper introduces a novel framework that leverages vision-language models to estimate uncertainty and facilitate symbolic grounding.
arXiv Detail & Related papers (2025-04-04T07:48:53Z)
- GraspCorrect: Robotic Grasp Correction via Vision-Language Model-Guided Feedback [23.48582504679409]
Even state-of-the-art policy models frequently exhibit unstable grasping behaviors.
We introduce GraspCorrect, a plug-and-play module designed to enhance grasp performance through vision-language model-guided feedback.
arXiv Detail & Related papers (2025-03-19T09:25:32Z)
- EmbodiedVSR: Dynamic Scene Graph-Guided Chain-of-Thought Reasoning for Visual Spatial Tasks [24.41705039390567]
EmbodiedVSR (Embodied Visual Spatial Reasoning) is a novel framework that integrates dynamic scene graph-guided Chain-of-Thought (CoT) reasoning. Our method enables zero-shot spatial reasoning without task-specific fine-tuning. Experiments demonstrate that our framework significantly outperforms existing MLLM-based methods in accuracy and reasoning coherence.
arXiv Detail & Related papers (2025-03-14T05:06:07Z)
- STAR: A Foundation Model-driven Framework for Robust Task Planning and Failure Recovery in Robotic Systems [5.426894918217948]
STAR (Smart Task Adaptation and Recovery) is a novel framework that synergizes Foundation Models (FMs) with dynamically expanding Knowledge Graphs (KGs). FMs offer remarkable generalization and contextual reasoning, but their limitations hinder reliable deployment. In experiments, STAR demonstrated an 86% task planning accuracy and a 78% recovery success rate, a significant improvement over baseline methods.
arXiv Detail & Related papers (2025-03-08T05:05:21Z)
- Platform-Aware Mission Planning [50.56223680851687]
We introduce the problem of Platform-Aware Mission Planning (PAMP), addressing it in the setting of temporal durative actions. The first baseline approach amalgamates the mission and platform levels, while the second is based on an abstraction-refinement loop. We prove the soundness and completeness of the proposed approaches and validate them experimentally.
arXiv Detail & Related papers (2025-01-16T16:20:37Z)
- Affordances-Oriented Planning using Foundation Models for Continuous Vision-Language Navigation [64.84996994779443]
We propose a novel Affordances-Oriented Planner for the continuous vision-language navigation (VLN) task.
Our AO-Planner integrates various foundation models to achieve affordances-oriented low-level motion planning and high-level decision-making.
Experiments on the challenging R2R-CE and RxR-CE datasets show that AO-Planner achieves state-of-the-art zero-shot performance.
arXiv Detail & Related papers (2024-07-08T12:52:46Z)
- VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs [102.36953558562436]
Vision language models (VLMs) are an exciting emerging class of language models (LMs). One understudied capability in VLMs is visual spatial planning. Our study introduces a benchmark that broadly evaluates the spatial planning capability of these models.
arXiv Detail & Related papers (2024-07-02T00:24:01Z)
- Low-Light Video Enhancement via Spatial-Temporal Consistent Illumination and Reflection Decomposition [68.6707284662443]
Low-Light Video Enhancement (LLVE) seeks to restore dynamic and static scenes plagued by severe invisibility and noise.
One critical aspect is formulating a consistency constraint specifically for the temporal-spatial illumination and appearance of enhanced versions.
We present an innovative video Retinex-based decomposition strategy that operates without the need for explicit supervision.
arXiv Detail & Related papers (2024-05-24T15:56:40Z)
- Salient Sparse Visual Odometry With Pose-Only Supervision [45.450357610621985]
Visual odometry (VO) is vital for navigation of autonomous systems.
Traditional VO methods struggle with challenges like variable lighting and motion blur.
Deep learning-based VO, though more adaptable, can face generalization problems in new environments.
This paper presents a novel hybrid visual odometry (VO) framework that leverages pose-only supervision.
arXiv Detail & Related papers (2024-04-06T16:48:08Z)
- LLM-SAP: Large Language Models Situational Awareness Based Planning [0.0]
We employ a multi-agent reasoning framework to develop a methodology that anticipates and actively mitigates potential risks.
Our approach diverges from traditional automata theory by incorporating the complexity of human-centric interactions into the planning process.
arXiv Detail & Related papers (2023-12-26T17:19:09Z)
- Planning as In-Painting: A Diffusion-Based Embodied Task Planning Framework for Environments under Uncertainty [56.30846158280031]
Task planning for embodied AI has been one of the most challenging problems.
We propose a task-agnostic method named 'planning as in-painting'.
The proposed framework achieves promising performances in various embodied AI tasks.
arXiv Detail & Related papers (2023-12-02T10:07:17Z)
- Robust and Precise Facial Landmark Detection by Self-Calibrated Pose Attention Network [73.56802915291917]
We propose a semi-supervised framework to achieve more robust and precise facial landmark detection.
A Boundary-Aware Landmark Intensity (BALI) field is proposed to model more effective facial shape constraints.
A Self-Calibrated Pose Attention (SCPA) model is designed to provide a self-learned objective function that enforces intermediate supervision.
arXiv Detail & Related papers (2021-12-23T02:51:08Z)
- Efficient Empowerment Estimation for Unsupervised Stabilization [75.32013242448151]
The empowerment principle enables unsupervised stabilization of dynamical systems at upright positions.
We propose an alternative solution based on a trainable representation of a dynamical system as a Gaussian channel.
We show that our method has a lower sample complexity, is more stable in training, possesses the essential properties of the empowerment function, and allows estimation of empowerment from images.
arXiv Detail & Related papers (2020-07-14T21:10:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.