GraspCorrect: Robotic Grasp Correction via Vision-Language Model-Guided Feedback
- URL: http://arxiv.org/abs/2503.15035v1
- Date: Wed, 19 Mar 2025 09:25:32 GMT
- Title: GraspCorrect: Robotic Grasp Correction via Vision-Language Model-Guided Feedback
- Authors: Sungjae Lee, Yeonjoo Hong, Kwang In Kim
- Abstract summary: Even state-of-the-art policy models frequently exhibit unstable grasping behaviors. We introduce GraspCorrect, a plug-and-play module designed to enhance grasp performance through vision-language model-guided feedback.
- Score: 23.48582504679409
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Despite significant advancements in robotic manipulation, achieving consistent and stable grasping remains a fundamental challenge, often limiting the successful execution of complex tasks. Our analysis reveals that even state-of-the-art policy models frequently exhibit unstable grasping behaviors, leading to failure cases that create bottlenecks in real-world robotic applications. To address these challenges, we introduce GraspCorrect, a plug-and-play module designed to enhance grasp performance through vision-language model-guided feedback. GraspCorrect employs an iterative visual question-answering framework with two key components: grasp-guided prompting, which incorporates task-specific constraints, and object-aware sampling, which ensures the selection of physically feasible grasp candidates. By iteratively generating intermediate visual goals and translating them into joint-level actions, GraspCorrect significantly improves grasp stability and consistently enhances task success rates across existing policy models in the RLBench and CALVIN datasets.
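The abstract outlines an iterative correction loop; below is a minimal sketch of that loop, assuming hypothetical interfaces (`vlm_ask`, `vlm_choose`, `sample_candidates`, `to_joint_action`, `execute`), since the paper's actual API is not shown here:
```python
# Minimal sketch of the iterative VQA correction loop; every callable
# here is a hypothetical placeholder, not the authors' actual interface.
def grasp_correct(image, task, vlm_ask, vlm_choose, sample_candidates,
                  to_joint_action, execute, max_iters=5):
    for _ in range(max_iters):
        # Grasp-guided prompting: check task-specific constraints first.
        if vlm_ask(image, f"Is the current grasp stable for: {task}?"):
            return image  # grasp judged stable; stop correcting
        # Object-aware sampling: only physically feasible grasp candidates.
        candidates = sample_candidates(image)
        # The selected candidate serves as the intermediate visual goal.
        goal = vlm_choose(image, candidates, task)
        # Translate the visual goal into joint-level actions and execute.
        image = execute(to_joint_action(goal))
    return image
```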
Related papers
- Towards Robust Semantic Correspondence: A Benchmark and Insights [0.0]
We establish a novel benchmark for evaluating semantic correspondence in adverse conditions. The benchmark dataset comprises 14 distinct challenging scenarios that reflect commonly encountered imaging issues. We provide several key insights into the robustness of semantic correspondence approaches.
arXiv Detail & Related papers (2025-08-01T02:38:39Z) - Diagnosing and Mitigating Modality Interference in Multimodal Large Language Models [28.20124264650572]
Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across tasks. They often exhibit difficulty in distinguishing task-relevant from irrelevant signals, particularly in tasks like Visual Question Answering (VQA). This vulnerability becomes more evident in modality-specific tasks such as image classification or pure text question answering. We propose a novel framework to fine-tune MLLMs, including perturbation-based data augmentation with both heuristic and adversarial perturbations.
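As a rough illustration of the adversarial side of such augmentation, here is a generic PGD-style perturbation step (a sketch, not necessarily the paper's exact recipe):
```python
# Generic PGD-style adversarial augmentation: perturb the image within
# an L_inf ball so the task loss increases, then fine-tune on the result.
# Hyperparameters and interfaces are illustrative assumptions.
import torch

def pgd_perturb(model, image, target, loss_fn, eps=4/255, alpha=1/255, steps=3):
    """Return an adversarially perturbed copy of `image`."""
    adv = image.clone().detach().requires_grad_(True)
    for _ in range(steps):
        loss = loss_fn(model(adv), target)
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv = adv + alpha * grad.sign()               # ascend the loss
            adv = image + (adv - image).clamp(-eps, eps)  # project into L_inf ball
            adv = adv.clamp(0, 1)                         # keep a valid image
        adv.requires_grad_(True)
    return adv.detach()
```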
arXiv Detail & Related papers (2025-05-26T07:31:32Z) - Salience-Invariant Consistent Policy Learning for Generalization in Visual Reinforcement Learning [12.9372563969007]
Generalizing policies to unseen scenarios remains a critical challenge in visual reinforcement learning.
In unseen environments, distracting pixels may lead agents to extract representations containing task-irrelevant information.
We propose the Salience-Invariant Consistent Policy Learning algorithm, an efficient framework for zero-shot generalization.
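One generic way to encourage salience invariance, sketched under the assumption of a discrete-action policy and an image-augmentation function (not the paper's exact objective):
```python
# Illustrative consistency objective: the policy should produce the same
# action distribution on a clean observation and on a version with
# task-irrelevant visual distractions added.
import torch
import torch.nn.functional as F

def consistency_loss(policy, obs, augment):
    """KL between action logits on clean and distractor-augmented observations."""
    logits_clean = policy(obs)
    logits_aug = policy(augment(obs))  # e.g., random overlays / color jitter
    return F.kl_div(
        F.log_softmax(logits_aug, dim=-1),
        F.softmax(logits_clean.detach(), dim=-1),  # clean view as the target
        reduction="batchmean",
    )
```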
arXiv Detail & Related papers (2025-02-12T12:00:16Z) - SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Code Generation [14.786100203787194]
Large language models demonstrate exceptional performance in simple code generation tasks but face challenges in tackling complex problems.
We propose a reasoning-augmented data generation process, SRA-MCTS, which guides the model to autonomously generate high-quality intermediate reasoning paths.
Our method operates entirely through the model itself without requiring additional supervision.
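A compact MCTS skeleton for growing reasoning paths; `expand` and `rollout` stand in for the model-driven generation and self-evaluation steps the summary only names:
```python
# Generic MCTS skeleton over reasoning steps (illustrative; expansion
# and rollout are placeholders for model-driven generation/scoring).
import math, random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def ucb(node, c=1.4):
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def mcts(root, expand, rollout, iters=100):
    for _ in range(iters):
        node = root
        while node.children:                      # selection
            node = max(node.children, key=ucb)
        for step in expand(node.state):           # expansion: candidate steps
            node.children.append(Node(node.state + [step], node))
        leaf = random.choice(node.children) if node.children else node
        reward = rollout(leaf.state)              # model self-evaluates the path
        while leaf:                               # backpropagation
            leaf.visits += 1
            leaf.value += reward
            leaf = leaf.parent
    return max(root.children, key=lambda n: n.visits).state
```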
arXiv Detail & Related papers (2024-11-17T12:31:04Z) - Investigating the Role of Instruction Variety and Task Difficulty in Robotic Manipulation Tasks [50.75902473813379]
This work introduces a comprehensive evaluation framework that systematically examines the role of instructions and inputs in the generalisation abilities of such models.
The proposed framework uncovers the resilience of multimodal models to extreme instruction perturbations and their vulnerability to observational changes.
arXiv Detail & Related papers (2024-07-04T14:36:49Z) - Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing.
Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the prior of the underlying Large Language Model (LLM) rather than by the input image.
To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
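One common training-free tactic consistent with this summary is to contrast logits computed with and without the image; the sketch below is illustrative, and the `model(image=..., input_ids=...)` signature is an assumption:
```python
# Contrastive-decoding-style correction (illustrative, not necessarily
# the paper's exact strategy): down-weight tokens that the language
# prior alone would produce, boosting evidence that requires the image.
import torch

def vision_contrastive_logits(model, image, text_ids, alpha=1.0):
    with torch.no_grad():
        logits_with = model(image=image, input_ids=text_ids)  # vision + text
        logits_text = model(image=None, input_ids=text_ids)   # language prior only
    # alpha controls how strongly the prior-driven component is removed.
    return logits_with + alpha * (logits_with - logits_text)
```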
arXiv Detail & Related papers (2024-03-08T12:35:07Z) - Spurious Feature Eraser: Stabilizing Test-Time Adaptation for Vision-Language Foundation Model [86.9619638550683]
Vision-language foundation models have exhibited remarkable success across a multitude of downstream tasks due to their scalability on extensive image-text paired data. However, these models display significant limitations when applied to downstream tasks, such as fine-grained image classification, as a result of "decision shortcuts".
arXiv Detail & Related papers (2024-03-01T09:01:53Z) - CogCoM: A Visual Language Model with Chain-of-Manipulations Reasoning [61.21923643289266]
Chain of Manipulations is a mechanism that enables Vision-Language Models to solve problems step-by-step with evidence. After training, models can solve various visual problems by eliciting intrinsic manipulations (e.g., grounding, zoom in) actively without involving external tools. Our trained model, CogCoM, achieves state-of-the-art performance across 9 benchmarks from 4 categories.
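An illustrative chain-of-manipulations loop, with `vlm` and `apply_op` as assumed interfaces rather than CogCoM's actual implementation:
```python
# Sketch of a step-by-step manipulation loop: the model proposes its
# next intrinsic manipulation until it commits to a final answer.
def solve_with_manipulations(image, question, vlm, apply_op, max_steps=5):
    evidence = []
    for _ in range(max_steps):
        # e.g. ("grounding", "the red cup"), ("zoom_in", box), or ("answer", text)
        op, arg = vlm.next_step(image, question, evidence)
        if op == "answer":
            return arg
        image = apply_op(image, op, arg)  # crop, zoom, highlight, ...
        evidence.append((op, arg))
    return vlm.answer(image, question, evidence)  # fall back after max_steps
```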
arXiv Detail & Related papers (2024-02-06T18:43:48Z) - De-fine: Decomposing and Refining Visual Programs with Auto-Feedback [75.62712247421146]
De-fine is a training-free framework that decomposes complex tasks into simpler subtasks and refines programs through auto-feedback.
Our experiments across various visual tasks show that De-fine creates more robust programs.
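A minimal decompose-and-refine loop in the spirit of this summary; `llm` and `run_program` are assumed interfaces, not De-fine's API:
```python
# Decompose a complex task into subtasks, compose a program from the
# snippets, and iteratively patch it using execution feedback.
def decompose_and_refine(task, llm, run_program, max_rounds=3):
    subtasks = llm.decompose(task)                  # complex task -> subtasks
    program = "\n".join(llm.write_snippet(s) for s in subtasks)
    for _ in range(max_rounds):
        result, feedback = run_program(program)     # execution yields auto-feedback
        if not feedback:                            # no feedback = program accepted
            return result
        program = llm.refine(program, feedback)     # patch the failing parts
    return run_program(program)[0]
```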
arXiv Detail & Related papers (2023-11-21T06:24:09Z) - GADY: Unsupervised Anomaly Detection on Dynamic Graphs [18.1896489628884]
We propose a continuous dynamic graph model to capture fine-grained information, overcoming the limitations of existing discrete methods.
For the second challenge, we pioneer the use of Generative Adversarial Networks to generate negative interactions.
Our proposed GADY significantly outperforms the previous state-of-the-art method on three real-world datasets.
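A generic GAN-style step for generating negative interactions, sketched under assumed tensor shapes rather than GADY's actual architecture:
```python
# A generator proposes fake edges; an anomaly scorer is trained to
# separate them from real interactions (real -> 0, generated -> 1).
import torch
import torch.nn.functional as F

def gan_step(generator, scorer, real_edges, g_opt, d_opt, noise_dim=16):
    n = real_edges.size(0)
    fake_edges = generator(torch.randn(n, noise_dim))

    # Train the scorer to flag generated negatives as anomalous.
    d_loss = (F.binary_cross_entropy_with_logits(scorer(real_edges), torch.zeros(n, 1))
              + F.binary_cross_entropy_with_logits(scorer(fake_edges.detach()), torch.ones(n, 1)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Train the generator to produce negatives hard enough to fool the scorer.
    g_loss = F.binary_cross_entropy_with_logits(scorer(fake_edges), torch.zeros(n, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```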
arXiv Detail & Related papers (2023-10-25T05:27:45Z) - Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models [80.23791222509644]
Inconsistent AI models are considered brittle and untrustworthy by human users.
We find that state-of-the-art vision-language models suffer from a surprisingly high degree of inconsistent behavior across tasks.
We propose a rank correlation-based auxiliary training objective, computed over large automatically created cross-task contrast sets.
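A plain Spearman rank-correlation measure over a contrast set is sketched below; note that training on it would require a differentiable surrogate for the rank operation, which this plain version does not provide:
```python
# Spearman correlation between two score vectors over the same contrast
# set; disagreement across tasks can then be penalized as 1 - rho.
import torch

def spearman_rho(scores_a, scores_b):
    def ranks(x):
        return x.argsort().argsort().float()  # rank positions (ties ignored)
    ra, rb = ranks(scores_a), ranks(scores_b)
    ra, rb = ra - ra.mean(), rb - rb.mean()
    return (ra * rb).sum() / (ra.norm() * rb.norm() + 1e-8)

# e.g., loss = 1 - spearman_rho(vqa_scores, caption_scores) for the same
# image variants, encouraging the two task heads to rank them alike.
```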
arXiv Detail & Related papers (2023-03-28T16:57:12Z) - Goal-Aware Prediction: Learning to Model What Matters [105.43098326577434]
One of the fundamental challenges in using a learned forward dynamics model is the mismatch between the objective of the learned model and that of the downstream planner or policy.
We propose to direct prediction towards task relevant information, enabling the model to be aware of the current task and encouraging it to only model relevant quantities of the state space.
We find that our method more effectively models the relevant parts of the scene conditioned on the goal, and as a result outperforms standard task-agnostic dynamics models and model-free reinforcement learning.
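A goal-conditioned forward-model loss in this spirit, sketched with assumed encoder and dynamics interfaces: rather than reconstructing the full next state, the model predicts only a goal-relevant quantity.
```python
# Predict the goal-relative latent of the next state instead of every
# pixel, so the model focuses on what matters for the current goal.
import torch
import torch.nn.functional as F

def goal_aware_loss(dynamics, encoder, s_t, a_t, s_next, goal):
    pred = dynamics(encoder(s_t), a_t, encoder(goal))
    target = (encoder(s_next) - encoder(goal)).detach()  # goal-relevant residual
    return F.mse_loss(pred, target)
```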
arXiv Detail & Related papers (2020-07-14T16:42:59Z)