Related papers: Mitigating Cross-Modal Distraction and Ensuring Geometric Feasibility via Affordance-Guided and Self-Consistent MLLMs for Task Planning in Instruction-Following Manipulation

Mitigating Cross-Modal Distraction and Ensuring Geometric Feasibility via Affordance-Guided and Self-Consistent MLLMs for Task Planning in Instruction-Following Manipulation

URL: http://arxiv.org/abs/2503.13055v2
Date: Wed, 08 Oct 2025 03:30:23 GMT
Title: Mitigating Cross-Modal Distraction and Ensuring Geometric Feasibility via Affordance-Guided and Self-Consistent MLLMs for Task Planning in Instruction-Following Manipulation
Authors: Yu-Hong Shen, Chuan-Yu Wu, Yi-Ru Yang, Yen-Ling Tai, Yi-Ting Chen,
Abstract summary: We introduce textbfQuARC (Quantity, Analysis, Relative positioning, Collision), a new benchmark based on a food preparation scenario.<n>We tackle two major limitations of current MLLMs: cross-modal distraction and geometric infeasibility.<n>Our method achieves a 76.7% success rate on the benchmark, significantly outperforming the ViLa baseline.
Score: 5.903105418868711
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We investigate the use of Multimodal Large Language Models (MLLMs) with in-context learning for closed-loop task planning in instruction-following manipulation. We identify four essential requirements for successful task planning: quantity estimation, reachability analysis, relative positioning, and collision avoidance. However, existing benchmarks fail to support holistic evaluation across all these aspects. To address this gap, we introduce \textbf{QuARC} (Quantity, Analysis, Relative positioning, Collision), a new benchmark based on a food preparation scenario that integrates all four challenges. Using QuARC, we reveal two major limitations of current MLLMs: cross-modal distraction and geometric infeasibility. To tackle these, we adapt Chain-of-Thought with Self-Consistency to mitigate reasoning loss from cross-modal distractions and incorporate an affordance predictor to guide planning based on geometric feasibility. Our comprehensive evaluation analyzes performance across multiple baselines and explains sources of improvement. Our method achieves a 76.7\% success rate on the benchmark, significantly outperforming the ViLa baseline (36.7\%), without requiring additional finetuning. Code and dataset are available at https://hcis-lab.github.io/Affordance-Guided-Self-Consistent-MLLM.

Related papers

MAXS: Meta-Adaptive Exploration with LLM Agents [48.04723638253802]
MaxS is a meta-adaptive reasoning framework based on Large Language Model (LLM) Agents.<n>MAXS employs a lookahead strategy to extend reasoning paths a few steps ahead.<n>It combines step consistency variance and inter-step trend slopes to jointly select stable, consistent, and high-value reasoning steps.
arXiv Detail & Related papers (2026-01-14T07:48:00Z)
MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization [103.74675519953898]
Long-chain reflective reasoning is a prerequisite for solving complex real-world problems.<n>We build a benchmark consisting 1,260 samples of 42 challenging synthetic tasks.<n>We generate post-training data and explore learning paradigms for exploiting such data.
arXiv Detail & Related papers (2025-10-09T17:53:58Z)
Plan before Solving: Problem-Aware Strategy Routing for Mathematical Reasoning with LLMs [49.995906301946]
Existing methods usually leverage a fixed strategy to guide Large Language Models (LLMs) to perform mathematical reasoning.<n>Our analysis reveals that the single strategy cannot adapt to problem-specific requirements and thus overlooks the trade-off between effectiveness and efficiency.<n>We propose Planning and Routing through Instance-Specific Modeling (PRISM), a novel framework that decouples mathematical reasoning into two stages: strategy planning and targeted execution.
arXiv Detail & Related papers (2025-09-29T07:22:41Z)
Plan Verification for LLM-Based Embodied Task Completion Agents [10.439882851477162]
Large language model (LLM) based task plans and corresponding human demonstrations for embodied AI may be noisy.<n>We propose an iterative verification framework in which a Judge LLM critiques action sequences and a Planner LLM applies the revisions.
arXiv Detail & Related papers (2025-09-02T19:06:56Z)
Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting [70.83781268763215]
Vision-language models (VLMs) have achieved impressive performance across diverse multimodal tasks by leveraging large-scale pre-training.<n>VLMs face unique challenges such as cross-modal feature drift, parameter interference due to shared architectures, and zero-shot capability erosion.<n>This survey aims to serve as a comprehensive and diagnostic reference for researchers developing lifelong vision-language systems.
arXiv Detail & Related papers (2025-08-06T09:03:10Z)
Survey of Adversarial Robustness in Multimodal Large Language Models [17.926240920647892]
Multimodal Large Language Models (MLLMs) have demonstrated exceptional performance in artificial intelligence. Their deployment in real-world applications raises significant concerns about adversarial vulnerabilities. This paper reviews the adversarial robustness of MLLMs, covering different modalities.
arXiv Detail & Related papers (2025-03-18T06:54:59Z)
An Empirical Exploration of ChatGPT's Ability to Support Problem Formulation Tasks for Mission Engineering and a Documentation of its Performance Variability [0.0]
This paper explores the quality and consistency of large language models (LLM) in supporting mission engineering problem formulation tasks.<n>We identify a relevant reference problem, a NASA space mission design challenge, and document ChatGPT-3.5's ability to perform stakeholder identification tasks.<n>We find that the LLM performs well in identifying human-focused stakeholders but poorly in recognizing external systems and environmental factors.
arXiv Detail & Related papers (2025-02-05T17:58:23Z)
Evaluating and Advancing Multimodal Large Language Models in Perception Ability Lens [30.083110119139793]
We introduce textbfAbilityLens, a unified benchmark designed to evaluate MLLMs in six key perception abilities.<n>We identify the strengths and weaknesses of current main-stream MLLMs, highlighting stability patterns and revealing a notable performance gap between state-of-the-art open-source and closed-source models.
arXiv Detail & Related papers (2024-11-22T04:41:20Z)
Understanding Chain-of-Thought in LLMs through Information Theory [16.78730663293352]
We formalize Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs) through an information-theoretic lens.<n>Specifically, our framework quantifies the information-gain' at each reasoning step, enabling the identification of failure modes.<n>We demonstrate the efficacy of our approach through extensive experiments on toy arithmetic, GSM8K and PRM800k datasets.
arXiv Detail & Related papers (2024-11-18T19:14:36Z)
The Role of Deductive and Inductive Reasoning in Large Language Models [37.430396755248104]
We propose the Deductive and InDuctive(DID) method to enhance Large Language Models (LLMs) reasoning.<n>DID implements a dual-metric complexity evaluation system that combines Littlestone dimension and information entropy.<n>Our results demonstrate significant improvements in reasoning quality and solution accuracy.
arXiv Detail & Related papers (2024-10-03T18:30:47Z)
ET-Plan-Bench: Embodied Task-level Planning Benchmark Towards Spatial-Temporal Cognition with Foundation Models [38.89166693142495]
ET-Plan-Bench is a benchmark for embodied task planning using Large Language Models (LLMs)<n>It features a controllable and diverse set of embodied tasks varying in different levels of difficulties and complexities.<n>Our benchmark distinguishes itself as a large-scale, quantifiable, highly automated, and fine-grained diagnostic framework.
arXiv Detail & Related papers (2024-10-02T19:56:38Z)
Assessing Modality Bias in Video Question Answering Benchmarks with Multimodal Large Language Models [12.841405829775852]
We introduce the modality importance score (MIS) to identify bias inVidQA benchmarks and datasets.<n>We also propose an innovative method using state-of-the-art MLLMs to estimate the modality importance.<n>Our results indicate that current models do not effectively integrate information due to modality imbalance in existing datasets.
arXiv Detail & Related papers (2024-08-22T23:32:42Z)
Towards Effective Evaluations and Comparisons for LLM Unlearning Methods [97.2995389188179]
This paper seeks to refine the evaluation of machine unlearning for large language models.<n>It addresses two key challenges -- the robustness of evaluation metrics and the trade-offs between competing goals.
arXiv Detail & Related papers (2024-06-13T14:41:00Z)
MARS: Benchmarking the Metaphysical Reasoning Abilities of Language Models with a Multi-task Evaluation Dataset [50.36095192314595]
Large Language Models (LLMs) function as conscious agents with generalizable reasoning capabilities. This ability remains underexplored due to the complexity of modeling infinite possible changes in an event. We introduce the first-ever benchmark, MARS, comprising three tasks corresponding to each step.
arXiv Detail & Related papers (2024-06-04T08:35:04Z)
Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks. However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs. We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z)
TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety. Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs. We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z)
Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models [61.28463542324576]
Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can generate human-like outputs. We evaluate existing state-of-the-art VLMs and find that even the best-performing model is unable to demonstrate strong visual reasoning capabilities and consistency. We propose a two-stage training framework aimed at improving both the reasoning performance and consistency of VLMs.
arXiv Detail & Related papers (2023-09-08T17:49:44Z)
Improving Open Information Extraction with Large Language Models: A Study on Demonstration Uncertainty [52.72790059506241]
Open Information Extraction (OIE) task aims at extracting structured facts from unstructured text. Despite the potential of large language models (LLMs) like ChatGPT as a general task solver, they lag behind state-of-the-art (supervised) methods in OIE tasks.
arXiv Detail & Related papers (2023-09-07T01:35:24Z)
TaskLAMA: Probing the Complex Task Understanding of Language Models [13.336015994186955]
Structured Complex Task Decomposition (SCTD) is a problem of breaking down a complex real-world task into a directed acyclic graph over individual steps that contribute to achieving the task. We probe how accurately SCTD can be done with the knowledge extracted from Large Language Models (LLMs) Our experiments reveal that LLMs are able to decompose complex tasks into individual steps effectively, with a relative improvement of 15% to 280% over the best baseline.
arXiv Detail & Related papers (2023-08-29T13:36:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.