Related papers: Prune-Then-Plan: Step-Level Calibration for Stable Frontier Exploration in Embodied Question Answering

Prune-Then-Plan: Step-Level Calibration for Stable Frontier Exploration in Embodied Question Answering

URL: http://arxiv.org/abs/2511.19768v1
Date: Mon, 24 Nov 2025 22:50:50 GMT
Title: Prune-Then-Plan: Step-Level Calibration for Stable Frontier Exploration in Embodied Question Answering
Authors: Noah Frahm, Prakrut Patel, Yue Zhang, Shoubin Yu, Mohit Bansal, Roni Sengupta,
Abstract summary: Large vision-language models (VLMs) have improved embodied question answering (EQA) agents by providing strong semantic priors for open-vocabulary reasoning.<n>We propose Prune-Then-Plan, a framework that stabilizes exploration through step-level calibration.
Score: 52.69447404069251
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large vision-language models (VLMs) have improved embodied question answering (EQA) agents by providing strong semantic priors for open-vocabulary reasoning. However, when used directly for step-level exploration, VLMs often exhibit frontier oscillations, unstable back-and-forth movements caused by overconfidence and miscalibration, leading to inefficient navigation and degraded answer quality. We propose Prune-Then-Plan, a simple and effective framework that stabilizes exploration through step-level calibration. Instead of trusting raw VLM scores, our method prunes implausible frontier choices using a Holm-Bonferroni inspired pruning procedure and then delegates final decisions to a coverage-based planner. This separation converts overconfident predictions into conservative, interpretable actions by relying on human-level judgments to calibrate the step-level behavior of VLMs. Integrated into the 3D-Mem EQA framework, our approach achieves relative improvements of up to 49% and 33% in visually grounded SPL and LLM-Match metrics respectively over baselines. Overall, our method achieves better scene coverage under equal exploration budgets on both OpenEQA and EXPRESS-Bench datasets.

Related papers

Verified Critical Step Optimization for LLM Agents [67.05296684575445]
Critical Step Optimization focuses preference learning on verified critical steps.<n>Method starts from failed policy trajectories rather than expert demonstrations.<n>Experiments on GAIA-Text-103 and XBench-DeepSearch show that CSO achieves 37% and 26% relative improvement over the SFT baseline.
arXiv Detail & Related papers (2026-02-03T11:41:02Z)
LLMs can Compress LLMs: Adaptive Pruning by Agents [0.0]
Post-training pruning has emerged as a promising approach to reduce computational costs while preserving performance.<n>We introduce agent-guided pruning, where a foundation model acts as an adaptive pruning agent.<n>We evaluate our approach on Q3 models (4B and 8B parameters) at approximately 45% sparsity, demonstrating substantial improvements over structured pruning baselines.
arXiv Detail & Related papers (2026-01-14T18:45:36Z)
Knowing the Answer Isn't Enough: Fixing Reasoning Path Failures in LVLMs [85.37131922131657]
We reveal a critical yet underexplored flaw in Large Vision-Language Models (LVLMs)<n>Even when these models know the correct answer, they frequently arrive there through incorrect reasoning paths.<n>We propose PSO (Path-Select Optimization), a two-stage post-training framework designed to enhance both the reasoning performance and stability of existing LVLMs.
arXiv Detail & Related papers (2025-12-06T03:02:55Z)
Continual Action Quality Assessment via Adaptive Manifold-Aligned Graph Regularization [53.82400605816587]
Action Quality Assessment (AQA) quantifies human actions in videos, supporting applications in sports scoring, rehabilitation, and skill evaluation.<n>A major challenge lies in the non-stationary nature of quality distributions in real-world scenarios.<n>We introduce Continual AQA (CAQA), which equips with Continual Learning capabilities to handle evolving distributions.
arXiv Detail & Related papers (2025-10-08T10:09:47Z)
Quantization Meets Reasoning: Exploring and Mitigating Degradation of Low-Bit LLMs in Mathematical Reasoning [39.56908863102256]
Low-bit post-training quantization impairs mathematical reasoning up to 69.81% in harder settings.<n>We address two deployment-critical questions with process-level precision.<n>In our settings, as few as 332 curated examples and 3--5 minutes of compute on a single GPU recover 4-bit weight math reasoning toward the full-precision baseline.
arXiv Detail & Related papers (2025-05-16T12:11:40Z)
Data-Driven Calibration of Prediction Sets in Large Vision-Language Models Based on Inductive Conformal Prediction [0.0]
We propose a model-agnostic uncertainty quantification method that integrates dynamic threshold calibration and cross-modal consistency verification.<n>We show that the framework achieves stable performance across varying calibration-to-test split ratios, underscoring its robustness for real-world deployment in healthcare, autonomous systems, and other safety-sensitive domains.<n>This work bridges the gap between theoretical reliability and practical applicability in multi-modal AI systems, offering a scalable solution for hallucination detection and uncertainty-aware decision-making.
arXiv Detail & Related papers (2025-04-24T15:39:46Z)
Learning Conformal Abstention Policies for Adaptive Risk Management in Large Language and Vision-Language Models [3.958317527488534]
Large Language and Vision-Language Models (LLMs/VLMs) are increasingly used in safety-critical applications.<n>Uncertainty quantification helps assess prediction confidence and enables abstention when uncertainty is high.<n>We propose learnable abstention, integrating reinforcement learning (RL) with Conformal Prediction (CP) to optimize abstention thresholds.
arXiv Detail & Related papers (2025-02-08T21:30:41Z)
Unconditional Truthfulness: Learning Unconditional Uncertainty of Large Language Models [104.55763564037831]
We train a regression model that leverages attention maps, probabilities on the current generation step, and recurrently computed uncertainty scores from previously generated tokens.<n>Our evaluation shows that the proposed method is highly effective for selective generation, achieving substantial improvements over rivaling unsupervised and supervised approaches.
arXiv Detail & Related papers (2024-08-20T09:42:26Z)
ROPO: Robust Preference Optimization for Large Language Models [59.10763211091664]
We propose an iterative alignment approach that integrates noise-tolerance and filtering of noisy samples without the aid of external models. Experiments on three widely-used datasets with Mistral-7B and Llama-2-7B demonstrate that ROPO significantly outperforms existing preference alignment methods.
arXiv Detail & Related papers (2024-04-05T13:58:51Z)
How to Prune Your Language Model: Recovering Accuracy on the "Sparsity May Cry'' Benchmark [60.72725673114168]
We revisit the question of accurate BERT-pruning during fine-tuning on downstream datasets. We propose a set of general guidelines for successful pruning, even on the challenging SMC benchmark.
arXiv Detail & Related papers (2023-12-21T03:11:30Z)
Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks. We instruct an LLM to self-evaluate its answers. We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.