INSIGHT: INference-time Sequence Introspection for Generating Help Triggers in Vision-Language-Action Models
- URL: http://arxiv.org/abs/2510.01389v1
- Date: Wed, 01 Oct 2025 19:22:48 GMT
- Title: INSIGHT: INference-time Sequence Introspection for Generating Help Triggers in Vision-Language-Action Models
- Authors: Ulas Berk Karli, Ziyao Shangguan, Tesca Fitzgerald
- Abstract summary: Recent Vision-Language-Action (VLA) models show strong generalization capabilities, yet they lack introspective mechanisms for anticipating failures and requesting help from a human supervisor. We present INSIGHT, a learning framework for leveraging token-level uncertainty signals to predict when a VLA should request help.
- Score: 2.509305596181814
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent Vision-Language-Action (VLA) models show strong generalization capabilities, yet they lack introspective mechanisms for anticipating failures and requesting help from a human supervisor. We present \textbf{INSIGHT}, a learning framework for leveraging token-level uncertainty signals to predict when a VLA should request help. Using $\pi_0$-FAST as the underlying model, we extract per-token \emph{entropy}, \emph{log-probability}, and Dirichlet-based estimates of \emph{aleatoric and epistemic uncertainty}, and train compact transformer classifiers to map these sequences to help triggers. We explore strong and weak supervision regimes and extensively compare them across in-distribution and out-of-distribution tasks. Our results show a trade-off: strong labels enable models to capture fine-grained uncertainty dynamics for reliable help detection, while weak labels, though noisier, still support competitive introspection when training and evaluation are aligned, offering a scalable path when dense annotation is impractical. Crucially, we find that modeling the temporal evolution of token-level uncertainty signals with transformers provides far greater predictive power than static sequence-level scores. This study provides the first systematic evaluation of uncertainty-based introspection in VLAs, opening future avenues for active learning and for real-time error mitigation through selective human intervention.
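The paper's exact feature-extraction pipeline is not reproduced here, but two of the four signals it names, per-token entropy and log-probability of the sampled token, can be sketched from raw decoding-step logits. All function and variable names below are illustrative, not from the INSIGHT codebase:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def token_uncertainty_features(step_logits, chosen_ids):
    """For each decoding step, compute the entropy of the token
    distribution and the log-probability of the chosen token.
    The resulting (entropy, logprob) sequence is the kind of input
    a downstream help-trigger classifier would consume."""
    features = []
    for logits, tok in zip(step_logits, chosen_ids):
        probs = softmax(logits)
        entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
        logprob = math.log(probs[tok])
        features.append((entropy, logprob))
    return features

# Toy example: two decoding steps over a 3-token vocabulary.
# Step 1 is confident; step 2 is maximally uncertain (uniform).
steps = [[2.0, 0.1, 0.1], [0.5, 0.5, 0.5]]
chosen = [0, 1]
feats = token_uncertainty_features(steps, chosen)
```

In the uniform second step the entropy reaches its maximum, ln(3), and the chosen token's log-probability is ln(1/3); a rising entropy trend over consecutive action tokens is exactly the temporal signal the abstract argues a transformer classifier can exploit.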
Related papers
- Understanding Degradation with Vision Language Model [56.09241449206817]
Understanding visual degradations is a critical yet challenging problem in computer vision. We introduce DU-VLM, a multimodal chain-of-thought model trained with supervised fine-tuning and reinforcement learning. We also introduce DU-110k, a large-scale dataset comprising 110,000 clean-degraded pairs with grounded physical annotations.
arXiv Detail & Related papers (2026-02-04T13:51:15Z) - From Passive Metric to Active Signal: The Evolving Role of Uncertainty Quantification in Large Language Models [77.04403907729738]
This survey charts the evolution of uncertainty from a passive diagnostic metric to an active control signal guiding real-time model behavior. We demonstrate how uncertainty is leveraged as an active control signal across three frontiers. This survey argues that mastering this evolving role of uncertainty is essential for building the next generation of scalable, reliable, and trustworthy AI.
arXiv Detail & Related papers (2026-01-22T06:21:31Z) - Deep Recurrent Hidden Markov Learning Framework for Multi-Stage Advanced Persistent Threat Prediction [0.0538441598991272]
Advanced Persistent Threats (APTs) represent hidden, multistage cyberattacks whose long-term persistence and adaptive behavior challenge conventional intrusion detection systems (IDS). This paper proposes E-HiDNet, a unified hybrid deep probabilistic learning framework that integrates convolutional and recurrent neural networks with a Hidden Markov Model (HMM) to allow accurate prediction of the progression of the APT campaign. Simulation results show that E-HiDNet achieves up to 98.8-100% accuracy in stage prediction and significantly outperforms standalone HMMs when four or more observations are available.
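E-HiDNet's internals are not detailed in this summary, but the HMM component it builds on infers the current attack stage via the standard forward algorithm. A minimal sketch with a hypothetical two-stage model (stage names, matrices, and observation symbols are all assumptions for illustration):

```python
def hmm_forward(obs, init, trans, emit):
    """Forward algorithm: the filtered (normalized) distribution over
    hidden stages after each observation in the sequence."""
    n_states = len(init)
    alpha = [init[s] * emit[s][obs[0]] for s in range(n_states)]
    norm = sum(alpha)
    alpha = [a / norm for a in alpha]
    history = [alpha]
    for o in obs[1:]:
        # Predict-then-update: propagate through transitions, weight by emission.
        alpha = [
            emit[s][o] * sum(alpha[p] * trans[p][s] for p in range(n_states))
            for s in range(n_states)
        ]
        norm = sum(alpha)
        alpha = [a / norm for a in alpha]
        history.append(alpha)
    return history

# Hypothetical 2-stage model: "reconnaissance" -> "exfiltration",
# with two observation symbols (0: benign-looking, 1: suspicious).
init = [0.9, 0.1]
trans = [[0.7, 0.3], [0.0, 1.0]]   # stages only progress forward
emit = [[0.8, 0.2], [0.1, 0.9]]    # symbol 1 is typical of the later stage
posterior = hmm_forward([0, 1, 1, 1], init, trans, emit)
```

After a run of suspicious observations the posterior mass shifts almost entirely to the later stage, which matches the paper's observation that prediction sharpens once enough observations are available.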
arXiv Detail & Related papers (2026-01-11T01:01:10Z) - Preliminary Investigation into Uncertainty-Aware Attack Stage Classification [81.28215542218724]
This work addresses the problem of attack stage inference under uncertainty. We propose a classification approach based on Evidential Deep Learning (EDL), which models predictive uncertainty by outputting parameters of a Dirichlet distribution over possible stages. Preliminary experiments in a simulated environment demonstrate that the proposed model can accurately infer the stage of an attack with confidence.
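The Dirichlet parameterization that EDL (and the INSIGHT abstract above) relies on can be sketched in a few lines. In the common subjective-logic formulation, per-class evidence e_k yields alpha_k = e_k + 1, and total evidence S = sum(alpha) gives both the expected class probabilities and a vacuity-style uncertainty u = K / S; this is a generic sketch, not the paper's exact head:

```python
def dirichlet_uncertainty(evidence):
    """EDL-style uncertainty from non-negative per-class evidence:
    alpha_k = e_k + 1, S = sum(alpha),
    expected probability p_k = alpha_k / S, vacuity u = K / S."""
    K = len(evidence)
    alpha = [e + 1.0 for e in evidence]
    S = sum(alpha)
    probs = [a / S for a in alpha]
    u = K / S
    return probs, u

# Confident prediction: strong evidence concentrated on one stage.
p_conf, u_conf = dirichlet_uncertainty([40.0, 1.0, 1.0])
# Out-of-distribution input: almost no evidence for any stage.
p_ood, u_ood = dirichlet_uncertainty([0.1, 0.1, 0.1])
```

The appeal of this formulation is that low total evidence directly produces high uncertainty (u near 1), rather than the near-uniform but still overconfident softmax a standard classifier can emit on OOD inputs.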
arXiv Detail & Related papers (2025-08-01T06:58:00Z) - Anomalous Decision Discovery using Inverse Reinforcement Learning [3.3675535571071746]
Anomaly detection plays a critical role in Autonomous Vehicles (AVs) by identifying unusual behaviors through perception systems. Current approaches, which often rely on predefined thresholds or supervised learning paradigms, exhibit reduced efficacy when confronted with unseen scenarios. We present Trajectory-Reward Guided Adaptive Pre-training (TRAP), a novel IRL framework for anomaly detection.
arXiv Detail & Related papers (2025-07-06T17:01:02Z) - FADEL: Uncertainty-aware Fake Audio Detection with Evidential Deep Learning [9.960675988638805]
We propose a novel framework called fake audio detection with evidential learning (FADEL). FADEL incorporates model uncertainty into its predictions, thereby leading to more robust performance in OOD scenarios. We demonstrate the validity of uncertainty estimation by analyzing a strong correlation between average uncertainty and equal error rate (EER) across different spoofing algorithms.
arXiv Detail & Related papers (2025-04-22T07:40:35Z) - Sycophancy in Vision-Language Models: A Systematic Analysis and an Inference-Time Mitigation Framework [18.54098084470481]
We analyze sycophancy across vision-language benchmarks and propose an inference-time mitigation framework. Our framework effectively mitigates sycophancy across all evaluated models, while maintaining performance on neutral prompts.
arXiv Detail & Related papers (2024-08-21T01:03:21Z) - Selective Learning: Towards Robust Calibration with Dynamic Regularization [79.92633587914659]
Miscalibration in deep learning refers to a discrepancy between predicted confidence and actual performance.
We introduce Dynamic Regularization (DReg), which aims to learn what should be learned during training, thereby circumventing the confidence-adjustment trade-off.
arXiv Detail & Related papers (2024-02-13T11:25:20Z) - Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information [67.78183175605761]
Large Language Models are susceptible to adversarial prompt attacks.
This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs.
We introduce a novel approach to detecting adversarial prompts at a token level.
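The paper's full method also uses contextual information, but the perplexity side of token-level detection can be sketched simply: score each token by its negative log-likelihood under a language model and flag positions above a threshold. The probabilities and threshold below are hypothetical stand-ins for real LM outputs:

```python
import math

def flag_high_surprisal_tokens(token_probs, threshold_nll=4.0):
    """Return positions whose negative log-likelihood under the LM
    exceeds a threshold; adversarial suffixes tend to contain such
    improbable tokens while fluent text stays below it."""
    flags = []
    for i, p in enumerate(token_probs):
        nll = -math.log(p)
        if nll > threshold_nll:
            flags.append(i)
    return flags

# Hypothetical per-token probabilities: a fluent prefix (tokens 0-2)
# followed by a gibberish adversarial suffix (tokens 3-4).
probs = [0.4, 0.3, 0.5, 1e-4, 5e-5]
suspicious = flag_high_surprisal_tokens(probs)  # -> [3, 4]
```

Operating at the token level, rather than averaging perplexity over the whole prompt, is what lets the detector localize a short adversarial suffix appended to an otherwise benign request.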
arXiv Detail & Related papers (2023-11-20T03:17:21Z) - When Does Confidence-Based Cascade Deferral Suffice? [69.28314307469381]
Cascades are a classical strategy to enable inference cost to vary adaptively across samples.
A deferral rule determines whether to invoke the next classifier in the sequence, or to terminate prediction.
Despite being oblivious to the structure of the cascade, confidence-based deferral often works remarkably well in practice.
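The confidence-based deferral rule the paper studies is simple enough to sketch directly: each model in the cascade answers only if its top-class confidence clears a threshold, otherwise the input is passed on. The models and thresholds below are hypothetical stand-ins:

```python
def cascade_predict(x, models, thresholds):
    """Run classifiers in order; return early when the current model's
    top-class confidence clears its threshold, else defer to the next.
    The final model always answers. Each model maps x to a list of
    class probabilities."""
    for model, tau in zip(models[:-1], thresholds):
        probs = model(x)
        conf = max(probs)
        if conf >= tau:
            return probs.index(conf), conf, model
    probs = models[-1](x)
    conf = max(probs)
    return probs.index(conf), conf, models[-1]

# Hypothetical stand-ins: a cheap, unsure model and an accurate one.
cheap = lambda x: [0.55, 0.45]
accurate = lambda x: [0.05, 0.95]
label, conf, used = cascade_predict(None, [cheap, accurate], [0.8])
```

Here the cheap model's confidence (0.55) falls short of the 0.8 threshold, so the input is deferred to the accurate model; the rule never inspects downstream models' behavior, which is exactly the "oblivious to the structure of the cascade" property the paper examines.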
arXiv Detail & Related papers (2023-07-06T04:13:57Z) - Unsupervised sequence-to-sequence learning for automatic signal quality assessment in multi-channel electrical impedance-based hemodynamic monitoring [0.6875312133832077]
This study proposes an unsupervised sequence-to-sequence learning approach that automatically assesses the motion-induced reliability of the cardiac volume signal (CVS) in hemodynamic monitoring.
An encoder-decoder model is trained not only to self-reproduce an input sequence of the CVS but also to extrapolate the future in a parallel fashion.
A low-quality, motion-influenced CVS is detected from the residual between the input sequence and its neural reconstruction, with a cut-off value determined by the two-sigma rule of thumb over the training set.
arXiv Detail & Related papers (2023-05-16T11:52:06Z)
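The two-sigma cut-off described above is a one-liner in practice: compute the mean and standard deviation of reconstruction residuals on the training set, and flag any signal whose residual exceeds mean + 2*sigma. The residual values below are made up for illustration:

```python
def two_sigma_threshold(train_residuals):
    """Cut-off from the two-sigma rule over training-set residuals:
    mean + 2 * standard deviation (population std)."""
    n = len(train_residuals)
    mean = sum(train_residuals) / n
    var = sum((r - mean) ** 2 for r in train_residuals) / n
    return mean + 2.0 * var ** 0.5

def is_low_quality(residual, cutoff):
    """A signal whose reconstruction residual exceeds the cut-off is
    treated as motion-corrupted."""
    return residual > cutoff

# Hypothetical training residuals from clean, motion-free signals.
train = [0.10, 0.12, 0.11, 0.09, 0.13, 0.10]
cutoff = two_sigma_threshold(train)
```

Because the encoder-decoder is trained only on self-reproduction, no motion labels are needed: clean signals reconstruct well and stay under the cut-off, while motion-corrupted ones produce large residuals and are flagged.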
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.