Related papers: ProbeLLM: Automating Principled Diagnosis of LLM Failures

URL: http://arxiv.org/abs/2602.12966v1
Date: Fri, 13 Feb 2026 14:33:13 GMT
Title: ProbeLLM: Automating Principled Diagnosis of LLM Failures
Authors: Yue Huang, Zhengzhe Jiang, Yuchen Ma, Yu Jiang, Xiangqi Wang, Yujun Zhou, Yuexing Hao, Kehan Guo, Pin-Yu Chen, Stefan Feuerriegel, Xiangliang Zhang
Abstract summary: We propose ProbeLLM, a benchmark-agnostic automated probing framework that elevates weakness discovery from individual failures to structured failure modes. By restricting probing to verifiable test cases and leveraging tool-augmented generation and verification, ProbeLLM grounds failure discovery in reliable evidence.
Score: 89.44131968886184
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Understanding how and why large language models (LLMs) fail is becoming a central challenge as models rapidly evolve and static evaluations fall behind. While automated probing has been enabled by dynamic test generation, existing approaches often discover isolated failure cases, lack principled control over exploration, and provide limited insight into the underlying structure of model weaknesses. We propose ProbeLLM, a benchmark-agnostic automated probing framework that elevates weakness discovery from individual failures to structured failure modes. ProbeLLM formulates probing as a hierarchical Monte Carlo Tree Search, explicitly allocating limited probing budgets between global exploration of new failure regions and local refinement of recurring error patterns. By restricting probing to verifiable test cases and leveraging tool-augmented generation and verification, ProbeLLM grounds failure discovery in reliable evidence. Discovered failures are further consolidated into interpretable failure modes via failure-aware embeddings and boundary-aware induction. Across diverse benchmarks and LLMs, ProbeLLM reveals substantially broader, cleaner, and more fine-grained failure landscapes than static benchmarks and prior automated methods, supporting a shift from case-centric evaluation toward principled weakness discovery.
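Illustration: the budget allocation described in the abstract (global exploration of new failure regions versus local refinement of recurring error patterns) can be pictured with a generic UCB1-based tree search. The following is a minimal sketch under stated assumptions, not ProbeLLM's implementation: the names Node, ucb1, select_child, run_probing, the failure_rate callback, and the exploration constant c are all hypothetical, and the reward is a deterministic stand-in for "did this probe expose a verified failure".

```python
import math


class Node:
    """One region of the probing space (hypothetical: a topic or sub-topic)."""

    def __init__(self, name):
        self.name = name
        self.visits = 0
        self.total_reward = 0.0
        self.children = []

    def mean(self):
        # Average reward observed so far; 0.0 for an unvisited node.
        return self.total_reward / self.visits if self.visits else 0.0


def ucb1(mean_reward, visits, parent_visits, c=1.4):
    """UCB1 score: observed mean plus a bonus for rarely visited nodes."""
    if visits == 0:
        return math.inf  # always try an unvisited region first
    return mean_reward + c * math.sqrt(math.log(parent_visits) / visits)


def select_child(node, c=1.4):
    # Pick the child with the highest UCB1 score (first one wins on ties).
    return max(
        node.children,
        key=lambda ch: ucb1(ch.mean(), ch.visits, node.visits, c),
    )


def run_probing(root, failure_rate, budget, c=1.4):
    """Spend `budget` probes, descending from the root to a leaf each time.

    `failure_rate(name)` is an assumed callback returning a reward in [0, 1];
    in a real system it would generate a test case, query the model, and
    verify the answer.
    """
    for _ in range(budget):
        node, path = root, [root]
        while node.children:
            node = select_child(node, c)
            path.append(node)
        reward = failure_rate(node.name)
        for visited in path:  # back up the reward along the path
            visited.visits += 1
            visited.total_reward += reward
    return root
```

In this sketch, a larger c shifts budget toward unexplored regions, while a smaller c concentrates probes on regions that have already produced failures.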

Related papers

HyperNet-Adaptation for Diffusion-Based Test Case Generation [2.0430493421725076]
We present HyNeA, a generative testing method that enables direct and efficient control over diffusion-based generation. This approach enables the targeted generation of realistic failure cases at substantially lower computational cost than search-based methods.
arXiv Detail & Related papers (2026-01-21T14:45:15Z)
Contamination Detection for VLMs using Multi-Modal Semantic Perturbation [73.76465227729818]
Open-source Vision-Language Models (VLMs) have achieved state-of-the-art performance on benchmark tasks. Pretraining corpora raise a critical concern for both practitioners and users: inflated performance due to test-set leakage. We show that existing detection approaches either fail outright or exhibit inconsistent behavior. We propose a novel simple yet effective detection method based on multi-modal semantic perturbation.
arXiv Detail & Related papers (2025-11-05T18:59:52Z)
When One Modality Sabotages the Others: A Diagnostic Lens on Multimodal Reasoning [22.39245479538899]
We introduce modality sabotage, a diagnostic failure mode in which a high-confidence unimodal error overrides other evidence and misleads the fused result. A model-agnostic evaluation layer treats each modality as an agent, producing candidate labels and a brief self-assessment used for auditing. A simple fusion mechanism aggregates these outputs, exposing contributors (modalities supporting correct outcomes) and saboteurs (modalities that mislead).
arXiv Detail & Related papers (2025-11-04T18:20:13Z)
Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics [89.1999907891494]
We present WebDetective, a benchmark of hint-free multi-hop questions paired with a controlled Wikipedia sandbox. Our evaluation of 25 state-of-the-art models reveals systematic weaknesses across all architectures. We develop an agentic workflow, EvidenceLoop, that explicitly targets the challenges our benchmark identifies.
arXiv Detail & Related papers (2025-10-01T07:59:03Z)
CLIP Meets Diffusion: A Synergistic Approach to Anomaly Detection [49.11819337853632]
Anomaly detection is a complex problem due to the ambiguity in defining anomalies, the diversity of anomaly types, and the scarcity of training data. We propose CLIPfusion, a method that leverages both discriminative and generative foundation models. We believe that our method underscores the effectiveness of multi-modal and multi-model fusion in tackling the multifaceted challenges of anomaly detection.
arXiv Detail & Related papers (2025-06-13T13:30:15Z)
TrustLoRA: Low-Rank Adaptation for Failure Detection under Out-of-distribution Data [62.22804234013273]
We propose a simple failure detection framework to unify and facilitate classification with rejection under both covariate and semantic shifts. Our key insight is that by separating and consolidating failure-specific reliability knowledge with low-rank adapters, we can enhance the failure detection ability effectively and flexibly.
arXiv Detail & Related papers (2025-04-20T09:20:55Z)
Causality can systematically address the monsters under the bench(marks) [64.36592889550431]
Benchmarks are plagued by various biases, artifacts, or leakage. Models may behave unreliably due to poorly explored failure modes. Causality offers an ideal framework to systematically address these challenges.
arXiv Detail & Related papers (2025-02-07T17:01:37Z)
Scaling Flaws of Verifier-Guided Search in Mathematical Reasoning [16.824343439487617]
Large language models (LLMs) struggle with multi-step reasoning, where inference-time scaling has emerged as a promising strategy for performance improvement. Verifier-guided search outperforms repeated sampling when sample size is limited by selecting and prioritizing valid reasoning paths. As sample size increases, verifier-guided search exhibits diminishing advantages and eventually underperforms repeated sampling.
arXiv Detail & Related papers (2025-02-01T02:08:49Z)
Degradation Modeling and Prognostic Analysis Under Unknown Failure Modes [17.72961616186932]
Operating units often experience various failure modes in complex systems. Current prognostic approaches either ignore failure modes during degradation or assume known failure mode labels. High dimensionality and complex relations of sensor signals make it challenging to identify the failure modes accurately.
arXiv Detail & Related papers (2024-02-29T15:57:09Z)
PAGER: A Framework for Failure Analysis of Deep Regression Models [27.80057763697904]
We introduce PAGER (Principled Analysis of Generalization Errors in Regressors), a framework to systematically detect and characterize failures in deep regressors. Built upon the principle of anchored training in deep models, PAGER unifies both epistemic uncertainty and complementary manifold non-conformity scores to accurately organize samples into different risk regimes.
arXiv Detail & Related papers (2023-09-20T00:37:35Z)
LafitE: Latent Diffusion Model with Feature Editing for Unsupervised Multi-class Anomaly Detection [12.596635603629725]
We develop a unified model to detect anomalies from objects belonging to multiple classes when only normal data is accessible. We first explore the generative-based approach and investigate latent diffusion models for reconstruction. We introduce a feature editing strategy that modifies the input feature space of the diffusion model to further alleviate "identity shortcuts".
arXiv Detail & Related papers (2023-07-16T14:41:22Z)
Understanding Classifier Mistakes with Generative Models [88.20470690631372]
Deep neural networks are effective on supervised learning tasks, but have been shown to be brittle. In this paper, we leverage generative models to identify and characterize instances where classifiers fail to generalize. Our approach is agnostic to class labels from the training set which makes it applicable to models trained in a semi-supervised way.
arXiv Detail & Related papers (2020-10-05T22:13:21Z)

This list is automatically generated from the titles and abstracts of the papers on this site.