Enhancing Fault Detection for Large Language Models via Mutation-Based Confidence Smoothing
- URL: http://arxiv.org/abs/2404.14419v1
- Date: Sun, 14 Apr 2024 07:06:12 GMT
- Title: Enhancing Fault Detection for Large Language Models via Mutation-Based Confidence Smoothing
- Authors: Qiang Hu, Jin Wen, Maxime Cordy, Yuheng Huang, Xiaofei Xie, Lei Ma
- Abstract summary: Quickly revealing faults in large language models (LLMs) is important but challenging.
Existing fault detection methods cannot perform well on LLMs.
We propose MuCS, a prompt Mutation-based prediction Confidence Smoothing method for LLMs.
- Score: 24.55745161068782
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have achieved great success in multiple application domains and have recently attracted huge attention from different research communities. Unfortunately, even for the best LLM, there still exist many faults, i.e., inputs the LLM cannot predict correctly. Such faults harm the usability of LLMs. Quickly revealing them is important but challenging, for two reasons: 1) the heavy labeling effort required to prepare test data, and 2) the monetary cost of accessing closed-source LLMs such as GPT4. To handle this problem, the traditional deep learning testing field has proposed test selection methods that test deep learning models efficiently by prioritizing faults. However, the usefulness of these methods on LLMs is unclear and underexplored. In this paper, we first study the effectiveness of existing fault detection methods for LLMs. Experimental results on four different tasks (including both code tasks and natural language processing tasks) and four LLMs (e.g., LLaMA and GPT4) demonstrate that existing fault detection methods do not perform well on LLMs (e.g., seven out of eight methods perform worse than random selection on LLaMA). To enhance existing fault detection methods, we propose MuCS, a prompt Mutation-based prediction Confidence Smoothing method for LLMs. Concretely, we mutate the prompts and compute the average prediction confidence of all mutants as the input of fault detection methods. The results show that our proposed solution significantly enhances existing methods, improving test relative coverage by up to 97.64%.
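As a rough illustration of the smoothing step described in the abstract, the Python sketch below mutates a prompt with simple, meaning-preserving edits and averages the per-mutant prediction confidence. The mutation operators and the `predict_proba` callable are illustrative assumptions, not the paper's exact implementation.

```python
import random
from statistics import mean

# Illustrative, meaning-preserving prompt mutations; the paper's actual
# mutation operators may differ.
MUTATORS = [
    lambda s: " " + s,                            # add leading whitespace
    lambda s: s + "\n",                           # add trailing newline
    lambda s: s.replace(". ", ".  "),             # widen sentence spacing
    lambda s: s[:1].lower() + s[1:] if s else s,  # lowercase first character
]

def smoothed_confidence(predict_proba, prompt, n_mutants=8, seed=0):
    """Average the prediction confidence over a prompt and its mutants.

    predict_proba is an assumed callable mapping a prompt to a list of
    label probabilities (e.g., softmax scores from the model under test).
    """
    rng = random.Random(seed)
    mutants = [rng.choice(MUTATORS)(prompt) for _ in range(n_mutants)]
    confidences = [max(predict_proba(p)) for p in [prompt, *mutants]]
    return mean(confidences)  # smoothed score fed to a fault-detection method
```

A test selection method (e.g., confidence-based prioritization) would then rank inputs by this smoothed score instead of the raw single-prompt confidence.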
Related papers
- LLM-Crowdsourced: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models [13.713870642186254]
Large language models (LLMs) demonstrate remarkable capabilities across various tasks. Existing evaluation methods suffer from issues such as data contamination, black-box operation, and subjective preference. We propose a novel benchmark-free evaluation paradigm, LLM-Crowdsourced.
arXiv Detail & Related papers (2025-07-30T03:50:46Z)
- ReliableMath: Benchmark of Reliable Mathematical Reasoning on Large Language Models [70.33764118171463]
Large Language Models (LLMs) tend to fabricate unreliable responses when confronted with problems that are unsolvable or beyond their capability. We develop a ReliableMath dataset which incorporates open-source solvable problems and high-quality unsolvable problems. LLMs fail to directly identify unsolvable problems and always generate fabricated responses.
arXiv Detail & Related papers (2025-07-03T19:19:44Z)
- Smaller = Weaker? Benchmarking Robustness of Quantized LLMs in Code Generation [7.262231066394782]
Quantization has emerged as a mainstream method for compressing Large Language Models (LLMs). We present the first systematic investigation of how quantization affects the robustness of LLMs in code generation tasks. Our findings challenge conventional wisdom by demonstrating that quantized LLMs often exhibit superior robustness compared to their full-precision counterparts.
arXiv Detail & Related papers (2025-06-28T06:32:25Z)
- Comparing Uncertainty Measurement and Mitigation Methods for Large Language Models: A Systematic Review [11.856357456956351]
Large Language Models (LLMs) have been transformative across many domains.
Researchers have applied Uncertainty Quantification (UQ) to measure uncertainty and employed calibration techniques to address the misalignment between uncertainty and accuracy.
This survey is the first dedicated study to review the calibration methods and relevant metrics for LLMs.
arXiv Detail & Related papers (2025-04-25T13:34:40Z)
- Enhancing Fault Localization Through Ordered Code Analysis with LLM Agents and Self-Reflection [8.22737389683156]
Large Language Models (LLMs) offer promising improvements in fault localization by enhancing code comprehension and reasoning.
We introduce LLM4FL, a novel LLM-agent-based fault localization approach that integrates SBFL rankings with a divide-and-conquer strategy.
Our results demonstrate that LLM4FL outperforms AutoFL by 19.27% in Top-1 accuracy and surpasses state-of-the-art supervised techniques such as DeepFL and Grace.
arXiv Detail & Related papers (2024-09-20T16:47:34Z)
- Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses.
Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives.
The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z)
- AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models [95.09157454599605]
Large Language Models (LLMs) are becoming increasingly powerful, but they still exhibit significant but subtle weaknesses.
Traditional benchmarking approaches cannot thoroughly pinpoint specific model deficiencies.
We introduce a unified framework, AutoDetect, to automatically expose weaknesses in LLMs across various tasks.
arXiv Detail & Related papers (2024-06-24T15:16:45Z)
- UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions [10.28688988951815]
UBENCH is a benchmark for evaluating uncertainty in large language models.
It includes 3,978 multiple-choice questions covering knowledge, language, understanding, and reasoning abilities.
We also evaluate the reliability of 15 popular LLMs, finding GLM4 to be the most outstanding.
arXiv Detail & Related papers (2024-06-18T16:50:38Z)
- Are Large Language Models Good Statisticians? [10.42853117200315]
StatQA is a new benchmark designed for statistical analysis tasks.
We show that even state-of-the-art models such as GPT-4o achieve a best performance of only 64.83%.
While open-source LLMs show limited capability, fine-tuned variants exhibit marked improvements.
arXiv Detail & Related papers (2024-06-12T02:23:51Z)
- LLM Inference Unveiled: Survey and Roofline Model Insights [62.92811060490876]
Large Language Model (LLM) inference is rapidly evolving, presenting a unique blend of opportunities and challenges.
Our survey stands out from traditional literature reviews by not only summarizing the current state of research but also by introducing a framework based on the roofline model.
This framework identifies the bottlenecks when deploying LLMs on hardware devices and provides a clear understanding of practical problems.
arXiv Detail & Related papers (2024-02-26T07:33:05Z)
- Benchmarking LLMs via Uncertainty Quantification [91.72588235407379]
The proliferation of open-source Large Language Models (LLMs) has highlighted the urgent need for comprehensive evaluation methods.
We introduce a new benchmarking approach for LLMs that integrates uncertainty quantification.
Our findings reveal that: I) LLMs with higher accuracy may exhibit lower certainty; II) Larger-scale LLMs may display greater uncertainty compared to their smaller counterparts; and III) Instruction-finetuning tends to increase the uncertainty of LLMs.
arXiv Detail & Related papers (2024-01-23T14:29:17Z)
- Knowing What LLMs DO NOT Know: A Simple Yet Effective Self-Detection Method [36.24876571343749]
Large Language Models (LLMs) have shown great potential in Natural Language Processing (NLP) tasks.
Recent literature reveals that LLMs generate nonfactual responses intermittently.
We propose a novel self-detection method to identify the questions an LLM does not know, which are prone to yield nonfactual results.
arXiv Detail & Related papers (2023-10-27T06:22:14Z)
- Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity [61.54815512469125]
This survey addresses the crucial issue of factuality in Large Language Models (LLMs).
As LLMs find applications across diverse domains, the reliability and accuracy of their outputs become vital.
arXiv Detail & Related papers (2023-10-11T14:18:03Z)
- Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations better reveal how comprehensively language models understand the questions they are asked.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
- Certified Robustness for Large Language Models with Self-Denoising [42.916661225753145]
We propose to denoise the corrupted inputs with large language models (LLMs) in a self-denoising manner.
Our method outperforms the existing certification methods under both certified robustness and empirical robustness.
arXiv Detail & Related papers (2023-07-14T05:40:24Z)
- Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs [60.61002524947733]
Previous confidence elicitation methods rely on white-box access to internal model information or model fine-tuning.
This leads to a growing need to explore the untapped area of black-box approaches for uncertainty estimation.
We define a systematic framework with three components: prompting strategies for eliciting verbalized confidence, sampling methods for generating multiple responses, and aggregation techniques for computing consistency (a minimal sketch of the aggregation step appears after this list).
arXiv Detail & Related papers (2023-06-22T17:31:44Z)
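To make the aggregation component of the black-box confidence-elicitation framework above concrete, here is a minimal self-consistency sketch: sample several answers to the same prompt and take the majority answer's agreement rate as the confidence. The `sample_answer` callable is an assumed stand-in for a single LLM query; the paper compares several such sampling and aggregation strategies, and this shows only the simplest one.

```python
from collections import Counter

def consistency_confidence(sample_answer, prompt, n_samples=10):
    """Estimate black-box confidence via self-consistency.

    sample_answer is an assumed callable that issues one LLM query for the
    prompt (e.g., at nonzero temperature) and returns the extracted answer.
    """
    answers = [sample_answer(prompt) for _ in range(n_samples)]
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer, count / n_samples  # agreement rate as confidence
```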
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.