Assessing Hidden Risks of LLMs: An Empirical Study on Robustness,
Consistency, and Credibility
- URL: http://arxiv.org/abs/2305.10235v4
- Date: Wed, 30 Aug 2023 04:32:36 GMT
- Title: Assessing Hidden Risks of LLMs: An Empirical Study on Robustness,
Consistency, and Credibility
- Authors: Wentao Ye, Mingfeng Ou, Tianyi Li, Yipeng Chen, Xuetao Ma, Yifan
Yanggong, Sai Wu, Jie Fu, Gang Chen, Haobo Wang, Junbo Zhao
- Abstract summary: We conduct over a million queries to mainstream large language models (LLMs), including ChatGPT, LLaMA, and OPT.
We find that ChatGPT is still capable of yielding the correct answer even when the input is polluted to an extreme level.
We propose a novel index associated with a dataset that roughly indicates the feasibility of using such data for LLM-involved evaluation.
- Score: 37.682136465784254
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent popularity of large language models (LLMs) has brought a
significant impact to a wide range of fields, particularly through their
open-ended ecosystem of APIs, open-source models, and plugins. However, despite
their widespread deployment, there is a general lack of research that
thoroughly discusses and analyzes the risks these systems conceal. To that end,
we conduct a preliminary but pioneering study covering the robustness,
consistency, and credibility of LLM systems. With most of the related
literature in the LLM era still uncharted, we propose an automated workflow
that copes with a large-scale volume of queries and responses. Overall, we
conduct over a million queries to mainstream LLMs, including ChatGPT, LLaMA,
and OPT. The core of our workflow consists of a data primitive, followed by an
automated interpreter that evaluates these LLMs under different adversarial
metric systems. As a result, we draw several, and perhaps unfortunate,
conclusions that are rather uncommon in this fast-moving community. Briefly,
they are: (i) minor but inevitable errors in user-generated query input may,
by chance, cause an LLM to respond unexpectedly; (ii) LLMs exhibit poor
consistency when processing semantically similar query inputs. In addition, as
a side finding, we observe that ChatGPT can still yield the correct answer
even when the input is polluted to an extreme level. While this phenomenon
demonstrates the powerful memorization of LLMs, it raises serious concerns
about using such data for LLM-involved evaluation in academic development. To
address this, we propose a novel index associated with a dataset that roughly
indicates the feasibility of using such data for LLM-involved evaluation.
Extensive empirical studies are provided to support the aforementioned claims.
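
The abstract does not detail how the workflow is implemented. As a rough illustration only, the sketch below shows what the two reported failure modes (sensitivity to input noise and inconsistency across paraphrases) could look like as automated probes; query_llm, perturb, robustness_score, and consistency_score are hypothetical names introduced here and are not the authors' code or data primitives.

```python
import random
import string

def perturb(query: str, rate: float = 0.05) -> str:
    """Inject character-level noise (random letter swaps) into a query.

    A stand-in for the 'minor but inevitable errors' in user input that the
    paper studies; the authors' actual perturbation operators are not
    reproduced here.
    """
    chars = list(query)
    for i, c in enumerate(chars):
        if c.isalpha() and random.random() < rate:
            chars[i] = random.choice(string.ascii_lowercase)
    return "".join(chars)

def query_llm(prompt: str) -> str:
    """Hypothetical LLM call (e.g. ChatGPT, LLaMA, or OPT); wire this up to a
    real API or local inference backend before use."""
    raise NotImplementedError

def robustness_score(query: str, reference: str, trials: int = 10) -> float:
    """Fraction of noisy variants for which the model still returns the
    reference answer (exact match as a crude automated interpreter)."""
    hits = sum(
        query_llm(perturb(query)).strip().lower() == reference.strip().lower()
        for _ in range(trials)
    )
    return hits / trials

def consistency_score(paraphrases: list[str]) -> float:
    """Fraction of semantically similar paraphrases whose answers agree with
    the answer to the first paraphrase."""
    answers = [query_llm(p).strip().lower() for p in paraphrases]
    return sum(a == answers[0] for a in answers) / len(answers)
```

Under such a setup, low scores over a large query set would correspond to findings (i) and (ii) above, while near-perfect scores under extreme input pollution would hint at the memorization and contamination issue that the proposed dataset index is meant to flag.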
Related papers
- Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses.
Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives.
The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z) - CLAMBER: A Benchmark of Identifying and Clarifying Ambiguous Information Needs in Large Language Models [60.59638232596912]
We introduce CLAMBER, a benchmark for evaluating large language models (LLMs).
Building upon the taxonomy, we construct 12K high-quality data samples to assess the strengths, weaknesses, and potential risks of various off-the-shelf LLMs.
Our findings indicate the limited practical utility of current LLMs in identifying and clarifying ambiguous user queries.
arXiv Detail & Related papers (2024-05-20T14:34:01Z) - Benchmarking LLMs on the Semantic Overlap Summarization Task [9.656095701778975]
This paper comprehensively evaluates Large Language Models (LLMs) on the Semantic Overlap Summarization (SOS) task.
We report well-established metrics such as ROUGE, BERTScore, and SEM-F1 on two different datasets of alternative narratives.
arXiv Detail & Related papers (2024-02-26T20:33:50Z) - Small Models, Big Insights: Leveraging Slim Proxy Models To Decide When and What to Retrieve for LLMs [60.40396361115776]
This paper introduces a novel collaborative approach, namely SlimPLM, that detects missing knowledge in large language models (LLMs) with a slim proxy model.
We employ a proxy model with far fewer parameters and take its answers as heuristic answers.
Heuristic answers are then utilized to predict the knowledge required to answer the user question, as well as the known and unknown knowledge within the LLM.
arXiv Detail & Related papers (2024-02-19T11:11:08Z) - Breaking the Silence: the Threats of Using LLMs in Software Engineering [12.368546216271382]
Large Language Models (LLMs) have gained considerable traction within the Software Engineering (SE) community.
This paper initiates an open discussion on potential threats to the validity of LLM-based research.
arXiv Detail & Related papers (2023-12-13T11:02:19Z) - LM-Polygraph: Uncertainty Estimation for Language Models [71.21409522341482]
Uncertainty estimation (UE) methods are one path to safer, more responsible, and more effective use of large language models (LLMs).
We introduce LM-Polygraph, a framework with implementations of a battery of state-of-the-art UE methods for LLMs in text generation tasks, with unified program interfaces in Python.
It introduces an extendable benchmark for consistent evaluation of UE techniques by researchers, and a demo web application that enriches the standard chat dialog with confidence scores.
arXiv Detail & Related papers (2023-11-13T15:08:59Z) - Survey on Factuality in Large Language Models: Knowledge, Retrieval and
Domain-Specificity [61.54815512469125]
This survey addresses the crucial issue of factuality in Large Language Models (LLMs).
As LLMs find applications across diverse domains, the reliability and accuracy of their outputs become vital.
arXiv Detail & Related papers (2023-10-11T14:18:03Z) - FELM: Benchmarking Factuality Evaluation of Large Language Models [40.78878196872095]
We introduce a benchmark for Factuality Evaluation of large Language Models, referred to as felm.
We collect responses generated from large language models and annotate factuality labels in a fine-grained manner.
Our findings reveal that while retrieval aids factuality evaluation, current LLMs are far from satisfactory at faithfully detecting factual errors.
arXiv Detail & Related papers (2023-10-01T17:37:31Z) - Large Language Models are Not Yet Human-Level Evaluators for Abstractive
Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not yet ready to replace human evaluators.
arXiv Detail & Related papers (2023-05-22T14:58:13Z) - Causal Reasoning and Large Language Models: Opening a New Frontier for Causality [29.433401785920065]
Large language models (LLMs) can generate causal arguments with high probability.
LLMs may be used by human domain experts to save effort in setting up a causal analysis.
arXiv Detail & Related papers (2023-04-28T19:00:43Z)