LM-Polygraph: Uncertainty Estimation for Language Models
- URL: http://arxiv.org/abs/2311.07383v1
- Date: Mon, 13 Nov 2023 15:08:59 GMT
- Title: LM-Polygraph: Uncertainty Estimation for Language Models
- Authors: Ekaterina Fadeeva, Roman Vashurin, Akim Tsvigun, Artem Vazhentsev,
Sergey Petrakov, Kirill Fedyanin, Daniil Vasilev, Elizaveta Goncharova,
Alexander Panchenko, Maxim Panov, Timothy Baldwin, Artem Shelmanov
- Abstract summary: Uncertainty estimation (UE) methods are one path to safer, more responsible, and more effective use of large language models (LLMs).
We introduce LM-Polygraph, a framework with implementations of a battery of state-of-the-art UE methods for LLMs in text generation tasks, with unified program interfaces in Python.
It introduces an extendable benchmark for consistent evaluation of UE techniques by researchers, and a demo web application that enriches the standard chat dialog with confidence scores.
- Score: 71.21409522341482
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in the capabilities of large language models (LLMs) have
paved the way for a myriad of groundbreaking applications in various fields.
However, a significant challenge arises as these models often "hallucinate",
i.e., fabricate facts without providing users an apparent means to discern the
veracity of their statements. Uncertainty estimation (UE) methods are one path
to safer, more responsible, and more effective use of LLMs. However, to date,
research on UE methods for LLMs has been focused primarily on theoretical
rather than engineering contributions. In this work, we tackle this issue by
introducing LM-Polygraph, a framework with implementations of a battery of
state-of-the-art UE methods for LLMs in text generation tasks, with unified
program interfaces in Python. Additionally, it introduces an extendable
benchmark for consistent evaluation of UE techniques by researchers, and a demo
web application that enriches the standard chat dialog with confidence scores,
empowering end-users to discern unreliable responses. LM-Polygraph is
compatible with the most recent LLMs, including BLOOMz, LLaMA-2, ChatGPT, and
GPT-4, and is designed to support future releases of similarly-styled LMs.
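To make the idea of a confidence score concrete, here is a minimal sketch of two generic uncertainty scores computed from token probabilities: the negative log-likelihood of a generated sequence and the mean token entropy. It is an illustration only and does not use or reproduce LM-Polygraph's interfaces. The function names and the toy probabilities are hypothetical.

```python
import math
from typing import Sequence


def sequence_nll(token_probs: Sequence[float]) -> float:
    """Negative log-likelihood of a generated sequence.

    token_probs holds the probability the model assigned to each token it
    actually generated. A higher value means a less likely sequence, which is
    read here as higher uncertainty.
    """
    return -sum(math.log(p) for p in token_probs)


def mean_token_entropy(distributions: Sequence[Sequence[float]]) -> float:
    """Average Shannon entropy (in nats) of the per-step token distributions.

    Each inner sequence is a full probability distribution over the vocabulary
    at one generation step. Zero-probability entries contribute nothing.
    """
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0.0)
        for dist in distributions
    ]
    return sum(entropies) / len(entropies)


# Toy example (made-up numbers): a confident answer and a hesitant one.
confident = [0.9, 0.95, 0.9]
hesitant = [0.4, 0.3, 0.5]

print(sequence_nll(confident) < sequence_nll(hesitant))  # confident scores lower
print(mean_token_entropy([[0.5, 0.5], [1.0, 0.0]]))      # average of ln(2) and 0
```

Scores like these only need the model's output probabilities, so they are cheap to compute. Their scale depends on sequence length and vocabulary, so a threshold for flagging a response as unreliable would have to be chosen empirically.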
Related papers
- Detecting Hallucinations in Large Language Model Generation: A Token Probability Approach [0.0]
Large Language Models (LLMs) produce inaccurate outputs, also known as hallucinations.
This paper introduces a supervised learning approach employing only four numerical features derived from tokens and vocabulary probabilities obtained from other evaluators.
The method yields promising results, surpassing state-of-the-art outcomes in multiple tasks across three different benchmarks.
arXiv Detail & Related papers (2024-05-30T03:00:47Z)
- Evaluation of LLMs on Syntax-Aware Code Fill-in-the-Middle Tasks [12.629516072317331]
Syntax-Aware Fill-In-the-Middle (SAFIM) is a new benchmark for evaluating Large Language Models (LLMs) on the code Fill-in-the-Middle (FIM) task.
This benchmark focuses on syntax-aware completions of program structures such as code blocks and conditional expressions.
arXiv Detail & Related papers (2024-03-07T05:05:56Z)
- Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding [78.36702055076456]
This paper introduces Multi-scale Positional Encoding (Ms-PoE), a simple yet effective plug-and-play approach to enhance the capacity of LLMs to handle relevant information located in the middle of the context.
arXiv Detail & Related papers (2024-03-05T04:58:37Z)
- Large Language Models: A Survey [69.72787936480394]
Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks.
LLMs acquire their general-purpose language understanding and generation abilities by training billions of model parameters on massive amounts of text data.
arXiv Detail & Related papers (2024-02-09T05:37:09Z)
- Supervised Knowledge Makes Large Language Models Better In-context Learners [94.89301696512776]
Large Language Models (LLMs) exhibit emerging in-context learning abilities through prompt engineering.
The challenge of improving the generalizability and factuality of LLMs in natural language understanding and question answering remains under-explored.
We propose a framework that enhances the reliability of LLMs as it: 1) generalizes out-of-distribution data, 2) elucidates how LLMs benefit from discriminative models, and 3) minimizes hallucinations in generative tasks.
arXiv Detail & Related papers (2023-12-26T07:24:46Z)
- Beyond Text: Unveiling Multimodal Proficiency of Large Language Models with MultiAPI Benchmark [11.572835837392867]
This study introduces MultiAPI, a pioneering comprehensive large-scale API benchmark dataset.
It consists of 235 diverse API calls and 2,038 contextual prompts, offering a unique platform for evaluating tool-augmented LLMs on multimodal tasks.
Our findings reveal that while LLMs demonstrate proficiency in API call decision-making, they face challenges in domain identification, function selection, and argument generation.
arXiv Detail & Related papers (2023-11-21T23:26:05Z)
- A Survey on Multimodal Large Language Models [71.63375558033364]
Multimodal Large Language Models (MLLMs), represented by GPT-4V, have become a rising research hotspot.
This paper aims to trace and summarize the recent progress of MLLMs.
arXiv Detail & Related papers (2023-06-23T15:21:52Z)
- Assessing Hidden Risks of LLMs: An Empirical Study on Robustness, Consistency, and Credibility [37.682136465784254]
We conduct over a million queries to the mainstream large language models (LLMs) including ChatGPT, LLaMA, and OPT.
We find that ChatGPT is still capable of yielding the correct answer even when the input is polluted at an extreme level.
We propose a novel index associated with a dataset that roughly decides the feasibility of using such data for LLM-involved evaluation.
arXiv Detail & Related papers (2023-05-15T15:44:51Z)
- Augmented Language Models: a Survey [55.965967655575454]
This survey reviews works in which language models (LMs) are augmented with reasoning skills and the ability to use tools.
We refer to them as Augmented Language Models (ALMs).
The missing token objective allows ALMs to learn to reason, use tools, and even act, while still performing standard natural language tasks.
arXiv Detail & Related papers (2023-02-15T18:25:52Z)