LLMMaps -- A Visual Metaphor for Stratified Evaluation of Large Language
Models
- URL: http://arxiv.org/abs/2304.00457v3
- Date: Thu, 12 Oct 2023 11:35:35 GMT
- Title: LLMMaps -- A Visual Metaphor for Stratified Evaluation of Large Language
Models
- Authors: Patrik Puchert, Poonam Poonam, Christian van Onzenoodt, Timo Ropinski
- Abstract summary: Large Language Models (LLMs) have revolutionized natural language processing and demonstrated impressive capabilities in various tasks.
They are prone to hallucinations, where the model produces incorrect or false information in its responses.
We propose LLMMaps as a novel visualization technique that enables users to evaluate LLMs' performance with respect to Q&A datasets.
- Score: 13.659853119356507
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have revolutionized natural language processing
and demonstrated impressive capabilities in various tasks. Unfortunately, they
are prone to hallucinations, where the model produces incorrect or false
information in its responses, which renders diligent evaluation approaches
mandatory. While LLM performance in specific knowledge fields is often
evaluated based on question and answer (Q&A) datasets, such evaluations usually
report only a single accuracy number for the dataset, which often covers an
entire field. This field-based evaluation is problematic with respect to
transparency and model improvement. A stratified evaluation could instead
reveal subfields, where hallucinations are more likely to occur and thus help
to better assess LLMs' risks and guide their further development. To support
such stratified evaluations, we propose LLMMaps as a novel visualization
technique that enables users to evaluate LLMs' performance with respect to Q&A
datasets. LLMMaps provide detailed insights into LLMs' knowledge capabilities
in different subfields, by transforming Q&A datasets as well as LLM responses
into an internal knowledge structure. An extension for comparative
visualization furthermore allows for the detailed comparison of multiple LLMs.
To assess LLMMaps, we use them to conduct a comparative analysis of several
state-of-the-art LLMs, such as BLOOM, GPT-2, GPT-3, ChatGPT, and LLaMa-13B, and
we additionally conduct two qualitative user evaluations. All source code and
data necessary for generating LLMMaps, for use in scientific publications and
elsewhere, are available on GitHub: https://github.com/viscom-ulm/LLMMaps
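The stratified evaluation that LLMMaps visualizes boils down to scoring a Q&A dataset per subfield of a knowledge hierarchy rather than as a single overall number. Below is a minimal Python sketch of that aggregation step; the record format with a subfield_path and an is_correct flag is an illustrative assumption, not the authors' actual data schema.

    from collections import defaultdict

    def stratified_accuracy(results):
        """Aggregate Q&A results into per-subfield accuracies.

        Each result is assumed to carry:
          - "subfield_path": hierarchy path, e.g. ("Science", "Physics", "Optics")
          - "is_correct":    whether the LLM answered the question correctly
        Accuracy is accumulated at every level of the hierarchy, so each node
        of the knowledge structure receives its own score.
        """
        counts = defaultdict(lambda: [0, 0])  # node -> [correct, total]
        for r in results:
            path = tuple(r["subfield_path"])
            for depth in range(1, len(path) + 1):
                counts[path[:depth]][0] += int(r["is_correct"])
                counts[path[:depth]][1] += 1
        return {node: c / t for node, (c, t) in counts.items()}

    # A single field-level number of 0.5 would hide that the model is perfect
    # on Optics but fails every Thermodynamics question.
    results = [
        {"subfield_path": ("Science", "Physics", "Optics"), "is_correct": True},
        {"subfield_path": ("Science", "Physics", "Thermodynamics"), "is_correct": False},
    ]
    print(stratified_accuracy(results))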
Related papers
- What do Large Language Models Need for Machine Translation Evaluation? [12.42394213466485]
Large language models (LLMs) can achieve results comparable to fine-tuned multilingual pre-trained language models.
This paper explores what translation information, such as the source, reference, translation errors and annotation guidelines, is needed for LLMs to evaluate machine translation quality.
arXiv Detail & Related papers (2024-10-04T09:50:45Z)
- DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity.
Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data.
Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z)
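As a toy illustration of the DARG idea, the sketch below represents a benchmark item as a small arithmetic reasoning graph and perturbs it by expanding one leaf into an extra addition, yielding a harder variant with a recomputable ground truth. The graph encoding and the perturbation rule are assumptions made for the sketch; DARG's actual graph extraction and perturbation are LLM-driven.

    import random

    def eval_graph(graph, root="answer"):
        """Evaluate an arithmetic reasoning graph: leaves map to numbers,
        internal nodes map to ("+" or "*", child_a, child_b)."""
        spec = graph[root]
        if not isinstance(spec, tuple):
            return spec
        op, a, b = spec
        left, right = eval_graph(graph, a), eval_graph(graph, b)
        return left + right if op == "+" else left * right

    def perturb(graph, rng):
        """Raise complexity by one step: expand a random leaf into an extra sum."""
        g = dict(graph)
        leaf = rng.choice([n for n, v in g.items() if not isinstance(v, tuple)])
        g[leaf + "_a"], g[leaf + "_b"] = g[leaf], rng.randint(1, 9)
        g[leaf] = ("+", leaf + "_a", leaf + "_b")
        return g

    # Original benchmark item: answer = (3 + 4) * 2 = 14
    graph = {"x": 3, "y": 4, "z": 2, "s": ("+", "x", "y"), "answer": ("*", "s", "z")}
    harder = perturb(graph, random.Random(0))
    print(eval_graph(graph), eval_graph(harder))  # ground truths stay computable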
- RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of Large Language Model (LLM) representations for evaluation.
Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
arXiv Detail & Related papers (2024-04-30T13:50:55Z)
- Reasoning on Efficient Knowledge Paths: Knowledge Graph Guides Large Language Model for Domain Question Answering [18.94220625114711]
Large language models (LLMs) perform surprisingly well and outperform human experts on many tasks.
This paper integrates and optimizes a pipeline for selecting reasoning paths from a knowledge graph (KG) based on an LLM.
We also propose a simple and effective subgraph retrieval method based on chain of thought (CoT) and PageRank.
arXiv Detail & Related papers (2024-04-16T08:28:16Z)
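The PageRank part of that retrieval step can be sketched with networkx: entities linked from the question seed a personalized PageRank run over the knowledge graph, and the top-scoring nodes induce the subgraph handed to the LLM. The toy graph, the seed set, and the cut-off are assumptions, and the paper's chain-of-thought guidance is not modeled here.

    import networkx as nx

    def retrieve_subgraph(kg, seed_entities, top_k=4):
        """Rank nodes by personalized PageRank from the question's entities
        and return the subgraph induced by the top-k nodes."""
        personalization = {n: (1.0 if n in seed_entities else 0.0) for n in kg}
        scores = nx.pagerank(kg, alpha=0.85, personalization=personalization)
        keep = sorted(scores, key=scores.get, reverse=True)[:top_k]
        return kg.subgraph(keep)

    # Toy knowledge graph; edges carry the relation as an attribute.
    kg = nx.DiGraph()
    kg.add_edge("aspirin", "pain", relation="treats")
    kg.add_edge("aspirin", "salicylate", relation="is_a")
    kg.add_edge("salicylate", "willow_bark", relation="derived_from")
    kg.add_edge("ibuprofen", "pain", relation="treats")

    sub = retrieve_subgraph(kg, seed_entities={"aspirin"})
    print(list(sub.edges(data="relation")))  # evidence handed to the LLM prompt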
- Evaluating the Factuality of Large Language Models using Large-Scale Knowledge Graphs [30.179703001666173]
The factuality issue is a critical concern for Large Language Models (LLMs).
We propose GraphEval to evaluate an LLM's performance using a substantially large test dataset.
The test dataset is retrieved from a large knowledge graph with more than 10 million facts, without expensive human effort.
arXiv Detail & Related papers (2024-04-01T06:01:17Z)
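A GraphEval-style factuality probe can be sketched as follows: knowledge-graph triples are rendered as true/false statements, some with a corrupted tail entity, and the LLM's judgments are scored. The statement template, the ask_llm stub, and the corruption scheme are illustrative assumptions rather than the paper's exact protocol.

    import random

    def make_probes(triples, entities, rng, neg_ratio=0.5):
        """Turn (head, relation, tail) facts into labeled true/false statements;
        negatives swap in a random incorrect tail entity."""
        probes = []
        for head, rel, tail in triples:
            if rng.random() < neg_ratio:
                fake = rng.choice([e for e in entities if e != tail])
                probes.append((f"{head} {rel} {fake}. True or false?", False))
            else:
                probes.append((f"{head} {rel} {tail}. True or false?", True))
        return probes

    def ask_llm(prompt):
        """Placeholder for a real model call; returns a True/False judgment."""
        return True  # trivial stub so the sketch runs end to end

    rng = random.Random(0)
    triples = [("Paris", "is the capital of", "France"),
               ("Mount Everest", "is located in", "Nepal")]
    entities = ["France", "Nepal", "Spain", "Canada"]
    probes = make_probes(triples, entities, rng)
    accuracy = sum(ask_llm(q) == label for q, label in probes) / len(probes)
    print(f"factuality accuracy: {accuracy:.2f}")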
- Large Language Models: A Survey [69.72787936480394]
Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks.
LLMs' ability for general-purpose language understanding and generation is acquired by training billions of model parameters on massive amounts of text data.
arXiv Detail & Related papers (2024-02-09T05:37:09Z)
- LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models [55.304181390027274]
This paper presents a comprehensive evaluation of publicly available large multimodal models by building an LVLM evaluation hub (LVLM-eHub).
Our LVLM-eHub consists of 8 representative LVLMs, such as InstructBLIP and MiniGPT-4, which are thoroughly evaluated by a quantitative capability evaluation and an online arena platform.
The study reveals several innovative findings. First, instruction-tuned LVLMs with massive in-domain data, such as InstructBLIP, heavily overfit many existing tasks and generalize poorly in open-world scenarios.
arXiv Detail & Related papers (2023-06-15T16:39:24Z)
- Statistical Knowledge Assessment for Large Language Models [79.07989821512128]
Given varying prompts regarding a factoid question, can a large language model (LLM) reliably generate factually correct answers?
We propose KaRR, a statistical approach to assess factual knowledge for LLMs.
Our results reveal that the knowledge in LLMs with the same backbone architecture adheres to the scaling law, while tuning on instruction-following data sometimes compromises the model's capability to generate factually correct text reliably.
arXiv Detail & Related papers (2023-05-17T18:54:37Z)
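The underlying question of KaRR, whether a model states the same fact correctly under differently worded prompts, can be approximated by a reliability score over prompt paraphrases, as sketched below. The templates, the query_llm stub, and the substring check are assumptions; KaRR's actual statistic is more involved than the plain accuracy computed here.

    TEMPLATES = [
        "What is the capital of {subject}?",
        "{subject}'s capital city is called what?",
        "Name the capital of {subject}.",
    ]

    def query_llm(prompt):
        """Placeholder for a real model call; returns the model's text answer."""
        return "Paris"  # stub so the sketch runs

    def knowledge_reliability(subject, gold_answer):
        """Fraction of prompt variants for which the model states the fact."""
        hits = sum(
            gold_answer.lower() in query_llm(t.format(subject=subject)).lower()
            for t in TEMPLATES
        )
        return hits / len(TEMPLATES)

    print(knowledge_reliability("France", "Paris"))  # 1.0 only if the fact is stable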
- Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback [127.75419038610455]
Large language models (LLMs) are able to generate human-like, fluent responses for many downstream tasks.
This paper proposes LLM-Augmenter, a system that augments a black-box LLM with a set of plug-and-play modules.
arXiv Detail & Related papers (2023-02-24T18:48:43Z)
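The LLM-Augmenter loop named in the title above can be sketched as: retrieve external evidence, let the black-box LLM draft a response, check the draft against the evidence, and feed an automated critique back into the prompt until the check passes. Every function below is a stub standing in for the paper's plug-and-play modules, so this is only a structural sketch.

    def retrieve_evidence(question):
        """Stub knowledge module: look up external evidence for the question."""
        return ["The Eiffel Tower was completed in 1889."]

    def black_box_llm(prompt):
        """Stub for the fixed, black-box LLM being augmented."""
        return "The Eiffel Tower was completed in 1889."

    def verify(response, evidence):
        """Stub checker: is the draft supported by the retrieved evidence?"""
        return any(fact in response for fact in evidence)

    def answer_with_feedback(question, max_rounds=3):
        evidence = retrieve_evidence(question)
        prompt = f"Question: {question}\nEvidence: {evidence}"
        for _ in range(max_rounds):
            draft = black_box_llm(prompt)
            if verify(draft, evidence):
                return draft
            # Automated feedback: append a critique and ask again.
            prompt += "\nFeedback: the previous answer was not grounded in the evidence."
        return draft

    print(answer_with_feedback("When was the Eiffel Tower completed?"))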