The FACTS Grounding Leaderboard: Benchmarking LLMs' Ability to Ground Responses to Long-Form Input
- URL: http://arxiv.org/abs/2501.03200v1
- Date: Mon, 06 Jan 2025 18:28:04 GMT
- Title: The FACTS Grounding Leaderboard: Benchmarking LLMs' Ability to Ground Responses to Long-Form Input
- Authors: Alon Jacovi, Andrew Wang, Chris Alberti, Connie Tao, Jon Lipovetz, Kate Olszewska, Lukas Haas, Michelle Liu, Nate Keating, Adam Bloniarz, Carl Saroufim, Corey Fry, Dror Marcus, Doron Kukliansky, Gaurav Singh Tomar, James Swirhun, Jinwei Xing, Lily Wang, Madhu Gurumurthy, Michael Aaron, Moran Ambar, Rachana Fellinger, Rui Wang, Zizhao Zhang, Sasha Goldshtein, Dipanjan Das
- Abstract summary: FACTS Grounding evaluates language models' ability to generate text that is factually accurate with respect to given context. Models are evaluated using automated judge models in two phases. The FACTS Grounding leaderboard will be actively maintained over time.
- Score: 19.692322010161636
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce FACTS Grounding, an online leaderboard and associated benchmark that evaluates language models' ability to generate text that is factually accurate with respect to given context in the user prompt. In our benchmark, each prompt includes a user request and a full document, with a maximum length of 32k tokens, requiring long-form responses. The long-form responses are required to be fully grounded in the provided context document while fulfilling the user request. Models are evaluated using automated judge models in two phases: (1) responses are disqualified if they do not fulfill the user request; (2) they are judged as accurate if the response is fully grounded in the provided document. The automated judge models were comprehensively evaluated against a held-out test-set to pick the best prompt template, and the final factuality score is an aggregate of multiple judge models to mitigate evaluation bias. The FACTS Grounding leaderboard will be actively maintained over time, and contains both public and private splits to allow for external participation while guarding the integrity of the leaderboard. It can be found at https://www.kaggle.com/facts-leaderboard.
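As a rough illustration of the two-phase protocol described in the abstract, the sketch below shows how a single response could be scored: a judge first checks whether the response fulfills the user request (disqualifying it otherwise), then each judge model in an ensemble assesses whether the response is grounded in the provided document, and the verdicts are averaged. All names here (call_judge, JUDGE_MODELS, the prompt templates) are illustrative assumptions, not the benchmark's actual implementation.

```python
# Hypothetical sketch of a two-phase grounding evaluation; judge names and
# prompt templates are placeholders, not those used by FACTS Grounding.
from statistics import mean

JUDGE_MODELS = ["judge-a", "judge-b", "judge-c"]  # assumed ensemble of judge LLMs

def call_judge(judge: str, prompt: str) -> bool:
    """Placeholder for an API call that returns a boolean verdict."""
    raise NotImplementedError("wire this to an LLM judge of your choice")

def fulfills_request(response: str, request: str, judge: str) -> bool:
    # Phase 1: disqualify responses that do not address the user request.
    prompt = f"User request:\n{request}\n\nResponse:\n{response}\n\nDoes the response fulfill the request? Answer yes or no."
    return call_judge(judge, prompt)

def is_grounded(response: str, document: str, judge: str) -> bool:
    # Phase 2: accept only responses fully supported by the provided document.
    prompt = f"Document:\n{document}\n\nResponse:\n{response}\n\nIs every claim in the response supported by the document? Answer yes or no."
    return call_judge(judge, prompt)

def factuality_score(response: str, request: str, document: str) -> float:
    # A response judged ineligible in phase 1 scores zero.
    if not all(fulfills_request(response, request, j) for j in JUDGE_MODELS):
        return 0.0
    # Average grounding verdicts across judges to mitigate single-judge bias,
    # mirroring the multi-judge aggregation described in the abstract.
    return mean(float(is_grounded(response, document, j)) for j in JUDGE_MODELS)
```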
Related papers
- Do Chatbot LLMs Talk Too Much? The YapBench Benchmark [1.6149401958316794]
YapBench is a benchmark for quantifying user-visible over-generation on brevity-ideal prompts. Each item consists of a single-turn prompt, a curated minimal-sufficient baseline answer, and a category label. We summarize model performance via the YapIndex, a uniformly weighted average of category-level median YapScores (a minimal sketch of this aggregation appears after this list).
arXiv Detail & Related papers (2026-01-02T09:43:52Z) - The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality [70.45240108873001]
The FACTS Leaderboard is an online leaderboard suite that comprehensively evaluates the ability of language models to generate factually accurate text. The suite provides a holistic measure of factuality by aggregating the performance of models on four distinct sub-leaderboards.
arXiv Detail & Related papers (2025-12-11T16:35:14Z) - The Silent Judge: Unacknowledged Shortcut Bias in LLM-as-a-Judge [17.555073770285095]
Large language models (LLMs) are increasingly deployed as automatic judges to evaluate system outputs in tasks such as summarization, dialogue, and creative writing. We show that current LLM judges fail on both counts by relying on shortcuts introduced in the prompt.
arXiv Detail & Related papers (2025-09-30T10:48:08Z) - Prompt-to-Leaderboard [20.299021582134202]
We propose Prompt-to-Leaderboard (P2L), a method that produces leaderboards specific to a prompt.
Our findings suggest that P2L's ability to produce prompt-specific evaluations follows a power law scaling similar to that observed in LLMs themselves.
arXiv Detail & Related papers (2025-02-20T18:58:07Z) - Contextualized Evaluations: Taking the Guesswork Out of Language Model Evaluations [85.81295563405433]
Language model users often issue queries that lack specification, where the context under which a query was issued is not explicit.
We present contextualized evaluations, a protocol that synthetically constructs context surrounding an under-specified query and provides it during evaluation.
We find that the presence of context can 1) alter conclusions drawn from evaluation, even flipping win rates between model pairs, 2) nudge evaluators to make fewer judgments based on surface-level criteria, like style, and 3) provide new insights about model behavior across diverse contexts.
arXiv Detail & Related papers (2024-11-11T18:58:38Z) - Trust but Verify: Programmatic VLM Evaluation in the Wild [62.14071929143684]
Programmatic VLM Evaluation (PROVE) is a new benchmarking paradigm for evaluating VLM responses to open-ended queries.
We benchmark the helpfulness-truthfulness trade-offs of a range of VLMs on PROVE, finding that very few are in fact able to achieve a good balance between the two.
arXiv Detail & Related papers (2024-10-17T01:19:18Z) - Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models [67.62126108440003]
We introduce Vibe-Eval: a new open benchmark and framework for evaluating multimodal chat models.
Vibe-Eval consists of 269 visual understanding prompts, including 100 of hard difficulty, complete with gold-standard responses authored by experts.
We discuss trade-offs between human and automatic evaluation, and show that automatic model evaluation using Reka Core roughly correlates to human judgment.
arXiv Detail & Related papers (2024-05-03T17:59:55Z) - RefuteBench: Evaluating Refuting Instruction-Following for Large Language Models [17.782410287625645]
This paper proposes a benchmark, RefuteBench, covering tasks such as question answering, machine translation, and email writing.
The evaluation aims to assess whether models can positively accept feedback in the form of refuting instructions and whether they can consistently adhere to user demands throughout the conversation.
arXiv Detail & Related papers (2024-02-21T01:39:56Z) - PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models [72.57329554067195]
ProxyQA is an innovative framework dedicated to assessing long-form text generation.
It comprises in-depth human-curated meta-questions spanning various domains, each accompanied by specific proxy-questions with pre-annotated answers.
It assesses the generated content's quality through the evaluator's accuracy in addressing the proxy-questions.
arXiv Detail & Related papers (2024-01-26T18:12:25Z) - Evaluating Large Language Models for Document-grounded Response Generation in Information-Seeking Dialogues [17.41334279810008]
We investigate the use of large language models (LLMs) like ChatGPT for document-grounded response generation in the context of information-seeking dialogues.
For evaluation, we use the MultiDoc2Dial corpus of task-oriented dialogues in four social service domains.
While both ChatGPT variants are more likely to include information not present in the relevant segments, possibly indicating hallucination, they are rated higher than both the shared task's winning system and human responses.
arXiv Detail & Related papers (2023-09-21T07:28:03Z) - Large Language Models are not Fair Evaluators [60.27164804083752]
We find that the quality ranking of candidate responses can be easily hacked by altering their order of appearance in the context.
This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other.
We propose a framework with three simple yet effective strategies to mitigate this issue.
arXiv Detail & Related papers (2023-05-29T07:41:03Z) - What is wrong with you?: Leveraging User Sentiment for Automatic Dialog Evaluation [73.03318027164605]
We propose to use information that can be automatically extracted from the next user utterance as a proxy to measure the quality of the previous system response.
Our model generalizes across both spoken and written open-domain dialog corpora collected from real and paid users.
arXiv Detail & Related papers (2022-03-25T22:09:52Z)
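The YapBench entry above defines the YapIndex as a uniformly weighted average of category-level median YapScores. Below is a minimal sketch of that aggregation, assuming per-item YapScores tagged with a category label; the field names and data layout are assumptions, not the benchmark's actual format.

```python
# Illustrative YapIndex-style aggregation: median YapScore per category,
# then an unweighted mean across categories, as described in the summary.
from collections import defaultdict
from statistics import mean, median

def yap_index(items: list[dict]) -> float:
    """items: [{"category": str, "yap_score": float}, ...] (assumed layout)."""
    by_category: dict[str, list[float]] = defaultdict(list)
    for item in items:
        by_category[item["category"]].append(item["yap_score"])
    # Median within each category, then each category contributes equally.
    return mean(median(scores) for scores in by_category.values())

# Example: two categories weighted equally regardless of their item counts.
example = [
    {"category": "factual", "yap_score": 1.2},
    {"category": "factual", "yap_score": 3.0},
    {"category": "opinion", "yap_score": 0.5},
]
print(yap_index(example))  # median(1.2, 3.0)=2.1, median(0.5)=0.5 -> mean=1.3
```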
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.