Auto-BenchmarkCard: Automated Synthesis of Benchmark Documentation
- URL: http://arxiv.org/abs/2512.09577v1
- Date: Wed, 10 Dec 2025 12:09:44 GMT
- Title: Auto-BenchmarkCard: Automated Synthesis of Benchmark Documentation
- Authors: Aris Hofmann, Inge Vejsbjerg, Dhaval Salwala, Elizabeth M. Daly,
- Abstract summary: Auto-BenchmarkCard is a workflow for generating validated descriptions of AI benchmarks. Benchmark documentation is often incomplete or inconsistent, making it difficult to interpret and compare benchmarks across tasks or domains.
- Score: 4.044540605397838
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Auto-BenchmarkCard, a workflow for generating validated descriptions of AI benchmarks. Benchmark documentation is often incomplete or inconsistent, making it difficult to interpret and compare benchmarks across tasks or domains. Auto-BenchmarkCard addresses this gap by combining multi-agent data extraction from heterogeneous sources (e.g., Hugging Face, Unitxt, academic papers) with LLM-driven synthesis. A validation phase evaluates factual accuracy through atomic entailment scoring using the FactReasoner tool. This workflow has the potential to promote transparency, comparability, and reusability in AI benchmark reporting, enabling researchers and practitioners to better navigate and evaluate benchmark choices.
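To make the abstract's pipeline concrete, here is a minimal sketch of an extract-synthesize-validate loop in the spirit of Auto-BenchmarkCard. It is illustrative only: the `BenchmarkCard` dataclass and the `extract`, `synthesize`, and `validate` helpers are hypothetical names introduced for this sketch, and the constant-score lambda stands in for a real entailment model such as FactReasoner rather than reflecting its actual API.

```python
# Illustrative sketch only: names and structure are assumptions, not the
# paper's implementation or the FactReasoner API.
from dataclasses import dataclass, field

@dataclass
class BenchmarkCard:
    name: str
    fields: dict = field(default_factory=dict)    # synthesized card fields
    evidence: list = field(default_factory=list)  # raw snippets per source

def extract(sources):
    """Stand-in for the multi-agent extraction step: one record per source."""
    return [{"source": s, "text": f"snippet gathered from {s}"} for s in sources]

def synthesize(name, evidence):
    """Stand-in for LLM-driven synthesis: merge evidence into a draft card."""
    card = BenchmarkCard(name=name, evidence=evidence)
    card.fields["description"] = " ".join(e["text"] for e in evidence)
    return card

def validate(card, entailment_score, threshold=0.7):
    """Split the draft into atomic claims; keep only those entailed by evidence."""
    premise = " ".join(e["text"] for e in card.evidence)
    claims = [c.strip() for c in card.fields["description"].split(".") if c.strip()]
    return [c for c in claims if entailment_score(premise, c) >= threshold]

if __name__ == "__main__":
    evidence = extract(["Hugging Face", "Unitxt", "the benchmark's paper"])
    draft = synthesize("SomeBenchmark", evidence)
    # A real pipeline would call an entailment model here (the paper uses
    # FactReasoner); a constant scorer keeps the sketch self-contained.
    validated_claims = validate(draft, entailment_score=lambda premise, claim: 0.9)
    print(validated_claims)
```

In a real pipeline, each extraction agent would parse its source instead of returning placeholder text, and the lambda would be replaced by a genuine entailment scorer.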
Related papers
- AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms [54.99368693313797]
Existing benchmarks test only individual languages/tools, so the performance numbers are not directly comparable. We address this gap with AlgoVeri, a benchmark that evaluates vericoding of 77 classical algorithms in Dafny, Verus, and Lean.
arXiv Detail & Related papers (2026-02-10T06:58:26Z) - InfoSynth: Information-Guided Benchmark Synthesis for LLMs [69.80981631587501]
Large language models (LLMs) have demonstrated significant advancements in reasoning and code generation. Traditional benchmark creation relies on manual human effort, a process that is both expensive and time-consuming. This work introduces InfoSynth, a novel framework for automatically generating and evaluating reasoning benchmarks.
arXiv Detail & Related papers (2026-01-02T05:26:27Z) - PACIFIC: a framework for generating benchmarks to check Precise Automatically Checked Instruction Following In Code [1.1164117387254457]
Large Language Model (LLM)-based code assistants have emerged as a powerful application of generative AI. A key requirement for these systems is their ability to accurately follow user instructions. We present PACIFIC, a novel framework designed to automatically generate benchmarks that rigorously assess sequential instruction-following and code dry-running capabilities.
arXiv Detail & Related papers (2025-12-11T14:49:56Z) - Mapping Overlaps in Benchmarks through Perplexity in the Wild [8.321258152814986]
We develop signatures of capacity familiarity to characterize large language model (LLM) benchmarks and their meaningful overlaps. Our analysis situates signatures in relation to both the semantic similarity of benchmark questions and the correlation of model performance. Ultimately, we identify cross-functional overlaps across logic, math, language, instruction following, and world modeling, with coding emerging as the least overlapping domain.
arXiv Detail & Related papers (2025-09-27T20:23:13Z) - Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks [34.09939383415074]
Benchmark Profiling decomposes benchmark performance into ten cognitively grounded abilities. It explains why performance gains do not always translate into user-perceived competence.
arXiv Detail & Related papers (2025-09-23T15:32:47Z) - The BrowserGym Ecosystem for Web Agent Research [151.90034093362343]
The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents. We propose an extended BrowserGym-based ecosystem for web agent research, which unifies existing benchmarks from the literature. We conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across 6 popular web agent benchmarks.
arXiv Detail & Related papers (2024-12-06T23:43:59Z) - BenchAgents: Multi-Agent Systems for Structured Benchmark Creation [23.653678381444276]
BenchAgents is a framework that automates the creation of evaluation benchmarks. We use BenchAgents to create benchmarks to evaluate capabilities related to planning, constraint satisfaction, and causal reasoning. We then use these benchmarks to study state-of-the-art models and extract new insights into common failure modes and model differences.
arXiv Detail & Related papers (2024-10-29T22:56:18Z) - BenchmarkCards: Standardized Documentation for Large Language Model Benchmarks [23.263430784766026]
Large language models (LLMs) are powerful tools capable of handling diverse tasks. Finding suitable benchmarks is difficult given the many available options. We introduce BenchmarkCards, an intuitive and validated documentation framework.
arXiv Detail & Related papers (2024-10-16T19:09:02Z) - Entity Disambiguation via Fusion Entity Decoding [68.77265315142296]
We propose an encoder-decoder model to disambiguate entities with more detailed entity descriptions.
We observe +1.5% improvements in end-to-end entity linking in the GERBIL benchmark compared with EntQA.
arXiv Detail & Related papers (2024-04-02T04:27:54Z) - Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers [121.53749383203792]
We present a holistic end-to-end solution for annotating the factuality of large language models (LLMs)-generated responses.
We construct an open-domain document-level factuality benchmark in three-level granularity: claim, sentence and document.
Preliminary experiments show that FacTool, FactScore and Perplexity struggle to identify false claims.
arXiv Detail & Related papers (2023-11-15T14:41:57Z) - Generating Benchmarks for Factuality Evaluation of Language Models [61.69950787311278]
We propose FACTOR: Factual Assessment via Corpus TransfORmation, a scalable approach for evaluating LM factuality.
FACTOR automatically transforms a factual corpus of interest into a benchmark evaluating an LM's propensity to generate true facts from the corpus vs. similar but incorrect statements (a minimal sketch of this true-vs-false comparison appears after the related-papers list below).
We show that: (i) our benchmark scores increase with model size and improve when the LM is augmented with retrieval; (ii) benchmark score and perplexity do not always agree on model ranking; (iii) when perplexity and benchmark score disagree, the latter better reflects factuality in open-ended generation.
arXiv Detail & Related papers (2023-07-13T17:14:38Z) - Towards Interpretable and Efficient Automatic Reference-Based Summarization Evaluation [160.07938471250048]
Interpretability and efficiency are two important considerations for the adoption of neural automatic metrics.
We develop strong-performing automatic metrics for reference-based summarization evaluation.
arXiv Detail & Related papers (2023-03-07T02:49:50Z) - Mystique: Enabling Accurate and Scalable Generation of Production AI Benchmarks [2.0315147707806283]
Mystique is an accurate and scalable framework for production AI benchmark generation.
Mystique is scalable due to its lightweight data collection, with low runtime overhead and instrumentation effort.
We evaluate our methodology on several production AI models, and show that benchmarks generated with Mystique closely resemble original AI models.
arXiv Detail & Related papers (2022-12-16T18:46:37Z) - Benchmarking Answer Verification Methods for Question Answering-Based Summarization Evaluation Metrics [74.28810048824519]
Question answering-based summarization evaluation metrics must automatically determine whether the QA model's prediction is correct or not.
We benchmark the lexical answer verification methods which have been used by current QA-based metrics as well as two more sophisticated text comparison methods.
arXiv Detail & Related papers (2022-04-21T15:43:45Z)
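The FACTOR entry above describes scoring an LM on whether it prefers true statements over similar but incorrect ones. Below is a minimal, self-contained illustration of that general comparison using an off-the-shelf causal LM; the model choice (`gpt2`), the helper names, and the toy example are assumptions and do not reproduce the authors' benchmark construction.

```python
# Toy illustration of a true-vs-false likelihood comparison; not the FACTOR
# implementation. Model choice and helper names are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def mean_log_likelihood(text: str) -> float:
    """Mean per-token log-likelihood of `text` under the model."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return -out.loss.item()  # loss is the mean negative log-likelihood

def prefers_true(true_stmt: str, false_variants: list) -> bool:
    """True if the model scores the factual statement above every variant."""
    true_score = mean_log_likelihood(true_stmt)
    return all(true_score > mean_log_likelihood(v) for v in false_variants)

if __name__ == "__main__":
    true_stmt = "Water boils at 100 degrees Celsius at sea level."
    false_variants = [
        "Water boils at 50 degrees Celsius at sea level.",
        "Water freezes at 100 degrees Celsius at sea level.",
    ]
    print(prefers_true(true_stmt, false_variants))
```

Using the mean per-token log-likelihood keeps statements of different lengths roughly comparable; FACTOR's actual perturbation generation and scoring are more involved than this single hand-written pair.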
This list is automatically generated from the titles and abstracts of the papers on this site.