AA-Omniscience: Evaluating Cross-Domain Knowledge Reliability in Large Language Models
- URL: http://arxiv.org/abs/2511.13029v1
- Date: Mon, 17 Nov 2025 06:27:16 GMT
- Title: AA-Omniscience: Evaluating Cross-Domain Knowledge Reliability in Large Language Models
- Authors: Declan Jackson, William Keating, George Cameron, Micah Hill-Smith
- Abstract summary: AA-Omniscience is a benchmark designed to measure factual recall and knowledge calibration across 6,000 questions.
The evaluation measures a model's Omniscience Index, a bounded metric (-100 to 100) of factual recall that jointly penalizes hallucinations and rewards abstention when uncertain.
Results reveal persistent factuality and calibration weaknesses across frontier models.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing language model evaluations primarily measure general capabilities, yet reliable use of these models across a range of domains demands factual accuracy and recognition of knowledge gaps. We introduce AA-Omniscience, a benchmark designed to measure both factual recall and knowledge calibration across 6,000 questions. Questions are derived from authoritative academic and industry sources and cover 42 economically relevant topics within six domains. The evaluation measures a model's Omniscience Index, a bounded metric (-100 to 100) of factual recall that jointly penalizes hallucinations and rewards abstention when uncertain; a score of 0 corresponds to a model that answers correctly as often as it answers incorrectly. Among evaluated models, Claude 4.1 Opus attains the highest score (4.8), making it one of only three models to score above zero. These results reveal persistent factuality and calibration weaknesses across frontier models. Performance also varies by domain, with models from three different research labs leading across the six domains. For tasks where knowledge is important, this variability suggests models should be selected according to the demands of the use case rather than by general performance.
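The abstract does not give the exact Omniscience Index formula, but its stated properties (bounded in [-100, 100], zero when correct and incorrect answers balance, hallucinations penalized, abstention not punished) pin down a natural candidate. The Python sketch below is an illustrative reconstruction under those assumptions, not the paper's official scoring code; the function name and the example counts are invented for illustration.

```python
def omniscience_index(n_correct: int, n_incorrect: int, n_abstained: int) -> float:
    """Hypothetical Omniscience Index consistent with the abstract's description.

    Bounded in [-100, 100]; 0 means the model answers correctly exactly as
    often as it answers incorrectly. Incorrect answers (hallucinations) are
    penalized, while abstentions contribute nothing, so abstaining beats
    guessing wrong whenever the model is uncertain.
    """
    n_total = n_correct + n_incorrect + n_abstained
    if n_total == 0:
        raise ValueError("no questions scored")
    return 100.0 * (n_correct - n_incorrect) / n_total


# A model that answers 2,000 questions correctly, 1,800 incorrectly, and
# abstains on the remaining 2,200 of 6,000 lands slightly above zero,
# roughly where the abstract places the top-scoring models.
print(omniscience_index(2000, 1800, 2200))  # 3.33...
```

Under this reading, abstaining converts a likely wrong answer's -1 contribution into a 0, which is exactly the incentive toward calibrated abstention the abstract describes.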
Related papers
- Encyclo-K: Evaluating LLMs with Dynamically Composed Knowledge Statements
Existing benchmarks predominantly curate questions at the question level.
We propose Encyclo-K, a statement-based benchmark that rethinks benchmark construction from the ground up.
arXiv Detail & Related papers (2025-12-31T13:55:54Z)
- Do Large Language Models Know What They Don't Know? KalshiBench: A New Benchmark for Evaluating Epistemic Calibration via Prediction Markets
A well-calibrated model should express confidence that matches its actual accuracy: when it claims 80% confidence, it should be correct 80% of the time.
We introduce KalshiBench, a benchmark of 300 prediction market questions from Kalshi, a CFTC-regulated exchange.
We evaluate five frontier models (Claude Opus 4.5, GPT-5.2, DeepSeek-V3.2, Qwen3-235B, and Kimi-K2) and find systematic overconfidence across all models; a minimal calibration check is sketched after this entry.
arXiv Detail & Related papers (2025-12-17T23:23:06Z)
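The calibration notion in the KalshiBench entry above has a standard quantitative form, expected calibration error (ECE). The sketch below is a minimal illustration of that general statistic, assuming equal-width confidence bins; it is not drawn from the KalshiBench paper, and the bin count and example data are invented.

```python
from collections import defaultdict


def expected_calibration_error(confidences, corrects, n_bins=10):
    """Standard ECE: bucket predictions by stated confidence, then compare
    each bucket's average confidence with its empirical accuracy."""
    bins = defaultdict(list)
    for conf, correct in zip(confidences, corrects):
        # Map confidence in [0, 1] to one of n_bins equal-width buckets.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    n = len(confidences)
    ece = 0.0
    for bucket in bins.values():
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece


# A model that claims 80% confidence but is right only half the time:
print(expected_calibration_error([0.8] * 10, [1, 0] * 5))  # 0.3 -> overconfident
```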
- Curse of Knowledge: When Complex Evaluation Context Benefits yet Biases LLM Judges
The paradigm of large language models (LLMs) as judges has emerged as a scalable solution, yet prior work primarily focuses on simple settings.
Our in-depth analysis offers crucial insights for improving the accuracy and verifiability of evaluation signals.
arXiv Detail & Related papers (2025-09-03T15:48:33Z)
- MDK12-Bench: A Comprehensive Evaluation of Multimodal Large Language Models on Multidisciplinary Exams
Multimodal large language models (MLLMs) integrate language and visual cues for problem-solving.
Current benchmarks for measuring the intelligence of MLLMs suffer from limited scale, narrow coverage, and unstructured knowledge.
We introduce MDK12-Bench, a large-scale multidisciplinary benchmark built from real-world K-12 exams spanning six disciplines.
arXiv Detail & Related papers (2025-08-09T06:21:10Z)
- Reliable Decision Support with LLMs: A Framework for Evaluating Consistency in Binary Text Classification Applications
This study introduces a framework for evaluating consistency in large language model (LLM) binary text classification.
We determine sample size requirements, develop metrics for invalid responses, and evaluate intra- and inter-rater reliability; a toy reliability computation is sketched after this entry.
arXiv Detail & Related papers (2025-05-20T21:12:58Z)
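Inter- and intra-rater reliability, mentioned in the entry above, is commonly quantified with agreement statistics such as Cohen's kappa. The sketch below computes kappa for two binary label sequences; whether the paper uses this exact statistic is an assumption, and the example runs are invented.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two binary label sequences: agreement beyond chance.

    1.0 is perfect agreement, 0.0 is chance-level; often used to quantify
    intra-rater reliability by comparing two runs of the same LLM classifier.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    p_a1 = sum(labels_a) / n
    p_b1 = sum(labels_b) / n
    expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    if expected == 1:
        return 1.0  # degenerate case: both raters always give the same label
    return (observed - expected) / (1 - expected)


# Two runs of the same classifier over eight items:
run1 = [1, 1, 0, 0, 1, 0, 1, 0]
run2 = [1, 1, 0, 0, 1, 0, 0, 0]
print(cohens_kappa(run1, run2))  # 0.75: strong but imperfect consistency
```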
- An Empirical Comparison of Text Summarization: A Multi-Dimensional Evaluation of Large Language Models
This research evaluates summarization performance across 17 large language models (OpenAI, Google, Anthropic, open-source).
We assessed models on seven diverse datasets using metrics for factual consistency, semantic similarity, lexical overlap, and human-like quality; a toy lexical-overlap metric is sketched after this entry.
arXiv Detail & Related papers (2025-04-06T16:24:22Z)
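Of the metric families the summarization entry above lists, lexical overlap is the easiest to make concrete. The sketch below implements a token-level unigram F1, a crude stand-in for established overlap metrics such as ROUGE-1; it is illustrative only and not the paper's actual metric suite.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Token-level F1: harmonic mean of unigram precision and recall.

    A simple lexical-overlap score in the spirit of ROUGE-1; real evaluations
    typically rely on established packages rather than this sketch.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped counts of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


print(unigram_f1("the model summarizes text", "the model summarizes long text"))
# 0.888...: high overlap, one reference token missed
```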
- Critical Foreign Policy Decisions (CFPD)-Benchmark: Measuring Diplomatic Preferences in Large Language Models
This study presents a novel benchmark designed to evaluate the biases and preferences of seven prominent foundation models.
We used 400 expert-crafted scenarios to analyze results from our selected models.
All models exhibit some degree of country-specific biases, often recommending less escalatory and interventionist actions for China and Russia.
arXiv Detail & Related papers (2025-03-08T16:19:13Z)
- Comparative Insights from 12 Machine Learning Models in Extracting Economic Ideology from Political Text
This study conducts a systematic assessment of the capabilities of 12 machine learning models and model variations in detecting economic ideology.
The analysis assesses the performance of several generative, fine-tuned, and zero-shot models at the granular and aggregate levels.
arXiv Detail & Related papers (2025-01-16T18:06:22Z)
- A Comprehensive Evaluation and Analysis Study for Chinese Spelling Check
We show that using phonetic and graphic information reasonably is effective for Chinese Spelling Check.
Models are sensitive to the error distribution of the test set, which reflects the shortcomings of models.
The commonly used benchmark, SIGHAN, cannot reliably evaluate models' performance.
arXiv Detail & Related papers (2023-07-25T17:02:38Z)
- Measuring Massive Multitask Chinese Understanding
This test encompasses four major domains, including medicine, law, psychology, and education.
The best-performing models in the zero-shot setting outperformed the worst-performing models by nearly 18.6 percentage points on average.
All models performed poorly in the legal domain, with the highest zero-shot accuracy reaching only 0.239.
arXiv Detail & Related papers (2023-04-25T16:51:53Z)
- Exploring Strategies for Generalizable Commonsense Reasoning with Pre-trained Models
We measure the impact of three different adaptation methods on the generalization and accuracy of models.
Experiments with two models show that fine-tuning performs best, by learning both the content and the structure of the task, but suffers from overfitting and limited generalization to novel answers.
We observe that alternative adaptation methods like prefix-tuning have comparable accuracy, but generalize better to unseen answers and are more robust to adversarial splits.
arXiv Detail & Related papers (2021-09-07T03:13:06Z)
- Knowledge-driven Data Construction for Zero-shot Evaluation in Commonsense Question Answering
We propose a novel neuro-symbolic framework for zero-shot question answering across commonsense tasks.
We vary the set of language models, training regimes, knowledge sources, and data generation strategies, and measure their impact across tasks.
We show that, while an individual knowledge graph is better suited for specific tasks, a global knowledge graph brings consistent gains across different tasks.
arXiv Detail & Related papers (2020-11-07T22:52:21Z)