Related papers: TruthTensor: Evaluating LLMs through Human Imitation on Prediction Market under Drift and Holistic Reasoning

TruthTensor: Evaluating LLMs through Human Imitation on Prediction Market under Drift and Holistic Reasoning

URL: http://arxiv.org/abs/2601.13545v3
Date: Sun, 25 Jan 2026 16:11:15 GMT
Title: TruthTensor: Evaluating LLMs through Human Imitation on Prediction Market under Drift and Holistic Reasoning
Authors: Shirin Shahabi, Spencer Graham, Haruna Isah,
Abstract summary: This paper introduces TruthTensor, a novel, reproducible evaluation paradigm that measures reasoning models.<n>Building on forward-looking, contamination-free tasks, our framework anchors evaluation to live prediction markets.<n>TruthTensor demonstrates that models with similar forecast accuracy can diverge markedly in calibration, drift, and risk-sensitivity.
Score: 0.0
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Evaluating language models and AI agents remains fundamentally challenging because static benchmarks fail to capture real-world uncertainty, distribution shift, and the gap between isolated task accuracy and human-aligned decision-making under evolving conditions. This paper introduces TruthTensor, a novel, reproducible evaluation paradigm that measures reasoning models not only as prediction engines but as human-imitation systems operating in socially-grounded, high-entropy environments. Building on forward-looking, contamination-free tasks, our framework anchors evaluation to live prediction markets and combines probabilistic scoring to provide a holistic view of model behavior. TruthTensor complements traditional correctness metrics with drift-centric diagnostics and explicit robustness checks for reproducibility. It specify human vs. automated evaluation roles, annotation protocols, and statistical testing procedures to ensure interpretability and replicability of results. In experiments across 500+ real markets (political, economic, cultural, technological), TruthTensor demonstrates that models with similar forecast accuracy can diverge markedly in calibration, drift, and risk-sensitivity, underscoring the need to evaluate models along multiple axes (accuracy, calibration, narrative stability, cost, and resource efficiency). TruthTensor therefore operationalizes modern evaluation best practices, clear hypothesis framing, careful metric selection, transparent compute/cost reporting, human-in-the-loop validation, and open, versioned evaluation contracts, to produce defensible assessments of LLMs in real-world decision contexts. We publicly released TruthTensor at https://truthtensor.com.

Related papers

Uncertainty and Fairness Awareness in LLM-Based Recommendation Systems [3.937681476010311]
This paper studies how uncertainty and fairness evaluations affect the accuracy, consistency, and trustworthiness of large language models (LLMs)<n>We quantify predictive uncertainty (via entropy) and demonstrate that Google DeepMind's Gemini 1.5 Flash exhibits systematic unfairness for certain sensitive attributes.<n>We propose a novel uncertainty-aware evaluation methodology for RecLLMs, present empirical insights from deep uncertainty case studies, and introduce a personality profile-informed fairness benchmark.
arXiv Detail & Related papers (2026-01-31T17:18:13Z)
AICO: Feature Significance Tests for Supervised Learning [0.9474649136535703]
AICO asks, for any trained regression or classification model, whether each feature genuinely improves model performance.<n>It does so by masking the feature's information and measuring the resulting change in performance.<n>AICO consistently pinpoints the variables that drive model behavior.
arXiv Detail & Related papers (2025-06-29T21:15:40Z)
Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs [7.197702136906138]
We propose an uncertainty-aware fairness metric, UCerF, to enable a fine-grained evaluation of model fairness.<n> observing data size, diversity, and clarity issues in current datasets, we introduce a new gender-occupation fairness evaluation dataset.<n>We establish a benchmark, using our metric and dataset, and apply it to evaluate the behavior of ten open-source AI systems.
arXiv Detail & Related papers (2025-05-29T20:45:18Z)
Probabilistic Machine Learning for Noisy Labels in Earth Observation [5.068845500478373]
We train uncertainty-aware probabilistic models across a broad range of high-impact Earth Observation applications.<n>We validate the reliability of the predicted uncertainty estimates, enhancing the interpretability of model predictions.<n>Our findings emphasize the importance of modeling label noise and incorporating uncertainty quantification in EO, paving the way for more accurate, reliable, and trustworthy ML solutions in the field.
arXiv Detail & Related papers (2025-04-04T14:36:33Z)
Poor-Supervised Evaluation for SuperLLM via Mutual Consistency [20.138831477848615]
We propose the PoEM framework to conduct evaluation without accurate labels. We first prove that the capability of a model can be equivalently assessed by the consistency between it and certain reference model. To alleviate the insufficiencies of the conditions in reality, we introduce an algorithm that treats humans (when available) and the models under evaluation as reference models.
arXiv Detail & Related papers (2024-08-25T06:49:03Z)
Identifying and Mitigating Social Bias Knowledge in Language Models [52.52955281662332]
We propose a novel debiasing approach, Fairness Stamp (FAST), which enables fine-grained calibration of individual social biases.<n>FAST surpasses state-of-the-art baselines with superior debiasing performance.<n>This highlights the potential of fine-grained debiasing strategies to achieve fairness in large language models.
arXiv Detail & Related papers (2024-08-07T17:14:58Z)
Fair Multivariate Adaptive Regression Splines for Ensuring Equity and Transparency [1.124958340749622]
We propose a fair predictive model based on MARS that incorporates fairness measures in the learning process. MARS is a non-parametric regression model that performs feature selection, handles non-linear relationships, generates interpretable decision rules, and derives optimal splitting criteria on the variables. We apply our fairMARS model to real-world data and demonstrate its effectiveness in terms of accuracy and equity.
arXiv Detail & Related papers (2024-02-23T19:02:24Z)
QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement. QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights. We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15% points relative.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
Bring Your Own Data! Self-Supervised Evaluation for Large Language Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs) We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence. We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z)
Reliability-Aware Prediction via Uncertainty Learning for Person Image Retrieval [51.83967175585896]
UAL aims at providing reliability-aware predictions by considering data uncertainty and model uncertainty simultaneously. Data uncertainty captures the noise" inherent in the sample, while model uncertainty depicts the model's confidence in the sample's prediction.
arXiv Detail & Related papers (2022-10-24T17:53:20Z)
Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions. In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data. We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples.
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
Approaching Neural Network Uncertainty Realism [53.308409014122816]
Quantifying or at least upper-bounding uncertainties is vital for safety-critical systems such as autonomous vehicles. We evaluate uncertainty realism -- a strict quality criterion -- with a Mahalanobis distance-based statistical test. We adopt it to the automotive domain and show that it significantly improves uncertainty realism compared to a plain encoder-decoder model.
arXiv Detail & Related papers (2021-01-08T11:56:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.