A Transparent Fairness Evaluation Protocol for Open-Source Language Model Benchmarking on the Blockchain
- URL: http://arxiv.org/abs/2508.09993v1
- Date: Tue, 29 Jul 2025 22:49:00 GMT
- Title: A Transparent Fairness Evaluation Protocol for Open-Source Language Model Benchmarking on the Blockchain
- Authors: Hugo Massaroli, Leonardo Iara, Emmanuel Iarussi, Viviana Siless
- Abstract summary: Large language models (LLMs) are increasingly deployed in real-world applications, yet concerns about their fairness persist. This paper introduces a transparent evaluation protocol for benchmarking the fairness of open-source LLMs using smart contracts on the Internet Computer Protocol (ICP) blockchain.
- Score: 0.18570740863168358
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large language models (LLMs) are increasingly deployed in real-world applications, yet concerns about their fairness persist, especially in high-stakes domains like criminal justice, education, healthcare, and finance. This paper introduces a transparent evaluation protocol for benchmarking the fairness of open-source LLMs using smart contracts on the Internet Computer Protocol (ICP) blockchain (Foundation, 2023). Our method ensures verifiable, immutable, and reproducible evaluations by executing on-chain HTTP requests to hosted Hugging Face endpoints and storing datasets, prompts, and metrics directly on-chain. We benchmark the Llama, DeepSeek, and Mistral models on the PISA dataset for academic performance prediction (OECD, 2018), a dataset suitable for fairness evaluation using statistical parity and equal opportunity metrics (Hardt et al., 2016). We also evaluate structured Context Association Metrics derived from the StereoSet dataset (Nadeem et al., 2020) to measure social bias in contextual associations. We further extend our analysis with a multilingual evaluation across English, Spanish, and Portuguese using the Kaleidoscope benchmark (Salazar et al., 2025), revealing cross-linguistic disparities. All code and results are open source, enabling community audits and longitudinal fairness tracking across model versions.
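The fairness metrics named in the abstract follow standard group-fairness definitions, so a short self-contained sketch can make them concrete. The snippet below illustrates statistical parity and equal opportunity (Hardt et al., 2016) for binary predictions and a binary sensitive attribute; the function names and toy data are illustrative assumptions only and are not taken from the paper's released code, which additionally executes these evaluations on-chain.

```python
import numpy as np

def statistical_parity_difference(y_pred, group):
    """Difference in positive-prediction rates between two groups:
    P(Y_hat = 1 | A = 1) - P(Y_hat = 1 | A = 0)."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    rate_a = y_pred[group == 0].mean()
    rate_b = y_pred[group == 1].mean()
    return rate_b - rate_a

def equal_opportunity_difference(y_true, y_pred, group):
    """Difference in true-positive rates between two groups:
    P(Y_hat = 1 | Y = 1, A = 1) - P(Y_hat = 1 | Y = 1, A = 0)."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    tpr_a = y_pred[(group == 0) & (y_true == 1)].mean()
    tpr_b = y_pred[(group == 1) & (y_true == 1)].mean()
    return tpr_b - tpr_a

# Toy usage: binary "above/below average" predictions for two demographic groups.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
group  = [0, 0, 0, 0, 1, 1, 1, 1]
print(statistical_parity_difference(y_pred, group))
print(equal_opportunity_difference(y_true, y_pred, group))
```

Values near zero indicate that the two groups receive positive predictions (and correct positive predictions) at similar rates; in the protocol described above, such metrics would be computed from model responses to PISA-derived prompts.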
Related papers
- Assessing and Improving the Representativeness of Code Generation Benchmarks Using Knowledge Units (KUs) of Programming Languages -- An Empirical Study [7.0773305889955616]
Large Language Models (LLMs) have shown impressive performance in code generation. LLMs must understand and apply a wide range of language concepts. If the concepts exercised in benchmarks are not representative of those used in real-world projects, evaluations may yield incomplete results.
arXiv Detail & Related papers (2026-01-07T10:23:33Z) - Ready to Translate, Not to Represent? Bias and Performance Gaps in Multilingual LLMs Across Language Families and Domains [6.357124887141297]
Large Language Models (LLMs) have redefined Machine Translation (MT). LLMs often exhibit uneven performance across language families and specialized domains. We introduce Translation Tangles, a unified framework and dataset for evaluating the translation quality and fairness of open-source LLMs.
arXiv Detail & Related papers (2025-10-09T07:28:30Z) - The AI Language Proficiency Monitor -- Tracking the Progress of LLMs on Multilingual Benchmarks [0.0]
We introduce the AI Language Monitor, a comprehensive benchmark that assesses the performance of large language models (LLMs) across up to 200 languages. Our benchmark aggregates diverse tasks including translation, question answering, math, and reasoning, using datasets such as FLORES+, MMLU, GSM8K, TruthfulQA, and ARC. We provide an open-source, auto-updating leaderboard and dashboard that supports researchers, developers, and policymakers in identifying strengths and gaps in model performance.
arXiv Detail & Related papers (2025-07-11T12:38:02Z) - Datasets for Fairness in Language Models: An In-Depth Survey [8.198294998446867]
This survey examines the most widely used fairness datasets in current language model research. We introduce a unified evaluation framework that reveals consistent patterns of demographic disparities across datasets and scoring methods. We highlight the often overlooked biases that can influence conclusions about model fairness and offer practical guidance for selecting, combining, and interpreting these datasets.
arXiv Detail & Related papers (2025-06-29T22:11:58Z) - OpenUnlearning: Accelerating LLM Unlearning via Unified Benchmarking of Methods and Metrics [101.78963920333342]
We introduce OpenUnlearning, a standardized framework for benchmarking unlearning methods and metrics for large language models (LLMs). OpenUnlearning integrates 9 unlearning algorithms and 16 diverse evaluations across 3 leading benchmarks. We also benchmark diverse unlearning methods and provide a comparative analysis against an extensive evaluation suite.
arXiv Detail & Related papers (2025-06-14T20:16:37Z) - Benchmarking LLM-based Relevance Judgment Methods [15.255877686845773]
Large Language Models (LLMs) are increasingly deployed in both academic and industry settings. We systematically compare multiple LLM-based relevance assessment methods, including binary relevance judgments, graded relevance assessments, pairwise preference-based methods, and two nugget-based evaluation methods. As part of our data release, we include relevance judgments generated by both an open-source (Llama3.2b) and a commercial (gpt-4o) model.
arXiv Detail & Related papers (2025-04-17T01:13:21Z) - Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric [99.56567010306807]
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications. One core challenge of evaluation in the LLM era is the generalization issue. We propose the Model Utilization Index (MUI), a mechanism-interpretability-enhanced metric that complements traditional performance scores.
arXiv Detail & Related papers (2025-04-10T04:09:47Z) - MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation [13.440594349043916]
We develop a Multilingual End-to-end Meta-Evaluation RAG benchmark (MEMERAG). Our benchmark builds on the popular MIRACL dataset, using native-language questions and generating responses with diverse large language models (LLMs). We show that our benchmark can reliably identify improvements offered by advanced prompting techniques and LLMs.
arXiv Detail & Related papers (2025-02-24T13:58:42Z) - Detecting Training Data of Large Language Models via Expectation Maximization [62.28028046993391]
We introduce EM-MIA, a novel membership inference method that iteratively refines membership scores and prefix scores via an expectation-maximization algorithm. EM-MIA achieves state-of-the-art results on WikiMIA.
arXiv Detail & Related papers (2024-10-10T03:31:16Z) - Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs).
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z) - AlignBench: Benchmarking Chinese Alignment of Large Language Models [99.24597941555277]
We introduce AlignBench, a comprehensive benchmark for evaluating Chinese Large Language Models' alignment.
We design a human-in-the-loop data curation pipeline, containing eight main categories, 683 real-scenario rooted queries and corresponding human verified references.
For automatic evaluation, our benchmark employs a rule-calibrated multi-dimensional LLM-as-Judge (Zheng et al., 2023) approach with Chain-of-Thought to generate explanations and final ratings.
arXiv Detail & Related papers (2023-11-30T17:41:30Z) - L-Eval: Instituting Standardized Evaluation for Long Context Language Models [91.05820785008527]
We propose L-Eval to institute a more standardized evaluation for long context language models (LCLMs).
We build a new evaluation suite containing 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs.
Results show that popular n-gram matching metrics generally can not correlate well with human judgment.
arXiv Detail & Related papers (2023-07-20T17:59:41Z) - KoLA: Carefully Benchmarking World Knowledge of Large Language Models [87.96683299084788]
We construct a Knowledge-oriented LLM Assessment benchmark (KoLA). We mimic human cognition to form a four-level taxonomy of knowledge-related abilities, covering 19 tasks.
We use both Wikipedia, a corpus on which LLMs are prevalently pre-trained, and continuously collected emerging corpora to evaluate the capacity to handle unseen data and evolving knowledge.
arXiv Detail & Related papers (2023-06-15T17:20:46Z) - Alibaba-Translate China's Submission for WMT 2022 Quality Estimation Shared Task [80.22825549235556]
We present our submission, named UniTE, to the sentence-level MQM benchmark at the Quality Estimation Shared Task.
Specifically, our systems employ the framework of UniTE, which combines three types of input formats during training with a pre-trained language model.
Results show that our models reach 1st overall ranking in the Multilingual and English-Russian settings, and 2nd overall ranking in English-German and Chinese-English settings.
arXiv Detail & Related papers (2022-10-18T08:55:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.