Related papers: GrandJury: A Collaborative Machine Learning Model Evaluation Protocol for Dynamic Quality Rubrics

GrandJury: A Collaborative Machine Learning Model Evaluation Protocol for Dynamic Quality Rubrics

URL: http://arxiv.org/abs/2508.02926v2
Date: Wed, 06 Aug 2025 19:57:38 GMT
Title: GrandJury: A Collaborative Machine Learning Model Evaluation Protocol for Dynamic Quality Rubrics
Authors: Arthur Cho,
Abstract summary: Generative Machine Learning models have become central to modern systems, powering applications in creative writing, summarization, multi-hop reasoning, and context-aware dialogue.<n>Standard evaluation regimes still rely on static, benchmark-style tests, incentivizing optimization toward leaderboard scores rather than alignment with dynamic user needs or evolving realities.<n>GrandJury introduces a formal evaluation protocol combining time-decayed aggregation, complete traceability, with the support of dynamic, transparent task attribution, and human judgment.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Generative Machine Learning models have become central to modern systems, powering applications in creative writing, summarization, multi-hop reasoning, and context-aware dialogue. These models underpin large-scale AI assistants, workflow automation, and autonomous decision-making. In such domains, acceptable response is rarely absolute or static, but plural and highly context-dependent. Yet standard evaluation regimes still rely on static, benchmark-style tests, incentivizing optimization toward leaderboard scores rather than alignment with dynamic user needs or evolving realities. GrandJury introduces a formal evaluation protocol combining time-decayed aggregation, complete traceability, with the support of dynamic, transparent task rubric attribution, and multi-rater human judgment. Together, these elements enable pluralistic, accountable evaluation that captures evolving consensus and surfaces disagreement. We provide an open-source implementation (grandjury PyPI package) and a public collection of Large Language Model (LLM) inference outputs to illustrate the need and method. GrandJury provides a new paradigm for AI practitioners when evaluating machine learning outputs without absolute ground truth.

Related papers

Adapting Vision-Language Models for Evaluating World Models [24.813041196394582]
We present UNIVERSE, a method for adapting Vision-language Evaluator for Rollouts in Simulated Environments under data and compute constraints.<n>We conduct a large-scale study comparing full, partial, and parameter-efficient finetuning across task formats, context lengths, sampling strategies, and data compositions.<n>The resulting unified evaluator matches the performance of task-specific baselines using a single checkpoint.
arXiv Detail & Related papers (2025-06-22T09:53:28Z)
Evolutionary Perspectives on the Evaluation of LLM-Based AI Agents: A Comprehensive Survey [45.485318955120924]
The transition from traditional large language models (LLMs) to more advanced AI agents represents a pivotal evolutionary step.<n>Existing evaluation frameworks often blur the distinctions between LLM chatbots and AI agents, leading to confusion among researchers selecting appropriate benchmarks.<n>This paper introduces a systematic analysis of current evaluation approaches, grounded in an evolutionary perspective.
arXiv Detail & Related papers (2025-06-06T17:52:18Z)
MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering [57.156093929365255]
Gym-style framework for systematically reinforcement learning, evaluating, and improving autonomous large language model (LLM) agents.<n>MLE-Dojo covers diverse, open-ended MLE tasks carefully curated to reflect realistic engineering scenarios.<n>Its fully executable environment supports comprehensive agent training via both supervised fine-tuning and reinforcement learning.
arXiv Detail & Related papers (2025-05-12T17:35:43Z)
Generalising from Self-Produced Data: Model Training Beyond Human Constraints [0.0]
This paper introduces a novel framework in which AI models autonomously generate and validate new knowledge.<n>Central to this approach is an unbounded, ungamable numeric reward that guides learning without requiring human benchmarks.
arXiv Detail & Related papers (2025-04-07T03:48:02Z)
MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs)<n>MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.<n>It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z)
Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges. Our model is trained on user queries and LLM-generated responses under massive real-world scenarios. Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z)
DIME: Fine-grained Interpretations of Multimodal Models via Disentangled Local Explanations [119.1953397679783]
We focus on advancing the state-of-the-art in interpreting multimodal models. Our proposed approach, DIME, enables accurate and fine-grained analysis of multimodal models.
arXiv Detail & Related papers (2022-03-03T20:52:47Z)
Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning. ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation. Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
RADDLE: An Evaluation Benchmark and Analysis Platform for Robust Task-oriented Dialog Systems [75.87418236410296]
We introduce the RADDLE benchmark, a collection of corpora and tools for evaluating the performance of models across a diverse set of domains. RADDLE is designed to favor and encourage models with a strong generalization ability. We evaluate recent state-of-the-art systems based on pre-training and fine-tuning, and find that grounded pre-training on heterogeneous dialog corpora performs better than training a separate model per domain.
arXiv Detail & Related papers (2020-12-29T08:58:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.