Related papers: JADE: Expert-Grounded Dynamic Evaluation for Open-Ended Professional Tasks

JADE: Expert-Grounded Dynamic Evaluation for Open-Ended Professional Tasks

URL: http://arxiv.org/abs/2602.06486v1
Date: Fri, 06 Feb 2026 08:26:09 GMT
Title: JADE: Expert-Grounded Dynamic Evaluation for Open-Ended Professional Tasks
Authors: Lanbo Lin, Jiayao Liu, Tianyuan Yang, Li Cai, Yuanwu Xu, Lei Wei, Sicong Xie, Guannan Zhang,
Abstract summary: We propose JADE, a two-layer evaluation framework for agentic AI.<n> Layer 1 encodes expert knowledge as a predefined set of evaluation skills.<n> Layer 2 performs report-specific, claim-level evaluation to flexibly assess diverse reasoning strategies.
Score: 14.14645345504797
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Evaluating agentic AI on open-ended professional tasks faces a fundamental dilemma between rigor and flexibility. Static rubrics provide rigorous, reproducible assessment but fail to accommodate diverse valid response strategies, while LLM-as-a-judge approaches adapt to individual responses yet suffer from instability and bias. Human experts address this dilemma by combining domain-grounded principles with dynamic, claim-level assessment. Inspired by this process, we propose JADE, a two-layer evaluation framework. Layer 1 encodes expert knowledge as a predefined set of evaluation skills, providing stable evaluation criteria. Layer 2 performs report-specific, claim-level evaluation to flexibly assess diverse reasoning strategies, with evidence-dependency gating to invalidate conclusions built on refuted claims. Experiments on BizBench show that JADE improves evaluation stability and reveals critical agent failure modes missed by holistic LLM-based evaluators. We further demonstrate strong alignment with expert-authored rubrics and effective transfer to a medical-domain benchmark, validating JADE across professional domains. Our code is publicly available at https://github.com/smiling-world/JADE.

Related papers

Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation [76.5533899503582]
Large language models (LLMs) are increasingly used as judges to evaluate agent performance.<n>We show this paradigm implicitly assumes that the agent's chain-of-thought (CoT) reasoning faithfully reflects both its internal reasoning and the underlying environment state.<n>We demonstrate that manipulated reasoning alone can inflate false positive rates of state-of-the-art VLM judges by up to 90% across 800 trajectories spanning diverse web tasks.
arXiv Detail & Related papers (2026-01-21T06:07:43Z)
Who Judges the Judge? LLM Jury-on-Demand: Building Trustworthy LLM Evaluation Systems [2.9141470183751674]
We propose a dynamic, learning-based framework for scalable and context-aware evaluation.<n>Our method trains a set of reliability predictors to assess when LLM judges will agree with human experts.<n> Experiments on summarization and RAG benchmarks show that our dynamic jury system achieves significantly higher correlation with human judgment than both single-judge and static-jury baselines.
arXiv Detail & Related papers (2025-12-01T15:26:20Z)
JudgeAgent: Knowledge-wise and Dynamic LLM Evaluation with Agent-as-Interviewer [19.09571232466437]
We propose Agent-as-Interviewer, a dynamic evaluation paradigm for large language models (LLMs)<n>Unlike current benchmarking or dynamic interaction paradigms, Agent-as-Interviewer utilizes agents to invoke knowledge tools for wider and deeper knowledge in the dynamic multi-turn question generation.<n>We develop JudgeAgent, a knowledge-wise dynamic evaluation framework that employs knowledge-driven synthesis as the agent's tool and uses difficulty scoring as strategy guidance.
arXiv Detail & Related papers (2025-09-02T08:52:16Z)
Expert Preference-based Evaluation of Automated Related Work Generation [54.29459509574242]
We propose GREP, a multi-turn evaluation framework that integrates classical related work evaluation criteria with expert-specific preferences.<n>For better accessibility, we design two variants of GREP: a more precise variant with proprietary LLMs as evaluators, and a cheaper alternative with open-weight LLMs.
arXiv Detail & Related papers (2025-08-11T13:08:07Z)
J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning [54.85131761693927]
We introduce J1, a reinforcement learning framework for teaching LLM judges to think before making decisions.<n>Our core contribution lies in converting all judgment tasks for non-verifiable and verifiable prompts into a unified format with verifiable rewards.<n>We then use RL to train thinking-judges at scales of 8B, 32B, and 70B and show that they obtain state-of-the-art performance.
arXiv Detail & Related papers (2025-05-15T14:05:15Z)
Multi-Agent LLM Judge: automatic personalized LLM judge design for evaluating natural language generation applications [0.0]
Large Language Models (LLMs) have demonstrated impressive performance across diverse domains, yet they still encounter challenges such as insufficient domain-specific knowledge, biases, and hallucinations.<n>Traditional evaluation methods, which rely on word overlap or text embeddings, are inadequate for capturing the nuanced semantic information necessary to evaluate dynamic, open-ended text generation.<n>We propose a novel dynamic multi-agent system that automatically designs personalized LLM judges for various natural language generation applications.
arXiv Detail & Related papers (2025-04-01T09:36:56Z)
DAFE: LLM-Based Evaluation Through Dynamic Arbitration for Free-Form Question-Answering [12.879551933541345]
We propose the Dynamic Arbitration Framework for Evaluation (DAFE) to evaluate large language models.<n>DAFE employs two primary LLM-as-judges and engages a third arbitrator only in cases of disagreements.<n>We show DAFE's ability to provide consistent, scalable, and resource-efficient assessments.
arXiv Detail & Related papers (2025-03-11T15:29:55Z)
EQUATOR: A Deterministic Framework for Evaluating LLM Reasoning with Open-Ended Questions. # v1.0.0-beta [2.1249213103048414]
We introduce the EQUATOR Evaluator, which combines deterministic scoring with a focus on factual accuracy and robust reasoning assessment.<n>Using a vector database, EQUATOR pairs open-ended questions with human-evaluated answers, enabling more precise and scalable evaluations.<n>Our results demonstrate that this framework significantly outperforms traditional multiple-choice evaluations while maintaining high accuracy standards.
arXiv Detail & Related papers (2024-12-31T03:56:17Z)
HREF: Human Response-Guided Evaluation of Instruction Following in Language Models [61.273153125847166]
We develop a new evaluation benchmark, Human Response-Guided Evaluation of Instruction Following (HREF)<n>In addition to providing reliable evaluation, HREF emphasizes individual task performance and is free from contamination.<n>We study the impact of key design choices in HREF, including the size of the evaluation set, the judge model, the baseline model, and the prompt template.
arXiv Detail & Related papers (2024-12-20T03:26:47Z)
HD-Eval: Aligning Large Language Model Evaluators Through Hierarchical Criteria Decomposition [92.17397504834825]
HD-Eval is a framework that iteratively aligns large language models evaluators with human preference. HD-Eval inherits the essence from the evaluation mindset of human experts and enhances the alignment of LLM-based evaluators. Extensive experiments on three evaluation domains demonstrate the superiority of HD-Eval in further aligning state-of-the-art evaluators.
arXiv Detail & Related papers (2024-02-24T08:01:32Z)
Exploring the Reliability of Large Language Models as Customized Evaluators for Diverse NLP Tasks [65.69651759036535]
We analyze whether large language models (LLMs) can serve as reliable alternatives to humans.<n>This paper explores both conventional tasks (e.g., story generation) and alignment tasks (e.g., math reasoning)<n>We find that LLM evaluators can generate unnecessary criteria or omit crucial criteria, resulting in a slight deviation from the experts.
arXiv Detail & Related papers (2023-10-30T17:04:35Z)

This list is automatically generated from the titles and abstracts of the papers in this site.