C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning
- URL: http://arxiv.org/abs/2603.05167v1
- Date: Thu, 05 Mar 2026 13:36:47 GMT
- Title: C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning
- Authors: Avni Mittal, Rauno Arike
- Abstract summary: Large language models (LLMs) are increasingly used as judges of chain-of-thought (CoT) reasoning. We introduce C2-Faith, a benchmark that targets two complementary dimensions of faithfulness: causality (does each step logically follow from prior context?) and coverage (are essential intermediate inferences present?). We evaluate three frontier judges under three tasks: binary causal detection, causal step localization, and coverage scoring.
- Score: 0.6138671548064355
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are increasingly used as judges of chain-of-thought (CoT) reasoning, but it remains unclear whether they can reliably assess process faithfulness rather than just answer plausibility. We introduce C2-Faith, a benchmark built from PRM800K that targets two complementary dimensions of faithfulness: causality (does each step logically follow from prior context?) and coverage (are essential intermediate inferences present?). Using controlled perturbations, we create examples with known causal error positions by replacing a single step with an acausal variant, and with controlled coverage deletions at varying deletion rates (scored against reference labels). We evaluate three frontier judges under three tasks: binary causal detection, causal step localization, and coverage scoring. The results show that model rankings depend strongly on task framing, with no single judge dominating all settings; all judges exhibit a substantial gap between detecting an error and localizing it; and coverage judgments are systematically inflated for incomplete reasoning. These findings clarify when LLM judges are dependable and where they fail, and provide practical guidance for selecting judges in process-level evaluation.
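The two perturbation families described in the abstract are easy to reproduce in outline. The following Python sketch is an illustrative reconstruction, not the authors' released code: the helper names `make_causal_example` and `make_coverage_example` are hypothetical, the step lists are assumed to come from PRM800K solutions, and the acausal variant is assumed to be produced separately (e.g., a rewritten step that does not follow from the prior context).

```python
import random

def make_causal_example(steps, acausal_variant, position=None):
    """Replace a single step with an acausal variant, recording the
    ground-truth error position for the localization task."""
    pos = position if position is not None else random.randrange(len(steps))
    perturbed = list(steps)
    perturbed[pos] = acausal_variant
    return perturbed, pos  # pos is the known causal-error position

def make_coverage_example(steps, deletion_rate, essential_idx):
    """Delete steps at a given rate; the reference label is the fraction
    of essential intermediate inferences that survive the deletion."""
    keep = {i for i in range(len(steps)) if random.random() >= deletion_rate}
    surviving = [steps[i] for i in sorted(keep)]
    reference_coverage = sum(i in keep for i in essential_idx) / max(len(essential_idx), 1)
    return surviving, reference_coverage
```

Because the error position and deletion rate are fixed by construction, every example carries an exact reference label, which is what allows the localization and coverage tasks to be scored objectively.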
Related papers
- The Judge Who Never Admits: Hidden Shortcuts in LLM-based Evaluation [17.386684382460242]
Large language models (LLMs) are increasingly used to evaluate system outputs in tasks such as reasoning, question answering, and creative writing. We test this ideal via controlled cue perturbations: synthetic metadata labels injected into evaluation prompts for six judge models. We study six cue families: source, temporal, age, gender, ethnicity, and educational status.
arXiv Detail & Related papers (2026-02-08T14:45:23Z)
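A minimal sketch of the cue-perturbation setup described above, assuming a 1-to-5 rating prompt; the template strings, `CUE_TEMPLATES`, and `build_judge_prompt` are hypothetical illustrations, not the paper's actual prompts.

```python
# Hypothetical cue templates, one per cue family from the abstract.
CUE_TEMPLATES = {
    "source":    "Source: {value}",        # e.g., "expert-written" vs. "AI-generated"
    "temporal":  "Written in: {value}",
    "age":       "Author age: {value}",
    "gender":    "Author gender: {value}",
    "ethnicity": "Author ethnicity: {value}",
    "education": "Author education: {value}",
}

def build_judge_prompt(response, cue_family=None, cue_value=None):
    """Return the evaluation prompt, optionally prefixed with a cue label."""
    header = ""
    if cue_family is not None:
        header = CUE_TEMPLATES[cue_family].format(value=cue_value) + "\n"
    return f"{header}Rate the following response from 1 to 5.\n\n{response}"

# A shortcut-sensitive judge scores these differently even though the
# response text is identical.
p_clean = build_judge_prompt("The answer is 42 because ...")
p_cued = build_judge_prompt("The answer is 42 because ...", "source", "AI-generated")
```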
- Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation [76.5533899503582]
Large language models (LLMs) are increasingly used as judges to evaluate agent performance. We show this paradigm implicitly assumes that the agent's chain-of-thought (CoT) reasoning faithfully reflects both its internal reasoning and the underlying environment state. We demonstrate that manipulated reasoning alone can inflate false positive rates of state-of-the-art VLM judges by up to 90% across 800 trajectories spanning diverse web tasks.
arXiv Detail & Related papers (2026-01-21T06:07:43Z)
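One way to read the 90% figure is as a gap in false positive rate between faithful and manipulated CoTs over the same failed trajectories. A hedged sketch of that metric follows; `judge`, the trajectory format, and `false_positive_rate` are illustrative assumptions, not the paper's code.

```python
def false_positive_rate(judge, trajectories):
    """Fraction of failed trajectories that the judge marks as successful.
    Each trajectory pairs a CoT string with a ground-truth `succeeded` flag."""
    failures = [t for t in trajectories if not t["succeeded"]]
    flagged = sum(1 for t in failures if judge(t["cot"]))
    return flagged / max(len(failures), 1)

# Comparing FPR on faithful vs. manipulated CoTs isolates the effect of the
# reasoning text, since the underlying environment states are identical:
#   inflation = false_positive_rate(judge, manipulated) \
#             - false_positive_rate(judge, faithful)
```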
- The Silent Judge: Unacknowledged Shortcut Bias in LLM-as-a-Judge [17.555073770285095]
Large language models (LLMs) are increasingly deployed as automatic judges to evaluate system outputs in tasks such as summarization, dialogue, and creative writing. We show that current LLM judges fail on both counts by relying on shortcuts introduced in the prompt.
arXiv Detail & Related papers (2025-09-30T10:48:08Z)
- Reference-Free Rating of LLM Responses via Latent Information [53.463883683503106]
We study the common practice of asking a judge model to assign Likert-scale scores to free-text responses. We then propose and evaluate Latent Judges, which derive scalar ratings from internal model signals. Across a broad suite of pairwise and single-rating benchmarks, latent methods match or surpass standard prompting.
arXiv Detail & Related papers (2025-09-29T12:15:52Z)
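The internal signals are not specified in this summary, but a common instantiation of a latent scalar rating is the probability-weighted Likert score computed from the judge's token distribution at the rating position. The sketch below illustrates that idea under that assumption; `latent_likert_score` is a hypothetical helper, not necessarily the paper's method.

```python
import math

def latent_likert_score(score_logprobs):
    """Given log-probabilities over the score tokens "1".."5" at the rating
    position, return the probability-weighted expected score instead of the
    single argmax token."""
    probs = {s: math.exp(lp) for s, lp in score_logprobs.items()}
    z = sum(probs.values())  # renormalize over the five score tokens
    return sum(int(s) * p / z for s, p in probs.items())

# A judge whose argmax token is "4" but whose expected score is about 3.5:
print(latent_likert_score({"1": -6.0, "2": -3.0, "3": -0.9, "4": -0.8, "5": -4.0}))
```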
- TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them [58.04324690859212]
The use of Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks. We identify two fundamental types of inconsistencies: Score-Comparison Inconsistency and Pairwise Transitivity Inconsistency. We propose TrustJudge, a probabilistic framework that addresses these limitations through two key innovations.
arXiv Detail & Related papers (2025-09-25T13:04:29Z)
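Pairwise Transitivity Inconsistency, at least, is straightforward to measure: count preference cycles among triples of candidates. A small sketch, with `transitivity_violations` and the preference encoding as illustrative assumptions:

```python
from itertools import permutations

def transitivity_violations(prefer):
    """Count cyclic triples A>B, B>C, C>A in a judge's pairwise preferences.
    `prefer` maps an ordered pair (a, b) to True iff the judge prefers a."""
    items = sorted({x for pair in prefer for x in pair})
    cycles = 0
    for a, b, c in permutations(items, 3):
        if prefer.get((a, b)) and prefer.get((b, c)) and prefer.get((c, a)):
            cycles += 1
    return cycles // 3  # each directed 3-cycle is counted once per rotation

# An intransitive judge: prefers A over B, B over C, but C over A.
judgments = {("A", "B"): True, ("B", "C"): True, ("C", "A"): True,
             ("B", "A"): False, ("C", "B"): False, ("A", "C"): False}
print(transitivity_violations(judgments))  # 1
```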
- When Judgment Becomes Noise: How Design Failures in LLM Judge Benchmarks Silently Undermine Validity [21.192000569821943]
We argue that without tight objectives and verifiable constructions, benchmarks can produce high-confidence rankings that are in fact largely noise. We show that the Elo-style aggregation used by Arena-Hard Auto collapses and masks genuine ranking uncertainty. Our results highlight design failures that undermine validity and offer actionable principles for building better-scoped, reliability-aware benchmarks.
arXiv Detail & Related papers (2025-09-24T16:26:47Z)
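To see how an Elo-style point estimate can hide ranking uncertainty, consider feeding a standard Elo update pure coin-flip judgments; the update rule below is the textbook formula, not necessarily Arena-Hard Auto's exact aggregation.

```python
import random

def elo_update(r_a, r_b, a_wins, k=32):
    """One standard Elo update from a single pairwise judgment."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    delta = k * ((1.0 if a_wins else 0.0) - expected_a)
    return r_a + delta, r_b - delta

# Aggregating pairwise judgments into point ratings always yields a total
# order, even when the judgments themselves carry no reliable signal.
ratings = {"sys_a": 1000.0, "sys_b": 1000.0}
for _ in range(200):
    coin_flip = random.random() < 0.5  # a pure-noise "judge"
    ratings["sys_a"], ratings["sys_b"] = elo_update(
        ratings["sys_a"], ratings["sys_b"], coin_flip)
print(ratings)  # a confident-looking gap can emerge from noise alone
```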
- CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards [72.44810390478229]
CompassJudger-2 is a novel generalist judge model that overcomes the limitations of prior judge models via a task-driven, multi-domain data curation strategy. CompassJudger-2 achieves superior results across multiple judge and reward benchmarks.
arXiv Detail & Related papers (2025-07-12T01:34:24Z)
- Judging LLMs on a Simplex [2.088672652658465]
A common practice is to use large language models (LLMs) themselves as judges, but the theoretical properties of this approach are not yet well understood. We show that a geometric framework that represents both judges and candidates as points on a probability simplex can provide helpful insight on what is or is not identifiable.
arXiv Detail & Related papers (2025-05-28T04:50:41Z)
- JudgeRank: Leveraging Large Language Models for Reasoning-Intensive Reranking [81.88787401178378]
We introduce JudgeRank, a novel agentic reranker that emulates human cognitive processes when assessing document relevance.
We evaluate JudgeRank on the reasoning-intensive BRIGHT benchmark, demonstrating substantial performance improvements over first-stage retrieval methods.
In addition, JudgeRank performs on par with fine-tuned state-of-the-art rerankers on the popular BEIR benchmark, validating its zero-shot generalization capability.
arXiv Detail & Related papers (2024-10-31T18:43:12Z)
- Making Large Language Models Better Reasoners with Alignment [57.82176656663245]
Reasoning is a cognitive process of using evidence to reach a sound conclusion.
Recent studies reveal that fine-tuning LLMs on data with the chain-of-thought (CoT) reasoning process can significantly enhance their reasoning capabilities.
We introduce an Alignment Fine-Tuning (AFT) paradigm, which involves three steps.
arXiv Detail & Related papers (2023-09-05T11:32:48Z)