C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning
- URL: http://arxiv.org/abs/2603.05167v1
- Date: Thu, 05 Mar 2026 13:36:47 GMT
- Title: C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning
- Authors: Avni Mittal, Rauno Arike
- Abstract summary: Large language models (LLMs) are increasingly used as judges of chain-of-thought (CoT) reasoning. We introduce C2-Faith, a benchmark that targets two complementary dimensions of faithfulness: causality (does each step logically follow from prior context?) and coverage (are essential intermediate inferences present?). We evaluate three frontier judges under three tasks: binary causal detection, causal step localization, and coverage scoring.
- Score: 0.6138671548064355
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are increasingly used as judges of chain-of-thought (CoT) reasoning, but it remains unclear whether they can reliably assess process faithfulness rather than just answer plausibility. We introduce C2-Faith, a benchmark built from PRM800K that targets two complementary dimensions of faithfulness: causality (does each step logically follow from prior context?) and coverage (are essential intermediate inferences present?). Using controlled perturbations, we create examples with known causal error positions by replacing a single step with an acausal variant, and with controlled coverage deletions at varying deletion rates (scored against reference labels). We evaluate three frontier judges under three tasks: binary causal detection, causal step localization, and coverage scoring. The results show that model rankings depend strongly on task framing, with no single judge dominating all settings; all judges exhibit a substantial gap between detecting an error and localizing it; and coverage judgments are systematically inflated for incomplete reasoning. These findings clarify when LLM judges are dependable and where they fail, and provide practical guidance for selecting judges in process-level evaluation.
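The two perturbation families described in the abstract are easy to reproduce in outline. The following Python sketch is an illustrative reconstruction, not the authors' released code: the helper names `make_causal_example` and `make_coverage_example` are hypothetical, the step lists are assumed to come from PRM800K solutions, and the acausal variant is assumed to be produced separately (e.g., a rewritten step that does not follow from the prior context).

```python
import random

def make_causal_example(steps, acausal_variant, position=None):
    """Replace a single step with an acausal variant, recording the
    ground-truth error position for the localization task."""
    pos = position if position is not None else random.randrange(len(steps))
    perturbed = list(steps)
    perturbed[pos] = acausal_variant
    return perturbed, pos  # pos is the known causal-error position

def make_coverage_example(steps, deletion_rate, essential_idx):
    """Delete steps at a given rate; the reference label is the fraction
    of essential intermediate inferences that survive the deletion."""
    keep = {i for i in range(len(steps)) if random.random() >= deletion_rate}
    surviving = [steps[i] for i in sorted(keep)]
    reference_coverage = sum(i in keep for i in essential_idx) / max(len(essential_idx), 1)
    return surviving, reference_coverage
```

Because the error position and deletion rate are fixed by construction, every example carries an exact reference label, which is what allows the localization and coverage tasks to be scored objectively.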
Related papers
- The Judge Who Never Admits: Hidden Shortcuts in LLM-based Evaluation [17.386684382460242]
Large language models (LLMs) are increasingly used to evaluate system outputs in tasks such as reasoning, question answering, and creative writing. We test this ideal via controlled cue perturbations: synthetic metadata labels injected into evaluation prompts for six judge models. We study six cue families: source, temporal, age, gender, ethnicity, and educational status.
arXiv Detail & Related papers (2026-02-08T14:45:23Z)
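A minimal sketch of the cue-perturbation setup described above, assuming a 1-to-5 rating prompt; the template strings, `CUE_TEMPLATES`, and `build_judge_prompt` are hypothetical illustrations, not the paper's actual prompts.

```python
# Hypothetical cue templates, one per cue family from the abstract.
CUE_TEMPLATES = {
    "source":    "Source: {value}",        # e.g., "expert-written" vs. "AI-generated"
    "temporal":  "Written in: {value}",
    "age":       "Author age: {value}",
    "gender":    "Author gender: {value}",
    "ethnicity": "Author ethnicity: {value}",
    "education": "Author education: {value}",
}

def build_judge_prompt(response, cue_family=None, cue_value=None):
    """Return the evaluation prompt, optionally prefixed with a cue label."""
    header = ""
    if cue_family is not None:
        header = CUE_TEMPLATES[cue_family].format(value=cue_value) + "\n"
    return f"{header}Rate the following response from 1 to 5.\n\n{response}"

# A shortcut-sensitive judge scores these differently even though the
# response text is identical.
p_clean = build_judge_prompt("The answer is 42 because ...")
p_cued = build_judge_prompt("The answer is 42 because ...", "source", "AI-generated")
```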
- Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation [76.5533899503582]
Large language models (LLMs) are increasingly used as judges to evaluate agent performance. We show this paradigm implicitly assumes that the agent's chain-of-thought (CoT) reasoning faithfully reflects both its internal reasoning and the underlying environment state. We demonstrate that manipulated reasoning alone can inflate false positive rates of state-of-the-art VLM judges by up to 90% across 800 trajectories spanning diverse web tasks.
arXiv Detail & Related papers (2026-01-21T06:07:43Z)
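One way to read the 90% figure is as a gap in false positive rate between faithful and manipulated CoTs over the same failed trajectories. A hedged sketch of that metric follows; `judge`, the trajectory format, and `false_positive_rate` are illustrative assumptions, not the paper's code.

```python
def false_positive_rate(judge, trajectories):
    """Fraction of failed trajectories that the judge marks as successful.
    Each trajectory pairs a CoT string with a ground-truth `succeeded` flag."""
    failures = [t for t in trajectories if not t["succeeded"]]
    flagged = sum(1 for t in failures if judge(t["cot"]))
    return flagged / max(len(failures), 1)

# Comparing FPR on faithful vs. manipulated CoTs isolates the effect of the
# reasoning text, since the underlying environment states are identical:
#   inflation = false_positive_rate(judge, manipulated) \
#             - false_positive_rate(judge, faithful)
```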
- The Silent Judge: Unacknowledged Shortcut Bias in LLM-as-a-Judge [17.555073770285095]
Large language models (LLMs) are increasingly deployed as automatic judges to evaluate system outputs in tasks such as summarization, dialogue, and creative writing. We show that current LLM judges fail on both counts by relying on shortcuts introduced in the prompt.
arXiv Detail & Related papers (2025-09-30T10:48:08Z)
- Reference-Free Rating of LLM Responses via Latent Information [53.463883683503106]
We study the common practice of asking a judge model to assign Likert-scale scores to free-text responses. We then propose and evaluate Latent Judges, which derive scalar ratings from internal model signals. Across a broad suite of pairwise and single-rating benchmarks, latent methods match or surpass standard prompting.
arXiv Detail & Related papers (2025-09-29T12:15:52Z)
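The internal signals are not specified in this summary, but a common instantiation of a latent scalar rating is the probability-weighted Likert score computed from the judge's token distribution at the rating position. The sketch below illustrates that idea under that assumption; `latent_likert_score` is a hypothetical helper, not necessarily the paper's method.

```python
import math

def latent_likert_score(score_logprobs):
    """Given log-probabilities over the score tokens "1".."5" at the rating
    position, return the probability-weighted expected score instead of the
    single argmax token."""
    probs = {s: math.exp(lp) for s, lp in score_logprobs.items()}
    z = sum(probs.values())  # renormalize over the five score tokens
    return sum(int(s) * p / z for s, p in probs.items())

# A judge whose argmax token is "4" but whose expected score is about 3.5:
print(latent_likert_score({"1": -6.0, "2": -3.0, "3": -0.9, "4": -0.8, "5": -4.0}))
```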
- TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them [58.04324690859212]
The use of Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks. We identify two fundamental types of inconsistencies: Score-Comparison Inconsistency and Pairwise Transitivity Inconsistency. We propose TrustJudge, a probabilistic framework that addresses these limitations through two key innovations.
arXiv Detail & Related papers (2025-09-25T13:04:29Z)
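Pairwise Transitivity Inconsistency, at least, is straightforward to measure: count preference cycles among triples of candidates. A small sketch, with `transitivity_violations` and the preference encoding as illustrative assumptions:

```python
from itertools import permutations

def transitivity_violations(prefer):
    """Count cyclic triples A>B, B>C, C>A in a judge's pairwise preferences.
    `prefer` maps an ordered pair (a, b) to True iff the judge prefers a."""
    items = sorted({x for pair in prefer for x in pair})
    cycles = 0
    for a, b, c in permutations(items, 3):
        if prefer.get((a, b)) and prefer.get((b, c)) and prefer.get((c, a)):
            cycles += 1
    return cycles // 3  # each directed 3-cycle is counted once per rotation

# An intransitive judge: prefers A over B, B over C, but C over A.
judgments = {("A", "B"): True, ("B", "C"): True, ("C", "A"): True,
             ("B", "A"): False, ("C", "B"): False, ("A", "C"): False}
print(transitivity_violations(judgments))  # 1
```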
- When Judgment Becomes Noise: How Design Failures in LLM Judge Benchmarks Silently Undermine Validity [21.192000569821943]
We argue that without tight objectives and verifiable constructions, benchmarks can produce high-confidence rankings that are in fact largely noise. We show that the Elo-style aggregation used by Arena-Hard Auto collapses and masks genuine ranking uncertainty. Our results highlight design failures that undermine validity and offer actionable principles for building better-scoped, reliability-aware benchmarks.
arXiv Detail & Related papers (2025-09-24T16:26:47Z)
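To see how an Elo-style point estimate can hide ranking uncertainty, consider feeding a standard Elo update pure coin-flip judgments; the update rule below is the textbook formula, not necessarily Arena-Hard Auto's exact aggregation.

```python
import random

def elo_update(r_a, r_b, a_wins, k=32):
    """One standard Elo update from a single pairwise judgment."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    delta = k * ((1.0 if a_wins else 0.0) - expected_a)
    return r_a + delta, r_b - delta

# Aggregating pairwise judgments into point ratings always yields a total
# order, even when the judgments themselves carry no reliable signal.
ratings = {"sys_a": 1000.0, "sys_b": 1000.0}
for _ in range(200):
    coin_flip = random.random() < 0.5  # a pure-noise "judge"
    ratings["sys_a"], ratings["sys_b"] = elo_update(
        ratings["sys_a"], ratings["sys_b"], coin_flip)
print(ratings)  # a confident-looking gap can emerge from noise alone
```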
- CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards [72.44810390478229]
CompassJudger-2 is a novel generalist judge model that overcomes the limitations of prior judge models via a task-driven, multi-domain data curation strategy. CompassJudger-2 achieves superior results across multiple judge and reward benchmarks.
arXiv Detail & Related papers (2025-07-12T01:34:24Z)
- Judging LLMs on a Simplex [2.088672652658465]
A common practice is to use large language models (LLMs) themselves as judges, but the theoretical properties of this approach are not yet well understood. We show that a geometric framework that represents both judges and candidates as points on a probability simplex can provide helpful insight on what is or is not identifiable.
arXiv Detail & Related papers (2025-05-28T04:50:41Z)
- JudgeRank: Leveraging Large Language Models for Reasoning-Intensive Reranking [81.88787401178378]
We introduce JudgeRank, a novel agentic reranker that emulates human cognitive processes when assessing document relevance.
We evaluate JudgeRank on the reasoning-intensive BRIGHT benchmark, demonstrating substantial performance improvements over first-stage retrieval methods.
In addition, JudgeRank performs on par with fine-tuned state-of-the-art rerankers on the popular BEIR benchmark, validating its zero-shot generalization capability.
arXiv Detail & Related papers (2024-10-31T18:43:12Z)
- Making Large Language Models Better Reasoners with Alignment [57.82176656663245]
Reasoning is a cognitive process of using evidence to reach a sound conclusion.
Recent studies reveal that fine-tuning LLMs on data with the chain-of-thought (CoT) reasoning process can significantly enhance their reasoning capabilities.
We introduce an Alignment Fine-Tuning (AFT) paradigm, which involves three steps.
arXiv Detail & Related papers (2023-09-05T11:32:48Z)