LogicScore: Fine-grained Logic Evaluation of Conciseness, Completeness, and Determinateness in Attributed Question Answering
- URL: http://arxiv.org/abs/2601.15050v2
- Date: Thu, 22 Jan 2026 02:56:54 GMT
- Title: LogicScore: Fine-grained Logic Evaluation of Conciseness, Completeness, and Determinateness in Attributed Question Answering
- Authors: Zhichao Yan, Yunxiao Zhao, Jiapu Wang, Jiaoyan Chen, Shaoru Guo, Xiaoli Li, Ru Li, Jeff Z. Pan
- Abstract summary: We present LogicScore, a unified evaluation framework that shifts the paradigm from local assessment to global reasoning scrutiny. Our approach integrates a backward verification mechanism to evaluate three key reasoning dimensions: Completeness (logically sound deduction), Conciseness (non-redundancy), and Determinateness (consistent answer entailment).
- Score: 29.294167109756042
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current evaluation methods for Attributed Question Answering (AQA) suffer from "attribution myopia": they emphasize verification of isolated statements and their attributions but overlook the global logical integrity of long-form answers. Consequently, Large Language Models (LLMs) often produce factually grounded yet logically incoherent responses with elusive deductive gaps. To mitigate this limitation, we present LogicScore, a unified evaluation framework that shifts the paradigm from local assessment to global reasoning scrutiny. Grounded in Horn Rules, our approach integrates a backward verification mechanism to systematically evaluate three key reasoning dimensions: Completeness (logically sound deduction), Conciseness (non-redundancy), and Determinateness (consistent answer entailment). Extensive experiments across three multi-hop QA datasets (HotpotQA, MuSiQue, and 2WikiMultiHopQA) and over 20 LLMs (including GPT-5, Gemini-3-Pro, LLaMA3, and task-specific tuned models) reveal a critical capability gap: leading models often achieve high attribution scores (e.g., 92.85% precision for Gemini-3-Pro) but struggle with global reasoning quality (e.g., 35.11% Conciseness for Gemini-3-Pro). Our work establishes a robust standard for logical evaluation, highlighting the need to prioritize reasoning coherence alongside factual grounding in LLM development. Code is available at: https://github.com/zhichaoyan11/LogicScore.
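The authors' actual metric lives in the repository linked above; purely as a flavor of the backward-verification idea, here is a toy Python sketch that treats a reasoning chain as Horn clauses (premises entailing a conclusion) and scores the three dimensions. The `Step` representation and the scoring rules are simplifying assumptions, not the paper's implementation.

```python
# Toy sketch only: backward verification over Horn-style steps.
# Not the authors' released implementation (see the GitHub link above).
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    premises: frozenset   # body of the Horn clause: facts the step relies on
    conclusion: str       # head of the Horn clause: the fact it derives

def logic_scores(facts: set, steps: list, answer: str):
    # Completeness: every premise is a given fact or an earlier conclusion,
    # and the stated answer is eventually derived.
    known, sound = set(facts), True
    for s in steps:
        sound = sound and s.premises <= known
        known.add(s.conclusion)
    completeness = 1.0 if sound and answer in known else 0.0

    # Conciseness: a backward pass from the answer marks the steps that are
    # actually needed; steps it never reaches are redundant.
    needed, frontier = set(), {answer}
    for s in reversed(steps):
        if s.conclusion in frontier:
            needed.add(s)
            frontier |= s.premises
    conciseness = len(needed) / len(steps) if steps else 1.0

    # Determinateness (toy proxy): the chain derives the stated answer
    # exactly once, with no competing derivation.
    determinate = sum(s.conclusion == answer for s in steps) == 1
    return completeness, conciseness, 1.0 if determinate else 0.0

steps = [
    Step(frozenset({"A", "B"}), "C"),  # needed
    Step(frozenset({"A"}), "D"),       # redundant: never used downstream
    Step(frozenset({"C"}), "ans"),     # needed
]
print(logic_scores({"A", "B"}, steps, "ans"))  # (1.0, 0.666..., 1.0)
```

In this toy run the redundant step D leaves Completeness and Determinateness at 1.0 but pulls Conciseness down to 2/3, which mirrors the factually-grounded-yet-verbose failure mode the abstract describes.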
Related papers
- Multimodal Fact-Level Attribution for Verifiable Reasoning [80.60864342985748]
Multimodal large language models (MLLMs) are increasingly used for real-world tasks involving multi-step reasoning and long-form generation. Existing multimodal grounding benchmarks and evaluation methods fail to assess attribution in complex multimodal reasoning. We introduce MuRGAt, a benchmark for evaluating fact-level multimodal attribution in settings that require reasoning beyond direct observation.
arXiv Detail & Related papers (2026-02-12T03:10:02Z) - VERGE: Formal Refinement and Guidance Engine for Verifiable LLM Reasoning [4.3414302048068745]
We present a neurosymbolic framework that combines Large Language Models with SMT solvers to produce verification-guided answers. We introduce three key innovations: (1) multi-model consensus via formal semantic equivalence checking, (2) semantic routing that directs different claim types to appropriate verification strategies, and (3) precise logical error localization via Minimal Correction Subsets. With the GPT-OSS-120B model, VERGE demonstrates an average performance uplift of 18.7% at convergence across a set of reasoning benchmarks compared to single-pass approaches.
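As a flavor of the equivalence-checking innovation, the sketch below uses the Z3 SMT solver (`pip install z3-solver`) to test whether two already-formalized claims are semantically equivalent; the prompt-to-formula step that VERGE automates with LLMs is assumed away here.

```python
# Minimal equivalence check with Z3, assuming claims are already formulas.
from z3 import Bool, Implies, And, Not, Solver, unsat

def semantically_equivalent(f, g) -> bool:
    """Two formalized claims agree iff no assignment separates them."""
    s = Solver()
    s.add(f != g)  # satisfiable only if f and g can disagree
    return s.check() == unsat

rain, wet = Bool("rain"), Bool("wet")
claim_a = Implies(rain, wet)        # "if it rains, the ground is wet"
claim_b = Not(And(rain, Not(wet)))  # logically identical phrasing
print(semantically_equivalent(claim_a, claim_b))  # True
print(semantically_equivalent(claim_a, wet))      # False
```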
arXiv Detail & Related papers (2026-01-27T20:59:11Z) - Advancing Academic Chatbots: Evaluation of Non Traditional Outputs [3.0969191504482243]
This research focuses on evaluating whether large language models can generate high-quality non-traditional academic outputs. We implement a prototype combining Meta's open-weight LLaMA 3 70B and OpenAI's API-based GPT-4o mini. GPT-4o mini again performed best, though LLaMA 3 showed promise in narrative coherence.
arXiv Detail & Related papers (2025-11-30T17:25:23Z) - RefineBench: Evaluating Refinement Capability of Language Models via Checklists [71.02281792867531]
We evaluate two refinement modes: guided refinement and self-refinement. In guided refinement, both proprietary LMs and large open-weight LMs can leverage targeted feedback to refine responses to near-perfect levels within five turns. These findings suggest that frontier LMs require breakthroughs to self-refine their incorrect responses.
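A minimal sketch of the guided-refinement loop the benchmark measures, with hypothetical `check` and `revise` stubs standing in for the checklist judge and the language model; only the up-to-five-turns control flow comes from the summary above.

```python
# Sketch of guided refinement: revise until the checklist passes or
# five turns elapse. `check` and `revise` are illustrative stubs.
def check(response: str, checklist: list) -> list:
    """Return checklist items the response still fails (toy substring test)."""
    return [item for item in checklist if item.lower() not in response.lower()]

def revise(response: str, feedback: list) -> str:
    """Stand-in for an LM call that rewrites the response given feedback."""
    return response + " It also covers: " + ", ".join(feedback) + "."

def guided_refine(draft: str, checklist: list, max_turns: int = 5) -> str:
    response = draft
    for _ in range(max_turns):
        failed = check(response, checklist)
        if not failed:                       # every checklist item satisfied
            break
        response = revise(response, failed)  # targeted feedback in context
    return response

print(guided_refine("The proof uses induction.", ["induction", "base case"]))
```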
arXiv Detail & Related papers (2025-11-27T07:20:52Z) - In-Token Rationality Optimization: Towards Accurate and Concise LLM Reasoning via Self-Feedback [38.915062716409686]
InTRO is a new framework that enables both token-level exploration and self-feedback for accurate and concise reasoning.<n>InTRO consistently outperforms other baselines, raising solution accuracy by up to 20% relative to the base model.<n>Its chains of thought are notably more concise, exhibiting reduced verbosity.
arXiv Detail & Related papers (2025-11-13T01:47:06Z) - Rethinking Reward Models for Multi-Domain Test-Time Scaling [91.76069784586149]
Prior work generally assumes that process reward models (PRMs) outperform outcome reward models (ORMs) that assess only the final answer. We present the first unified evaluation of four reward model variants across 14 diverse domains, and our results challenge that assumption. We attribute the gap to PRM-style stepwise scoring, which inherits label noise from LLM auto-labeling and has difficulty evaluating long reasoning trajectories.
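To make the PRM/ORM contrast concrete, here is a minimal sketch with a stubbed `reward` function standing in for a trained reward model; the aggregation choices are illustrative assumptions, not the paper's exact setup.

```python
# Contrast of the two scoring styles; `reward` is a hypothetical stub.
import math

def reward(text: str) -> float:
    """Stand-in for a learned reward model's scalar score."""
    return min(1.0, len(text) / 40)

def orm_score(steps: list) -> float:
    # Outcome reward model: judge only the final answer.
    return reward(steps[-1])

def prm_score(steps: list) -> float:
    # Process reward model: aggregate per-step scores (geometric mean here).
    # One noisy step label drags the whole trajectory down, which is the
    # weakness the paper identifies on long reasoning chains.
    return math.prod(reward(s) for s in steps) ** (1 / len(steps))

trajectory = ["Let x be the unknown.", "Then 2x = 10, so x = 5.", "Answer: 5"]
print(orm_score(trajectory), prm_score(trajectory))
```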
arXiv Detail & Related papers (2025-10-01T04:21:14Z) - Stop Spinning Wheels: Mitigating LLM Overthinking via Mining Patterns for Early Reasoning Exit [114.83867400179354]
Overthinking can degrade the overall performance of large language models. We categorize reasoning into three stages: insufficient exploration, compensatory reasoning, and reasoning convergence. We develop a lightweight rule-based thresholding strategy to improve reasoning accuracy.
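As a toy version of such a rule-based exit, the sketch below stops once the intermediate answer has been stable for a few consecutive steps; the `patience` threshold and the assumption that a per-step answer can be read off are illustrative, not the paper's mined patterns.

```python
# Toy early-exit rule: stop when the intermediate answer stops changing.
def early_exit_answer(step_answers: list, patience: int = 3):
    """Return the answer and the number of steps actually consumed."""
    stable = 0
    for i, ans in enumerate(step_answers):
        stable = stable + 1 if i > 0 and ans == step_answers[i - 1] else 1
        if stable >= patience:   # reasoning has converged: exit early
            return ans, i + 1
    return step_answers[-1], len(step_answers)

# The model keeps "spinning wheels" on 42 for several extra steps:
trace = ["17", "42", "42", "42", "42", "42", "42"]
print(early_exit_answer(trace))  # ('42', 4): exits well before step 7
```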
arXiv Detail & Related papers (2025-08-25T03:17:17Z) - AURA: A Fine-Grained Benchmark and Decomposed Metric for Audio-Visual Reasoning [3.949628618389608]
AURA is a benchmark for evaluating the cross-modal reasoning capabilities of Audio-Visual Large Language Models (AV-LLMs) and Omni-modal Language Models (OLMs). AURA includes questions across six challenging cognitive domains, such as causality, timbre and pitch, tempo and AV synchronization, unanswerability, implicit distractions, and skill profiling. We propose a novel metric, AuraScore, which addresses the lack of robust tools for evaluating reasoning fidelity.
arXiv Detail & Related papers (2025-08-10T20:06:42Z) - Dissecting Logical Reasoning in LLMs: A Fine-Grained Evaluation and Supervision Study [40.143148197878354]
We introduce FineLogic, a fine-grained evaluation framework that assesses logical reasoning across three dimensions. We study how different supervision formats in fine-tuning shape reasoning abilities. We find a key trade-off: natural language supervision excels at generalization, whereas symbolic supervision is superior at instilling structurally sound, atomic reasoning steps.
arXiv Detail & Related papers (2025-06-05T09:34:12Z) - A Controllable Examination for Long-Context Language Models [62.845852724511964]
This study introduces LongBioBench, a benchmark for evaluating long-context language models. We show that most models still exhibit deficiencies in semantic understanding and elementary reasoning over retrieved results. Our further analysis points to questionable design choices in existing synthetic benchmarks, such as contextual non-coherence.
arXiv Detail & Related papers (2025-06-03T14:23:06Z) - Don't Take the Premise for Granted: Evaluating the Premise Critique Ability of Large Language Models [11.379764847748378]
Large language models (LLMs) often uncritically accept flawed or contradictory premises, leading to inefficient reasoning and unreliable outputs. This emphasizes the significance of the Premise Critique Ability for LLMs, defined as the capacity to proactively identify and articulate errors in input premises. We introduce the Premise Critique Bench (PCBench), designed by incorporating four error types across three difficulty levels, paired with multi-faceted evaluation metrics.
arXiv Detail & Related papers (2025-05-29T17:49:44Z) - Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models [86.88657425848547]
Large reasoning models (LRMs) already possess a latent capacity for long chain-of-thought reasoning. We explicitly align models with three meta-abilities: deduction, induction, and abduction, using automatically generated, self-verifiable tasks. Our three-stage pipeline (individual alignment, parameter-space merging, and domain-specific reinforcement learning) boosts performance by over 10% relative to instruction-tuned baselines.
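Of the three stages, parameter-space merging is the easiest to sketch; assuming three already-aligned checkpoints with identical architectures, a weighted average of their state dicts (a standard merging recipe, not necessarily the paper's exact one) looks like this in PyTorch.

```python
# Sketch of parameter-space merging via weighted state-dict averaging.
import torch

def merge_state_dicts(state_dicts, weights):
    """Weighted average of parameter tensors with identical keys."""
    assert abs(sum(weights) - 1.0) < 1e-6
    return {key: sum(w * sd[key] for w, sd in zip(weights, state_dicts))
            for key in state_dicts[0]}

# Tiny linear layers stand in for the deduction/induction/abduction models.
models = [torch.nn.Linear(4, 2) for _ in range(3)]
merged = merge_state_dicts([m.state_dict() for m in models], [0.4, 0.3, 0.3])
target = torch.nn.Linear(4, 2)
target.load_state_dict(merged)   # merged model, ready for the RL stage
```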
arXiv Detail & Related papers (2025-05-15T17:58:33Z) - Localizing Factual Inconsistencies in Attributable Text Generation [74.11403803488643]
We introduce QASemConsistency, a new formalism for localizing factual inconsistencies in attributable text generation. We show that QASemConsistency yields factual consistency scores that correlate well with human judgments.
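Schematically, the formalism decomposes generated text into minimal question-answer units and checks each against the source; in the sketch below, `gen_qa_pairs` and `answer` are hypothetical stubs for the QA-generation and QA models it relies on.

```python
# Sketch of QA-based localization; both helpers are illustrative stubs.
def gen_qa_pairs(generated_text: str) -> list:
    """Decompose generated text into (question, asserted answer) units."""
    return [("Who founded the company?", "Alice"),
            ("What year was it founded?", "1998")]

def answer(question: str, source: str) -> str:
    """Stand-in for a QA model reading the attributed source."""
    return "Alice" if question.startswith("Who") else "2001"

def localize_inconsistencies(generated_text: str, source: str) -> list:
    """Flag the minimal QA units whose asserted answers the source contradicts."""
    flagged = []
    for q, asserted in gen_qa_pairs(generated_text):
        supported = answer(q, source)
        if supported != asserted:
            flagged.append((q, asserted, supported))
    return flagged

print(localize_inconsistencies("Alice founded the company in 1998.", "..."))
# [('What year was it founded?', '1998', '2001')]: inconsistency localized
```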
arXiv Detail & Related papers (2024-10-09T22:53:48Z) - Making Large Language Models Better Reasoners with Alignment [57.82176656663245]
Reasoning is a cognitive process of using evidence to reach a sound conclusion.
Recent studies reveal that fine-tuning LLMs on data with the chain-of-thought (CoT) reasoning process can significantly enhance their reasoning capabilities.
We introduce an Alignment Fine-Tuning (AFT) paradigm, which involves three steps.
arXiv Detail & Related papers (2023-09-05T11:32:48Z) - Ladder-of-Thought: Using Knowledge as Steps to Elevate Stance Detection [73.31406286956535]
We introduce the Ladder-of-Thought (LoT) for the stance detection task.
LoT directs small LMs to assimilate high-quality external knowledge, refining the intermediate rationales they produce.
Our empirical evaluations underscore LoT's efficacy, marking a 16% improvement over GPT-3.5 and a 10% enhancement compared to GPT-3.5 with CoT on the stance detection task.
arXiv Detail & Related papers (2023-08-31T14:31:48Z) - Self-Evaluation Guided Beam Search for Reasoning [61.523627290397556]
We introduce a stepwise self-evaluation mechanism to guide and calibrate the reasoning process of Large Language Models (LLMs).
We propose a decoding algorithm integrating the self-evaluation guidance via beam search.
Our approach surpasses the corresponding Codex-backboned baselines in few-shot accuracy by 6.34%, 9.56%, and 5.46% on GSM8K, AQuA, and StrategyQA, respectively.
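A minimal sketch of such decoding: each beam's score mixes the generation log-probability with a stepwise self-evaluation confidence. `lm_expand`, `self_eval`, and the mixing weight `alpha` are illustrative stand-ins, not the paper's exact formulation.

```python
# Beam search whose scores blend LM likelihood with self-evaluation.
import math, heapq

def lm_expand(prefix):   # stub: candidate (next chain, log-prob) pairs
    return [(prefix + [f"s{len(prefix)}a"], math.log(0.6)),
            (prefix + [f"s{len(prefix)}b"], math.log(0.4))]

def self_eval(chain):    # stub: model's confidence the chain is correct
    return 0.9 if chain[-1].endswith("a") else 0.5

def guided_beam_search(steps=3, beam_width=2, alpha=0.5):
    beams = [(0.0, [])]  # (cumulative score, reasoning chain)
    for _ in range(steps):
        candidates = []
        for score, chain in beams:
            for new_chain, logp in lm_expand(chain):
                # Calibrate generation probability with self-evaluation.
                s = score + alpha * logp \
                    + (1 - alpha) * math.log(self_eval(new_chain))
                candidates.append((s, new_chain))
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return beams[0]

print(guided_beam_search())  # best (score, chain) under the blended score
```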
arXiv Detail & Related papers (2023-05-01T02:37:59Z)