Related papers: ReportLogic: Evaluating Logical Quality in Deep Research Reports

ReportLogic: Evaluating Logical Quality in Deep Research Reports

URL: http://arxiv.org/abs/2602.18446v1
Date: Tue, 27 Jan 2026 14:06:33 GMT
Title: ReportLogic: Evaluating Logical Quality in Deep Research Reports
Authors: Jujia Zhao, Zhaoxin Huan, Zihan Wang, Xiaolu Zhang, Jun Zhou, Suzan Verberne, Zhaochun Ren,
Abstract summary: ReportLogic is a benchmark that quantifies report-level logical quality.<n>We construct a human-annotated rubric and train an open-source LogicJudge for scalable evaluation.
Score: 44.97940942982868
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Users increasingly rely on Large Language Models (LLMs) for Deep Research, using them to synthesize diverse sources into structured reports that support understanding and action. In this context, the practical reliability of such reports hinges on logical quality: whether the report's claims and arguments are explicitly supported and can be trusted as a basis for downstream use, rather than merely appearing fluent or informative. However, current evaluation frameworks largely overlook this requirement. To bridge this gap, we introduce ReportLogic, a benchmark that quantifies report-level logical quality through a reader-centric lens of auditability. Specifically, ReportLogic adopts a hierarchical taxonomy that evaluates whether readers can (1) trace an on-topic report structure with a unified analytical arc (Macro-Logic), (2) understand the progression with necessary context (Expositional-Logic), and (3) verify conclusions via explicit claim--support (Structural-Logic). Based on this taxonomy, we construct a human-annotated rubric-guided dataset and train an open-source LogicJudge for scalable evaluation. We further evaluate judge robustness via adversarial attacks, showing that off-the-shelf LLM judges are frequently influenced by superficial cues (e.g., verbosity), and reasoning modes can mask broken support relations. Overall, our results provide actionable guidance for building more robust logic evaluators and improving the logical reliability of LLM-generated reports.

Related papers

Last Layer Logits to Logic: Empowering LLMs with Logic-Consistent Structured Knowledge Reasoning [55.55968342644846]
Large Language Models (LLMs) achieve excellent performance in natural language reasoning tasks through pre-training on vast unstructured text.<n>We propose the textitLogits-to-Logic framework, which incorporates logits strengthening and logits filtering as core modules to correct logical defects in LLM outputs.
arXiv Detail & Related papers (2025-11-11T07:08:27Z)
Exploratory Semantic Reliability Analysis of Wind Turbine Maintenance Logs using Large Language Models [0.0]
This paper addresses the gap in leveraging modern large language models (LLMs) for more complex reasoning tasks.<n>We introduce an exploratory framework that uses LLMs to move beyond classification and perform semantic analysis.<n>The results demonstrate that LLMs can function as powerful "reliability co-pilots," moving beyond labelling to synthesise textual information and actionable, expert-level hypotheses.
arXiv Detail & Related papers (2025-09-26T14:00:20Z)
Implicit Reasoning in Large Language Models: A Comprehensive Survey [67.53966514728383]
Large Language Models (LLMs) have demonstrated strong generalization across a wide range of tasks.<n>Recent studies have shifted attention from explicit chain-of-thought prompting toward implicit reasoning.<n>This survey introduces a taxonomy centered on execution paradigms, shifting the focus from representational forms to computational strategies.
arXiv Detail & Related papers (2025-09-02T14:16:02Z)
SV-TrustEval-C: Evaluating Structure and Semantic Reasoning in Large Language Models for Source Code Vulnerability Analysis [39.229080120880774]
We introduce SV-TrustEval-C, a benchmark designed to evaluate Large Language Models' abilities for vulnerability analysis of code written in the C programming language.<n>Our results show that current LLMs are far from satisfactory in understanding complex code relationships and that their vulnerability analyses rely more on pattern matching than on robust logical reasoning.
arXiv Detail & Related papers (2025-05-27T02:16:27Z)
Mapping the Minds of LLMs: A Graph-Based Analysis of Reasoning LLM [11.181783720439563]
Large Language Models (LLMs) display sophisticated reasoning abilities via extended Chain-of-Thought (CoT) generation.<n>RLMs often demonstrate counterintuitive and unstable behaviors, such as performance degradation under few-shot prompting.<n>We introduce a unified graph-based analytical framework for better modeling the reasoning processes of RLMs.
arXiv Detail & Related papers (2025-05-20T03:54:57Z)
SCAN: Structured Capability Assessment and Navigation for LLMs [54.54085382131134]
textbfSCAN (Structured Capability Assessment and Navigation) is a practical framework that enables detailed characterization of Large Language Models.<n>SCAN incorporates four key components:.<n>TaxBuilder, which extracts capability-indicating tags from queries to construct a hierarchical taxonomy;.<n>RealMix, a query synthesis and filtering mechanism that ensures sufficient evaluation data for each capability tag;.<n>A PC$2$-based (Pre-Comparison-derived Criteria) LLM-as-a-Judge approach achieves significantly higher accuracy compared to classic LLM-as-a-Judge method
arXiv Detail & Related papers (2025-05-10T16:52:40Z)
Structured Prompting and Feedback-Guided Reasoning with LLMs for Data Interpretation [0.0]
Large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding and task generalization.<n>This paper introduces the STROT Framework, a method for structured prompting and feedback-driven transformation logic generation.
arXiv Detail & Related papers (2025-05-03T00:05:01Z)
Reasoning is All You Need for Video Generalization: A Counterfactual Benchmark with Sub-question Evaluation [19.46864730994867]
We introduce textbfCOVER (textbfunderlineCOunterfactual textbfunderlineEo textbfunderlineReasoning), a multidimensional multimodal benchmark.<n>It decomposes complex queries into structured sub-questions, enabling fine-grained reasoning analysis.
arXiv Detail & Related papers (2025-03-12T03:25:51Z)
StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs [78.84060166851805]
StructTest is a novel benchmark that evaluates large language models (LLMs) on their ability to follow compositional instructions and generate structured outputs.<n> Assessments are conducted deterministically using a rule-based evaluator, which can be easily extended to new tasks and datasets.<n>We demonstrate that StructTest remains challenging even for top-performing models like Deepseek-V3/R1 and GPT-4o.
arXiv Detail & Related papers (2024-12-23T22:08:40Z)
Modeling Hierarchical Reasoning Chains by Linking Discourse Units and Key Phrases for Reading Comprehension [80.99865844249106]
We propose a holistic graph network (HGN) which deals with context at both discourse level and word level, as the basis for logical reasoning. Specifically, node-level and type-level relations, which can be interpreted as bridges in the reasoning process, are modeled by a hierarchical interaction mechanism.
arXiv Detail & Related papers (2023-06-21T07:34:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.