Thinking Longer, Not Always Smarter: Evaluating LLM Capabilities in Hierarchical Legal Reasoning
- URL: http://arxiv.org/abs/2510.08710v1
- Date: Thu, 09 Oct 2025 18:15:28 GMT
- Title: Thinking Longer, Not Always Smarter: Evaluating LLM Capabilities in Hierarchical Legal Reasoning
- Authors: Li Zhang, Matthias Grabmair, Morgan Gray, Kevin Ashley
- Abstract summary: We propose a framework that decomposes the process of identifying significant distinctions between cases into three-stage reasoning tasks. Our framework models cases using factual predicates called factors, organizes them into a legal knowledge hierarchy, and defines verifiable rules for identifying distinctions. We find that models consistently expend more computational resources on incorrect responses than correct ones, suggesting that "thinking longer" does not always mean "thinking smarter."
- Score: 11.255428720705204
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Case-based reasoning is a cornerstone of U.S. legal practice, requiring professionals to argue about a current case by drawing analogies to and distinguishing from past precedents. While Large Language Models (LLMs) have shown remarkable capabilities, their proficiency in this complex, nuanced form of reasoning needs further investigation. We propose a formal framework that decomposes the process of identifying significant distinctions between cases into three-stage reasoning tasks. Our framework models cases using factual predicates called factors, organizes them into a legal knowledge hierarchy, and defines verifiable rules for identifying distinctions, analyzing their argumentative support, and evaluating their significance. Through comprehensive evaluation of modern reasoning LLMs, we reveal a paradox: while models achieve high accuracy on surface-level reasoning (Task 1), performance degrades on hierarchical reasoning (Task 2: 64.82%-92.09%) and collapses on integrated analysis (Task 3: 11.46%-33.99%). Most strikingly, we find that models consistently expend more computational resources on incorrect responses than correct ones, suggesting that "thinking longer" does not always mean "thinking smarter." Our work provides a methodology for fine-grained analysis of LLM reasoning capabilities in complex domains and reveals fundamental limitations that must be addressed for robust and trustworthy legal AI.
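The abstract describes the framework only at a high level, so the following Python sketch is purely illustrative: it shows one plausible way to model cases as sets of factors, attach a pro-plaintiff or pro-defendant side to each factor, and roll distinctions up a small knowledge hierarchy. All factor names, sides, and hierarchy entries below are hypothetical stand-ins, not the paper's actual factor set, rules, or code.

```python
# Illustrative sketch only: toy factors, sides, and hierarchy are invented
# for exposition and do not reproduce the paper's actual knowledge base.
from dataclasses import dataclass

# Each base-level factor favors one side: "p" (plaintiff) or "d" (defendant).
FACTOR_SIDE = {
    "F_SecurityMeasures": "p",
    "F_UniqueProduct": "p",
    "F_BribedEmployee": "p",
    "F_DisclosedInNegotiations": "d",
    "F_KnownToCompetitors": "d",
}

# Toy hierarchy: base factors roll up to higher-level legal concerns,
# mirroring the idea of a legal knowledge hierarchy over factors.
HIERARCHY = {
    "F_SecurityMeasures": "MaintainedSecrecy",
    "F_DisclosedInNegotiations": "MaintainedSecrecy",
    "F_UniqueProduct": "InfoValuable",
    "F_KnownToCompetitors": "InfoValuable",
    "F_BribedEmployee": "ImproperMeans",
}

@dataclass
class Case:
    name: str
    factors: set

def distinctions(current: Case, precedent: Case, cited_for: str = "p"):
    """Task-1-style surface distinctions: factor differences that make the
    current case weaker than the precedent for the side citing it."""
    other = "d" if cited_for == "p" else "p"
    only_precedent = precedent.factors - current.factors
    only_current = current.factors - precedent.factors
    weakening = [f for f in only_precedent if FACTOR_SIDE[f] == cited_for]
    weakening += [f for f in only_current if FACTOR_SIDE[f] == other]
    return weakening

def parent_concern(factor: str) -> str:
    """Task-2-style hook: locate a distinction in the hierarchy, where its
    argumentative support and significance would then be analyzed."""
    return HIERARCHY.get(factor, "Unclassified")

if __name__ == "__main__":
    precedent = Case("Precedent", {"F_SecurityMeasures", "F_UniqueProduct"})
    current = Case("Current", {"F_UniqueProduct", "F_KnownToCompetitors"})
    for f in distinctions(current, precedent, cited_for="p"):
        print(f, "->", parent_concern(f))
    # F_SecurityMeasures -> MaintainedSecrecy
    # F_KnownToCompetitors -> InfoValuable
```

Under these assumptions, a factor favoring the citing side that appears only in the precedent, or an opposing factor that appears only in the current case, surfaces as a distinction whose significance would then be judged at the level of its parent concern in the hierarchy.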
Related papers
- CausalFlip: A Benchmark for LLM Causal Judgment Beyond Semantic Matching [50.65932158912512]
We propose a new causal reasoning benchmark, CausalFlip, to encourage the development of new large language models. CausalFlip consists of causal judgment questions built over event triples that could form different confounder, chain, and collider relations. We evaluate LLMs under multiple training paradigms, including answer-only training, explicit Chain-of-Thought supervision, and a proposed internalized causal reasoning approach.
arXiv Detail & Related papers (2026-02-23T18:06:15Z)
- Towards Generalizable Reasoning: Group Causal Counterfactual Policy Optimization for LLM Reasoning [50.352417879912515]
Large language models (LLMs) excel at complex tasks with advances in reasoning capabilities. We propose Group Causal Counterfactual Policy Optimization to explicitly train LLMs to learn generalizable reasoning patterns. We then construct token-level advantages from this reward and optimize the policy, encouraging LLMs to favor reasoning patterns that are process-valid and counterfactually robust.
arXiv Detail & Related papers (2026-02-06T08:03:11Z)
- Do LLMs Truly Understand When a Precedent Is Overruled? [3.5784933879188796]
Large language models (LLMs) with extended context windows show promise for complex legal reasoning tasks. We present an assessment of state-of-the-art LLMs on identifying overruling relationships from U.S. Supreme Court cases.
arXiv Detail & Related papers (2025-10-23T19:07:42Z)
- Implicit Reasoning in Large Language Models: A Comprehensive Survey [67.53966514728383]
Large Language Models (LLMs) have demonstrated strong generalization across a wide range of tasks. Recent studies have shifted attention from explicit chain-of-thought prompting toward implicit reasoning. This survey introduces a taxonomy centered on execution paradigms, shifting the focus from representational forms to computational strategies.
arXiv Detail & Related papers (2025-09-02T14:16:02Z)
- On Verifiable Legal Reasoning: A Multi-Agent Framework with Formalized Knowledge Representations [0.0]
This paper introduces a modular multi-agent framework that decomposes legal reasoning into distinct knowledge acquisition and application stages. In the first stage, specialized agents extract legal concepts and formalize rules to create verifiable intermediate representations of statutes. The second stage applies this knowledge to specific cases through three steps: analyzing queries to map case facts onto the schema, performing symbolic inference to derive logically entailed conclusions, and generating final answers.
arXiv Detail & Related papers (2025-08-31T06:03:00Z)
- Cognitive Decision Routing in Large Language Models: When to Think Fast, When to Think Slow [0.0]
Large Language Models (LLMs) face a fundamental challenge in deciding when to rely on rapid, intuitive responses versus engaging in slower, more deliberate reasoning. Inspired by Daniel Kahneman's dual-process theory and his insights on human cognitive biases, we propose a novel Cognitive Decision Routing framework.
arXiv Detail & Related papers (2025-08-17T01:07:58Z)
- Dissecting Logical Reasoning in LLMs: A Fine-Grained Evaluation and Supervision Study [40.143148197878354]
We introduce FineLogic, a fine-grained evaluation framework that assesses logical reasoning across three dimensions. We study how different supervision formats in fine-tuning shape reasoning abilities. We find a key trade-off: natural language supervision excels at generalization, whereas symbolic supervision is superior at instilling structurally sound, atomic reasoning steps.
arXiv Detail & Related papers (2025-06-05T09:34:12Z)
- An Explicit Syllogistic Legal Reasoning Framework for Large Language Models [5.501226256903341]
Large language models (LLMs) can answer legal questions, but often struggle with explicit syllogistic reasoning. We introduce SyLeR, a novel framework designed to enable LLMs to perform explicit syllogistic legal reasoning. SyLeR employs a tree-structured hierarchical retrieval mechanism to synthesize relevant legal statutes and precedents.
arXiv Detail & Related papers (2025-04-05T03:34:51Z)
- Make LLMs better zero-shot reasoners: Structure-orientated autonomous reasoning [52.83539473110143]
We introduce a novel structure-oriented analysis method to help Large Language Models (LLMs) better understand a question.
To further improve the reliability in complex question-answering tasks, we propose a multi-agent reasoning system, Structure-oriented Autonomous Reasoning Agents (SARA).
Extensive experiments verify the effectiveness of the proposed reasoning system. Surprisingly, in some cases, the system even surpasses few-shot methods.
arXiv Detail & Related papers (2024-10-18T05:30:33Z)
- Self-Contradictory Reasoning Evaluation and Detection [31.452161594896978]
We investigate self-contradictory (Self-Contra) reasoning, where the model reasoning does not support its answers.
We find that LLMs often contradict themselves in reasoning tasks involving contextual information understanding or commonsense.
We find that GPT-4 can detect Self-Contra reasoning with a 52.2% F1 score, much lower than the 66.7% achieved by humans.
arXiv Detail & Related papers (2023-11-16T06:22:17Z)
- A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning [73.77088902676306]
We take a closer look at the self-verification abilities of large language models (LLMs) in the context of logical reasoning.
Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods.
arXiv Detail & Related papers (2023-11-14T07:13:10Z)
- DetermLR: Augmenting LLM-based Logical Reasoning from Indeterminacy to Determinacy [76.58614128865652]
We propose DetermLR, a novel perspective that rethinks the reasoning process as an evolution from indeterminacy to determinacy.
First, we categorize known conditions into two types: determinate and indeterminate premises. This provides an overall direction for the reasoning process and guides LLMs in converting indeterminate data into progressively determinate insights.
We automate the storage and extraction of available premises and reasoning paths with reasoning memory, preserving historical reasoning details for subsequent reasoning steps.
arXiv Detail & Related papers (2023-10-28T10:05:51Z)