Think Locally, Explain Globally: Graph-Guided LLM Investigations via Local Reasoning and Belief Propagation
- URL: http://arxiv.org/abs/2601.17915v2
- Date: Thu, 29 Jan 2026 18:18:39 GMT
- Title: Think Locally, Explain Globally: Graph-Guided LLM Investigations via Local Reasoning and Belief Propagation
- Authors: Saurabh Jha, Rohan Arora, Bhavya, Noah Zheutlin, Paulina Toro Isaza, Laura Shwartz, Yu Deng, Daby Sow, Ruchi Mahindru, Ruchir Puri
- Abstract summary: LLM agents excel when environments are mostly static and the needed information fits in a model's context window. ReAct-style agents are especially brittle in this regime. We propose EoG, a framework in which an LLM performs bounded local evidence mining and labeling (cause vs symptom) while a deterministic controller manages traversal, state, and belief propagation to compute a minimal explanatory frontier.
- Score: 5.191980417814362
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: LLM agents excel when environments are mostly static and the needed information fits in a model's context window, but they often fail in open-ended investigations where explanations must be constructed by iteratively mining evidence from massive, heterogeneous operational data. These investigations exhibit hidden dependency structure: entities interact, signals co-vary, and the importance of a fact may only become clear after other evidence is discovered. Because the context window is bounded, agents must summarize intermediate findings before their significance is known, increasing the risk of discarding key evidence. ReAct-style agents are especially brittle in this regime. Their retrieve-summarize-reason loop makes conclusions sensitive to exploration order and introduces run-to-run non-determinism, producing a reliability gap where Pass-at-k may be high but Majority-at-k remains low. Simply sampling more rollouts or generating longer reasoning traces does not reliably stabilize results, since hypotheses cannot be autonomously checked as new evidence arrives and there is no explicit mechanism for belief bookkeeping and revision. In addition, ReAct entangles semantic reasoning with controller duties such as tool orchestration and state tracking, so execution errors and plan drift degrade reasoning while consuming scarce context. We address these issues by formulating investigation as abductive reasoning over a dependency graph and proposing EoG (Explanations over Graphs), a disaggregated framework in which an LLM performs bounded local evidence mining and labeling (cause vs symptom) while a deterministic controller manages traversal, state, and belief propagation to compute a minimal explanatory frontier. On a representative ITBench diagnostics task, EoG improves both accuracy and run-to-run consistency over ReAct baselines, including a 7x average gain in Majority-at-k entity F1.
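To make the disaggregation concrete, the toy sketch below separates the two roles the abstract describes: a (stubbed) LLM labels each node locally as cause or symptom, while a deterministic controller owns traversal, state, and a damped belief-propagation sweep that yields a minimal explanatory frontier. Everything here is hypothetical (the graph, the label table, the `investigate`, `pass_at_k`, and `majority_at_k` helpers, and the damping rule); it illustrates the general idea, not the paper's actual EoG algorithm.

```python
"""Illustrative sketch only: a toy EoG-style loop in which an LLM labels
nodes locally (stubbed as a lookup table) while a deterministic controller
owns traversal, state, and belief propagation. All names are hypothetical."""
from collections import Counter, deque

# Toy dependency graph: edges point from a potential cause to its effects.
GRAPH = {
    "db-latency": ["api-errors"],
    "api-errors": ["checkout-fail", "alert-burst"],
    "checkout-fail": [],
    "alert-burst": [],
}

# Stub for the bounded local LLM step. In a real system this call would mine
# local evidence for one node and return a (label, confidence) pair.
LLM_LABELS = {
    "db-latency": ("cause", 0.90),
    "api-errors": ("symptom", 0.80),
    "checkout-fail": ("symptom", 0.95),
    "alert-burst": ("symptom", 0.70),
}

def investigate(seed, damping=0.8, threshold=0.5):
    """Deterministic controller: BFS traversal plus one damped
    belief-propagation sweep; returns the explanatory frontier."""
    parents = {node: [] for node in GRAPH}
    for cause, effects in GRAPH.items():
        for effect in effects:
            parents[effect].append(cause)
    belief, queue, seen = {}, deque([seed]), {seed}
    while queue:                                  # controller owns the loop
        node = queue.popleft()
        label, conf = LLM_LABELS[node]            # bounded local LLM call
        belief[node] = conf if label == "cause" else 1.0 - conf
        for nxt in GRAPH[node] + parents[node]:   # expand both directions
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    # One damped sweep over a snapshot: a node explained by a stronger
    # upstream cause sheds most of its own causal belief.
    prior = dict(belief)
    for node in belief:
        if any(prior.get(p, 0.0) > prior[node] for p in parents[node]):
            belief[node] *= 1.0 - damping
    # Minimal explanatory frontier: remaining high-belief causes.
    return {n for n, b in belief.items() if b >= threshold}

def pass_at_k(answers, gold):
    """Pass-at-k: at least one of k rollouts is correct."""
    return float(any(a == gold for a in answers))

def majority_at_k(answers, gold):
    """Majority-at-k: the modal answer across k rollouts is correct."""
    return float(Counter(answers).most_common(1)[0][0] == gold)

if __name__ == "__main__":
    print(investigate("checkout-fail"))             # -> {'db-latency'}
    runs = ["cache", "db", "cache", "net", "gc"]    # 5 noisy rollouts
    print(pass_at_k(runs, "db"), majority_at_k(runs, "db"))  # 1.0 0.0
```

Because ordering and belief bookkeeping live in the controller rather than the model, repeated runs over the same graph and labels return the same frontier; the metric helpers show how five noisy rollouts can still score 1.0 on Pass-at-k while scoring 0.0 on Majority-at-k.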
Related papers
- CausalFlip: A Benchmark for LLM Causal Judgment Beyond Semantic Matching [50.65932158912512]
We propose a new causal reasoning benchmark, CausalFlip, to encourage the development of new large language models. CausalFlip consists of causal judgment questions built over event triples that can form different confounder, chain, and collider relations. We evaluate LLMs under multiple training paradigms, including answer-only training, explicit Chain-of-Thought supervision, and a proposed internalized causal reasoning approach.
arXiv Detail & Related papers (2026-02-23T18:06:15Z) - Causality is Key for Interpretability Claims to Generalise [35.833847356014154]
Interpretability research on large language models (LLMs) has yielded important insights into model behaviour. Yet recurring pitfalls persist: findings that do not generalise, and causal interpretations that outrun the evidence. Pearl's causal hierarchy clarifies what an interpretability study can justify.
arXiv Detail & Related papers (2026-02-18T18:45:04Z) - Mitigating Safety Tax via Distribution-Grounded Refinement in Large Reasoning Models [63.368505631152594]
Safety alignment incurs a safety tax that perturbs a large reasoning model's (LRM) general reasoning ability. Existing datasets used for safety alignment of an LRM are usually constructed by distilling safety reasoning traces and answers from an external LRM or human labeler. We propose a safety alignment dataset construction method, dubbed DGR. DGR transforms and refines an existing out-of-distribution safety reasoning dataset to align it with the target LLM's inner distribution.
arXiv Detail & Related papers (2026-02-02T14:18:48Z) - Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation [76.5533899503582]
Large language models (LLMs) are increasingly used as judges to evaluate agent performance. We show this paradigm implicitly assumes that the agent's chain-of-thought (CoT) reasoning faithfully reflects both its internal reasoning and the underlying environment state. We demonstrate that manipulated reasoning alone can inflate false positive rates of state-of-the-art VLM judges by up to 90% across 800 trajectories spanning diverse web tasks.
arXiv Detail & Related papers (2026-01-21T06:07:43Z) - CoG: Controllable Graph Reasoning via Relational Blueprints and Failure-Aware Refinement over Knowledge Graphs [53.199517625701475]
CoG is a training-free framework inspired by Dual-Process Theory that mimics the interplay between intuition and deliberation. CoG significantly outperforms state-of-the-art approaches in both accuracy and efficiency.
arXiv Detail & Related papers (2026-01-16T07:27:40Z) - Towards Comprehensive Stage-wise Benchmarking of Large Language Models in Fact-Checking [64.97768177044355]
Large Language Models (LLMs) are increasingly deployed in real-world fact-checking systems. We present FactArena, a fully automated arena-style evaluation framework. Our analyses reveal significant discrepancies between static claim-verification accuracy and end-to-end fact-checking competence.
arXiv Detail & Related papers (2026-01-06T02:51:56Z) - Project Ariadne: A Structural Causal Framework for Auditing Faithfulness in LLM Agents [0.0]
We introduce Project Ariadne, a novel XAI framework to audit the causal integrity of agentic reasoning. Unlike existing interpretability methods that rely on surface-level textual similarity, Project Ariadne performs hard interventions (do-calculus) on intermediate reasoning nodes. Our empirical evaluation of state-of-the-art models reveals a persistent Faithfulness Gap.
arXiv Detail & Related papers (2026-01-05T18:05:29Z) - Abductive Inference in Retrieval-Augmented Language Models: Generating and Validating Missing Premises [0.0]
We propose a framework that integrates abductive inference into retrieval-augmented LLMs. Experimental results on abductive reasoning and multi-hop QA benchmarks show that our approach improves both answer accuracy and reasoning faithfulness. This work highlights abductive inference as a promising direction for enhancing the robustness and explainability of RAG systems.
arXiv Detail & Related papers (2025-11-06T03:37:24Z) - MR-Align: Meta-Reasoning Informed Factuality Alignment for Large Reasoning Models [43.872922223495586]
Large reasoning models (LRMs) show strong capabilities in complex reasoning, yet their marginal gains on evidence-dependent factual questions are limited. We find this limitation is partially attributable to a reasoning-answer hit gap, where the model identifies the correct facts during reasoning but fails to incorporate them into the final response. We propose MR-ALIGN, a framework that enhances factuality without relying on external verifiers.
arXiv Detail & Related papers (2025-10-27T15:00:54Z) - VAR: Visual Attention Reasoning via Structured Search and Backtracking [49.427842994857635]
We introduce Visual Attention Reasoning (VAR), a framework that recasts grounded reasoning as a structured search. VAR decomposes the reasoning process into two key stages: traceable evidence grounding and search-based chain-of-thought. We show that our 7B model, VAR-7B, sets a new state-of-the-art on a comprehensive suite of hallucination and safety benchmarks.
arXiv Detail & Related papers (2025-10-21T13:18:44Z) - Retrieval is Not Enough: Enhancing RAG Reasoning through Test-Time Critique and Optimization [58.390885294401066]
Retrieval-augmented generation (RAG) has become a widely adopted paradigm for enabling knowledge-grounded large language models (LLMs). RAG pipelines often fail to ensure that model reasoning remains consistent with the evidence retrieved, leading to factual inconsistencies or unsupported conclusions. We propose AlignRAG, a novel iterative framework grounded in Critique-Driven Alignment (CDA). We introduce AlignRAG-auto, an autonomous variant that dynamically terminates refinement, removing the need to pre-specify the number of critique iterations.
arXiv Detail & Related papers (2025-04-21T04:56:47Z) - Patterns Over Principles: The Fragility of Inductive Reasoning in LLMs under Noisy Observations [43.491353243991284]
We introduce Robust Rule Induction, a task that evaluates large language models' capability to infer rules from data fused with noisy examples. Experiments across arithmetic, cryptography, and list functions reveal: (1) SRR outperforms other methods with minimal performance degradation under noise; (2) despite slight accuracy variation, LLMs exhibit instability under noise; and (3) counterfactual task gaps highlight LLMs' reliance on memorized patterns over genuine abstraction.
arXiv Detail & Related papers (2025-02-22T10:03:19Z) - Retrieving Classes of Causal Orders with Inconsistent Knowledge Bases [0.8192907805418583]
Large Language Models (LLMs) have emerged as a promising alternative for extracting causal knowledge from text-based metadata. LLMs tend to be unreliable and prone to hallucinations, necessitating strategies that account for their limitations. We present a new method to derive a class of acyclic tournaments, which represent plausible causal orders.
arXiv Detail & Related papers (2024-12-18T16:37:51Z)