Diagnosing Structural Failures in LLM-Based Evidence Extraction for Meta-Analysis
- URL: http://arxiv.org/abs/2602.10881v1
- Date: Wed, 11 Feb 2026 14:09:43 GMT
- Title: Diagnosing Structural Failures in LLM-Based Evidence Extraction for Meta-Analysis
- Authors: Zhiyin Tan, Jennifer D'Souza
- Abstract summary: Systematic reviews and meta-analyses rely on converting narrative articles into structured, numerically grounded study records. Despite rapid advances in large language models (LLMs), it remains unclear whether they can meet the structural requirements of this process. We propose a structural, diagnostic framework that evaluates LLM-based evidence extraction as a progression of schema-constrained queries.
- Score: 0.8193467416247519
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Systematic reviews and meta-analyses rely on converting narrative articles into structured, numerically grounded study records. Despite rapid advances in large language models (LLMs), it remains unclear whether they can meet the structural requirements of this process, which hinge on preserving roles, methods, and effect-size attribution across documents rather than on recognizing isolated entities. We propose a structural, diagnostic framework that evaluates LLM-based evidence extraction as a progression of schema-constrained queries with increasing relational and numerical complexity, enabling precise identification of failure points beyond atom-level extraction. Using a manually curated corpus spanning five scientific domains, together with a unified query suite and evaluation protocol, we evaluate two state-of-the-art LLMs under both per-document and long-context, multi-document input regimes. Across domains and models, performance remains moderate for single-property queries but degrades sharply once tasks require stable binding between variables, roles, statistical methods, and effect sizes. Full meta-analytic association tuples are extracted with near-zero reliability, and long-context inputs further exacerbate these failures. Downstream aggregation amplifies even minor upstream errors, rendering corpus-level statistics unreliable. Our analysis shows that these limitations stem not from entity recognition errors, but from systematic structural breakdowns, including role reversals, cross-analysis binding drift, instance compression in dense result sections, and numeric misattribution, indicating that current LLMs lack the structural fidelity, relational binding, and numerical grounding required for automated meta-analysis. The code and data are publicly available at GitHub (https://github.com/zhiyintan/LLM-Meta-Analysis).
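To make the evaluation design concrete, here is a minimal sketch of what a progression of schema-constrained queries could look like in code. This is an illustration of the idea only, not the authors' query suite: the tier names, field names, and `is_complete` validator are assumptions, and the actual schemas live in the linked repository.

```python
# A minimal sketch (not the authors' code) of a progression of
# schema-constrained queries. Tier names and fields are illustrative
# assumptions; the real query suite is in the linked repository.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SinglePropertyQuery:                 # Tier 1: isolated atoms
    sample_size: Optional[int] = None

@dataclass
class RoleBindingQuery:                    # Tier 2: variable-role-method binding
    independent_variable: str = ""
    dependent_variable: str = ""
    statistical_method: str = ""           # e.g. "OLS regression"

@dataclass
class AssociationTuple(RoleBindingQuery):  # Tier 3: full meta-analytic record
    effect_size: Optional[float] = None    # e.g. Pearson r, odds ratio
    p_value: Optional[float] = None
    sample_size: Optional[int] = None

def is_complete(t: AssociationTuple) -> bool:
    """Usable for aggregation only if every binding survived extraction."""
    return all([t.independent_variable, t.dependent_variable,
                t.statistical_method, t.effect_size is not None,
                t.sample_size is not None])
```

Reading the tiers top to bottom mirrors the paper's finding: models handle Tier-1 queries moderately well, but reliability collapses once a Tier-3 record requires every binding checked in `is_complete` to hold simultaneously.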
Related papers
- DeepSynth-Eval: Objectively Evaluating Information Consolidation in Deep Survey Writing [53.85037373860246]
We introduce DeepSynth-Eval, a benchmark designed to objectively evaluate information consolidation capabilities. We propose a fine-grained evaluation protocol using General Checklists (for factual coverage) and Constraint Checklists (for structural organization). Our results demonstrate that agentic plan-and-write approaches significantly outperform single-turn generation. A toy checklist-scoring sketch follows this entry.
arXiv Detail & Related papers (2026-01-07T03:07:52Z)
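Below is a toy sketch of checklist-based scoring in the spirit of DeepSynth-Eval, assuming a naive substring matcher in place of whatever judge the benchmark actually uses; `checklist_score`, the example checklists, and the report text are all invented for illustration.

```python
# Toy sketch of checklist-based scoring; not the benchmark's protocol.
from typing import List

def checklist_score(report: str, checklist: List[str]) -> float:
    """Fraction of checklist items whose key phrase appears in the report.
    A real protocol would use an LLM or entailment judge per item."""
    hits = sum(1 for item in checklist if item.lower() in report.lower())
    return hits / len(checklist) if checklist else 0.0

general = ["transformer architecture", "attention mechanism"]  # factual coverage
constraints = ["comparison table", "chronological ordering"]   # structure
report = "The survey covers the transformer architecture in a comparison table."
print(checklist_score(report, general), checklist_score(report, constraints))  # 0.5 0.5
```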
- From Chaos to Clarity: Schema-Constrained AI for Auditable Biomedical Evidence Extraction from Full-Text PDFs [2.136797327390818]
Existing document AI systems are limited by OCR errors, long-document fragmentation, constrained throughput, and insufficient auditability for high-stakes synthesis. We present a schema-constrained AI extraction system that transforms full-text biomedical PDFs into structured, analysis-ready records.
arXiv Detail & Related papers (2025-12-31T00:43:53Z)
- Resolving Evidence Sparsity: Agentic Context Engineering for Long-Document Understanding [49.26132236798123]
Vision Language Models (VLMs) have gradually become a primary approach in document understanding. We propose SLEUTH, a multi-agent framework that orchestrates a retriever and four collaborative agents in a coarse-to-fine process. The framework identifies key textual and visual clues within the retrieved pages, filters for salient visual evidence such as tables and charts, and analyzes the query to devise a reasoning strategy.
arXiv Detail & Related papers (2025-11-28T03:09:40Z)
- Beyond Isolated Dots: Benchmarking Structured Table Construction as Deep Knowledge Extraction [80.88654868264645]
We introduce AOE, the Arranged and Organized Extraction benchmark, designed to evaluate the ability of large language models to comprehend fragmented documents. AOE includes 11 carefully crafted tasks across three diverse domains, requiring models to generate context-specific schemas tailored to varied input queries. Results show that even the most advanced models struggle significantly.
arXiv Detail & Related papers (2025-07-22T06:37:51Z)
- Manalyzer: End-to-end Automated Meta-analysis with Multi-agent System [48.093356587573666]
Meta-analysis is a systematic research methodology that synthesizes data from multiple existing studies to derive comprehensive conclusions. Traditional meta-analysis involves a complex multi-stage pipeline including literature retrieval, paper screening, and data extraction. We propose a multi-agent system, Manalyzer, which achieves end-to-end automated meta-analysis through tool calls.
arXiv Detail & Related papers (2025-05-22T07:25:31Z)
- HiMATE: A Hierarchical Multi-Agent Framework for Machine Translation Evaluation [39.7293877954587]
HiMATE is a Hierarchical Multi-Agent Framework for Machine Translation Evaluation. We develop a hierarchical multi-agent system grounded in the MQM error typology, enabling granular evaluation of subtype errors. Empirically, HiMATE outperforms competitive baselines across different datasets in conducting human-aligned evaluations.
arXiv Detail & Related papers (2025-05-22T06:24:08Z)
- Structured Prompting and Feedback-Guided Reasoning with LLMs for Data Interpretation [0.0]
Large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding and task generalization. This paper introduces the STROT Framework, a method for structured prompting and feedback-driven transformation logic generation. A hedged sketch of such a feedback loop follows this entry.
arXiv Detail & Related papers (2025-05-03T00:05:01Z)
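Below is a hedged sketch of the kind of feedback-guided loop the STROT summary gestures at: generate, validate structure, and feed validation errors back into the next prompt. The `call_llm` placeholder, the JSON plan format, and the retry protocol are assumptions for illustration, not the paper's implementation.

```python
# Sketch of a feedback-guided generation loop; all names are illustrative.
import json

def call_llm(prompt: str) -> str:  # placeholder for any chat-completion client
    raise NotImplementedError

def generate_with_feedback(task: str, max_rounds: int = 3) -> dict:
    prompt = f"Return ONLY a JSON object with a 'steps' list for: {task}"
    for _ in range(max_rounds):
        raw = call_llm(prompt)
        try:
            plan = json.loads(raw)
            if isinstance(plan, dict) and isinstance(plan.get("steps"), list):
                return plan  # structurally valid output
            error = "JSON parsed but 'steps' is missing or not a list."
        except json.JSONDecodeError as exc:
            error = f"Invalid JSON: {exc}"
        # Feed the structural error back into the next prompt.
        prompt = f"{prompt}\nYour previous output failed validation: {error}"
    raise ValueError("No structurally valid plan after retries.")
```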
- DataPuzzle: Breaking Free from the Hallucinated Promise of LLMs in Data Analysis [10.98270220152657]
Large language models (LLMs) are increasingly applied to multi-modal data analysis. The prevailing "Prompt-to-Answer" paradigm treats LLMs as black-box analysts. We propose DataPuzzle, a conceptual multi-agent framework that decomposes complex questions.
arXiv Detail & Related papers (2025-04-14T09:38:23Z)
- StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs [78.84060166851805]
StructTest is a novel benchmark that evaluates large language models (LLMs) on their ability to follow compositional instructions and generate structured outputs. Assessments are conducted deterministically using a rule-based evaluator, which can be easily extended to new tasks and datasets. We demonstrate that StructTest remains challenging even for top-performing models like Deepseek-V3/R1 and GPT-4o. A minimal rule-based check is sketched after this entry.
arXiv Detail & Related papers (2024-12-23T22:08:40Z)
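A rule-based evaluator of the kind StructTest describes can be fully deterministic. The sketch below checks one invented formatting rule (a consecutively numbered list); it illustrates the approach, not any actual StructTest task.

```python
# Minimal sketch of a deterministic, rule-based structured-output check.
# The concrete rule is an invented example, not a StructTest task.
import re

def check_numbered_list(output: str, expected_items: int) -> bool:
    """Pass iff the output is exactly `expected_items` non-empty lines of
    the form 'k. text', numbered consecutively from 1. No LLM judge is
    involved, so the verdict is reproducible across runs."""
    lines = [l for l in output.strip().splitlines() if l.strip()]
    if len(lines) != expected_items:
        return False
    return all(re.match(rf"^{i}\.\s+\S", l) for i, l in enumerate(lines, 1))

print(check_numbered_list("1. alpha\n2. beta\n3. gamma", 3))  # True
print(check_numbered_list("- alpha\n- beta", 2))              # False
```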
- Bring Your Own Data! Self-Supervised Evaluation for Large Language Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs). We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence. We find strong correlations between self-supervised and human-supervised evaluations. A toy perturbation-style probe is sketched after this entry.
arXiv Detail & Related papers (2023-06-23T17:59:09Z)
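The self-supervised idea can be illustrated with a perturbation probe: score texts before and after a transformation and measure how much the model's output moves, with no human labels. This sketch is an assumption-laden illustration; the `sensitivity` helper, the negation perturbation, and the toy scorer are invented, not the paper's metrics.

```python
# Toy self-supervised probe: all functions here are illustrative stand-ins.
from typing import Callable, List

def sensitivity(score: Callable[[str], float],
                texts: List[str],
                perturb: Callable[[str], str]) -> float:
    """Mean absolute change in a model score under a perturbation.
    Large values flag instability under the chosen edit, with no
    human-labeled references required."""
    deltas = [abs(score(t) - score(perturb(t))) for t in texts]
    return sum(deltas) / len(deltas)

if __name__ == "__main__":
    perturb = lambda t: t.replace(" is ", " is not ")   # meaning-changing edit
    toy_score = lambda t: float("not" not in t)         # stand-in model score
    print(sensitivity(toy_score, ["water is wet", "fire is hot"], perturb))  # 1.0
```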