Large Language Models for Full-Text Methods Assessment: A Case Study on Mediation Analysis
- URL: http://arxiv.org/abs/2510.10762v1
- Date: Sun, 12 Oct 2025 19:04:22 GMT
- Title: Large Language Models for Full-Text Methods Assessment: A Case Study on Mediation Analysis
- Authors: Wenqing Zhang, Trang Nguyen, Elizabeth A. Stuart, Yiqun T. Chen
- Abstract summary: Large language models (LLMs) offer potential for automating methodological assessments. We benchmarked state-of-the-art LLMs against expert human reviewers across 180 full-text scientific articles.
- Score: 15.98124151893659
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Systematic reviews are crucial for synthesizing scientific evidence but remain labor-intensive, especially when extracting detailed methodological information. Large language models (LLMs) offer potential for automating methodological assessments, promising to transform evidence synthesis. Here, using causal mediation analysis as a representative methodological domain, we benchmarked state-of-the-art LLMs against expert human reviewers across 180 full-text scientific articles. Model performance closely correlated with human judgments (accuracy correlation 0.71; F1 correlation 0.97), achieving near-human accuracy on straightforward, explicitly stated methodological criteria. However, accuracy sharply declined on complex, inference-intensive assessments, lagging expert reviewers by up to 15%. Errors commonly resulted from superficial linguistic cues -- for instance, models frequently misinterpreted keywords like "longitudinal" or "sensitivity" as automatic evidence of rigorous methodological approaches, leading to systematic misclassifications. Longer documents yielded lower model accuracy, whereas publication year showed no significant effect. Our findings highlight an important pattern for practitioners using LLMs for methods review and synthesis from full texts: current LLMs excel at identifying explicit methodological features but require human oversight for nuanced interpretations. Integrating automated information extraction with targeted expert review thus provides a promising approach to enhance efficiency and methodological rigor in evidence synthesis across diverse scientific fields.
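As a rough, hypothetical illustration of what the reported "accuracy correlation" and "F1 correlation" could look like in practice, the sketch below scores a simulated LLM and a simulated expert reviewer against gold-standard labels on a handful of methodological criteria and then correlates their per-criterion performance. The criterion names, the 180-article toy labels, and the error rates are assumptions made for illustration, not the paper's data or released code.

```python
# Minimal sketch (not the authors' code): correlate per-criterion performance
# of an LLM and a human reviewer against gold-standard labels.
# All labels below are simulated; criterion names are hypothetical.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import accuracy_score, f1_score

criteria = [f"criterion_{i}" for i in range(8)]  # hypothetical assessment items
n_articles = 180
rng = np.random.default_rng(0)

# Gold-standard binary judgments per article for each criterion (toy data).
gold = {c: rng.integers(0, 2, size=n_articles) for c in criteria}
# Simulated raters: keep the gold label with some probability, otherwise flip it.
llm = {c: np.where(rng.random(n_articles) < 0.85, gold[c], 1 - gold[c]) for c in criteria}
expert = {c: np.where(rng.random(n_articles) < 0.95, gold[c], 1 - gold[c]) for c in criteria}

# Per-criterion accuracy and F1 for each rater.
llm_acc = [accuracy_score(gold[c], llm[c]) for c in criteria]
exp_acc = [accuracy_score(gold[c], expert[c]) for c in criteria]
llm_f1 = [f1_score(gold[c], llm[c]) for c in criteria]
exp_f1 = [f1_score(gold[c], expert[c]) for c in criteria]

# Correlation of model vs. human performance across criteria.
print("accuracy correlation:", round(pearsonr(llm_acc, exp_acc)[0], 2))
print("F1 correlation:", round(pearsonr(llm_f1, exp_f1)[0], 2))
```

In the actual study the comparison is against expert human reviewers on 180 full-text articles; the sketch only shows one plausible way such correlation figures can be computed from per-criterion scores.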
Related papers
- Disagreement as Data: Reasoning Trace Analytics in Multi-Agent Systems [0.0]
We propose that reasoning traces generated by large language model (LLM) agents constitute a novel and rich form of process data. We apply cosine similarity to reasoning traces to systematically detect, quantify, and interpret disagreements among agents.
arXiv Detail & Related papers (2026-01-18T23:19:49Z)
- AutoMalDesc: Large-Scale Script Analysis for Cyber Threat Research [81.04845910798387]
Generating natural language explanations for threat detections remains an open problem in cybersecurity research. We present AutoMalDesc, an automated static analysis summarization framework that operates independently at scale. We publish our complete dataset of more than 100K script samples, including annotated seed (0.9K) datasets, along with our methodology and evaluation framework.
arXiv Detail & Related papers (2025-11-17T13:05:25Z)
- Autoformalizer with Tool Feedback [52.334957386319864]
Autoformalization addresses the scarcity of data for Automated Theorem Proving (ATP) by translating mathematical problems from natural language into formal statements. Existing formalizers still struggle to consistently generate valid statements that meet syntactic validity and semantic consistency. We propose the Autoformalizer with Tool Feedback (ATF), a novel approach that incorporates syntactic and consistency information as tools into the formalization process.
arXiv Detail & Related papers (2025-10-08T10:25:12Z)
- Text-Based Approaches to Item Difficulty Modeling in Large-Scale Assessments: A Systematic Review [18.045716459188366]
Item difficulty plays a crucial role in test performance, interpretability of scores, and equity for all test-takers, especially in large-scale assessments. Traditional approaches to item difficulty modeling rely on field testing and classical test theory (CTT)-based item analysis or item response theory (IRT) calibration. This paper reviews and synthesizes 37 articles on automated item difficulty prediction in large-scale assessment settings published through May 2025.
arXiv Detail & Related papers (2025-09-27T20:19:39Z)
- Avaliação de eficiência na leitura: uma abordagem baseada em PLN (Assessing reading efficiency: an NLP-based approach) [0.0]
This study proposes an automated evaluation model for the cloze test in Brazilian Portuguese. The integrated method demonstrated its effectiveness, achieving a high correlation with human evaluation.
arXiv Detail & Related papers (2025-08-18T02:21:12Z)
- Beyond "Not Novel Enough": Enriching Scholarly Critique with LLM-Assisted Feedback [81.0031690510116]
We present a structured approach for automated novelty evaluation that models expert reviewer behavior through three stages. Our method is informed by a large-scale analysis of human-written novelty reviews. Evaluated on 182 ICLR 2025 submissions, the approach achieves 86.5% alignment with human reasoning and 75.3% agreement on novelty conclusions.
arXiv Detail & Related papers (2025-08-14T16:18:37Z)
- Empowering Meta-Analysis: Leveraging Large Language Models for Scientific Synthesis [7.059964549363294]
This study investigates the automation of meta-analysis in scientific documents using large language models (LLMs).
Our research introduces a novel approach that fine-tunes the LLM on extensive scientific datasets to address challenges in big data handling and structured data extraction.
arXiv Detail & Related papers (2024-11-16T20:18:57Z)
- Enhancing Spectral Knowledge Interrogation: A Reliable Retrieval-Augmented Generative Framework on Large Language Models [0.0]
Large language models (LLMs) have demonstrated significant success in a range of natural language processing (NLP) tasks within the general domain. We introduce the Spectral Detection and Analysis Based Paper (SDAAP) dataset, which is the first open-source textual knowledge dataset for spectral analysis and detection. We also designed an automated Q&A framework based on the SDAAP dataset, which can retrieve relevant knowledge and generate high-quality responses.
arXiv Detail & Related papers (2024-08-21T12:09:37Z)
- Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z)
- Uncertainty in Automated Ontology Matching: Lessons Learned from an Empirical Experimentation [6.491645162078057]
Ontologies play a critical role in linking and semantically integrating datasets via interoperability.
This paper approaches data integration from an application perspective, looking at techniques based on ontology matching.
arXiv Detail & Related papers (2023-10-18T05:42:51Z)
- Bias and Fairness in Large Language Models: A Survey [73.87651986156006]
We present a comprehensive survey of bias evaluation and mitigation techniques for large language models (LLMs).
We first consolidate, formalize, and expand notions of social bias and fairness in natural language processing.
We then unify the literature by proposing three intuitive taxonomies, two for bias evaluation, and one for mitigation.
arXiv Detail & Related papers (2023-09-02T00:32:55Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.