Enhancing Study-Level Inference from Clinical Trial Papers via RL-based Numeric Reasoning
- URL: http://arxiv.org/abs/2505.22928v1
- Date: Wed, 28 May 2025 22:59:45 GMT
- Title: Enhancing Study-Level Inference from Clinical Trial Papers via RL-based Numeric Reasoning
- Authors: Massimiliano Pronesti, Michela Lorandi, Paul Flanagan, Oisin Redmon, Anya Belz, Yufang Hou
- Abstract summary: We conceptualise the problem as one of quantitative reasoning. We develop a numeric reasoning system composed of a numeric data extraction model and an effect estimate component.
- Score: 10.449112615828419
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Systematic reviews in medicine play a critical role in evidence-based decision-making by aggregating findings from multiple studies. A central bottleneck in automating this process is extracting numeric evidence and determining study-level conclusions for specific outcomes and comparisons. Prior work has framed this problem as a textual inference task by retrieving relevant content fragments and inferring conclusions from them. However, such approaches often rely on shallow textual cues and fail to capture the underlying numeric reasoning behind expert assessments. In this work, we conceptualise the problem as one of quantitative reasoning. Rather than inferring conclusions from surface text, we extract structured numerical evidence (e.g., event counts or standard deviations) and apply domain-knowledge-informed logic to derive outcome-specific conclusions. We develop a numeric reasoning system composed of a numeric data extraction model and an effect estimate component, enabling more accurate and interpretable inference aligned with domain-expert principles. We train the numeric data extraction model using different strategies, including supervised fine-tuning (SFT) and reinforcement learning (RL) with a new value reward model. When evaluated on the CochraneForest benchmark, our best-performing approach -- using RL to train a small-scale number extraction model -- yields up to a 21% absolute improvement in F1 score over retrieval-based systems and outperforms general-purpose LLMs of over 400B parameters by up to 9%. Our results demonstrate the promise of reasoning-driven approaches for automating systematic evidence synthesis.
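To make the effect-estimate step concrete, below is a minimal, hypothetical Python sketch (not the paper's actual component): it takes event counts extracted for an intervention and a comparator arm, computes a risk ratio with an approximate 95% confidence interval, and maps the interval to an outcome-specific conclusion. The function name, the normal-approximation interval, and the verdict labels are illustrative assumptions.

```python
import math

def risk_ratio_conclusion(events_tx: int, n_tx: int, events_ctrl: int, n_ctrl: int):
    """Hypothetical effect-estimate step: derive an outcome-specific conclusion
    for a binary outcome from extracted event counts (illustrative only)."""
    # Observed risk in each arm
    risk_tx = events_tx / n_tx
    risk_ctrl = events_ctrl / n_ctrl
    rr = risk_tx / risk_ctrl

    # Approximate 95% CI for the risk ratio on the log scale
    se_log_rr = math.sqrt(1 / events_tx - 1 / n_tx + 1 / events_ctrl - 1 / n_ctrl)
    z = 1.96  # normal quantile for a 95% interval
    lo = math.exp(math.log(rr) - z * se_log_rr)
    hi = math.exp(math.log(rr) + z * se_log_rr)

    # Map the interval to a study-level verdict (assumes the event is harmful,
    # so a lower risk in the intervention arm favours the intervention)
    if hi < 1:
        verdict = "favours intervention"
    elif lo > 1:
        verdict = "favours comparator"
    else:
        verdict = "no significant difference"
    return rr, (lo, hi), verdict

# Example: 12/100 events in the intervention arm vs. 25/100 in the comparator arm
print(risk_ratio_conclusion(12, 100, 25, 100))
```

In the paper, this kind of conclusion logic is informed by domain-expert principles rather than the simple interval threshold shown here.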
Related papers
- AnesBench: Multi-Dimensional Evaluation of LLM Reasoning in Anesthesiology [47.52685298426068]
We systematically evaluate the reasoning capabilities of large language models (LLMs) in anesthesiology. AnesBench is a cross-lingual benchmark designed to assess anesthesiology-related reasoning across three levels.
arXiv Detail & Related papers (2025-04-03T08:54:23Z) - Perplexity Trap: PLM-Based Retrievers Overrate Low Perplexity Documents [64.43980129731587]
We propose a causal-inspired inference-time debiasing method called Causal Diagnosis and Correction (CDC). CDC first diagnoses the bias effect of perplexity and then separates this bias effect from the overall relevance score. Experimental results across three domains demonstrate its superior debiasing effectiveness.
arXiv Detail & Related papers (2025-03-11T17:59:00Z) - Model-free Methods for Event History Analysis and Efficient Adjustment (PhD Thesis) [55.2480439325792]
This thesis is a series of independent contributions to statistics unified by a model-free perspective. The first chapter elaborates on how a model-free perspective can be used to formulate flexible methods that leverage prediction techniques from machine learning. The second chapter studies the concept of local independence, which describes whether the evolution of one process is directly influenced by another.
arXiv Detail & Related papers (2025-02-11T19:24:09Z) - Federated Causal Inference: Multi-Study ATE Estimation beyond Meta-Analysis [12.896319628045967]
We study Federated Causal Inference, an approach to estimate treatment effects from decentralized data across centers. We compare three classes of Average Treatment Effect (ATE) estimators derived from the Plug-in G-Formula.
arXiv Detail & Related papers (2024-10-22T10:19:17Z) - Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z) - Key Design Choices in Source-Free Unsupervised Domain Adaptation: An In-depth Empirical Analysis [16.0130560365211]
This study provides a benchmark framework for Source-Free Unsupervised Domain Adaptation (SF-UDA) in image classification.
The study empirically examines a diverse set of SF-UDA techniques, assessing their consistency across datasets.
It exhaustively evaluates pre-training datasets and strategies, particularly focusing on both supervised and self-supervised methods.
arXiv Detail & Related papers (2024-02-25T13:37:36Z) - Multi-modal Causal Structure Learning and Root Cause Analysis [67.67578590390907]
We propose Mulan, a unified multi-modal causal structure learning method for root cause localization.
We leverage a log-tailored language model to facilitate log representation learning, converting log sequences into time-series data.
We also introduce a novel key performance indicator-aware attention mechanism for assessing modality reliability and co-learning a final causal graph.
arXiv Detail & Related papers (2024-02-04T05:50:38Z) - Jointly Extracting Interventions, Outcomes, and Findings from RCT Reports with LLMs [21.868871974136884]
We propose and evaluate a text-to-text model built on instruction-tuned Large Language Models.
We apply our model to a collection of published RCTs through mid-2022, and release a searchable database of structured findings.
arXiv Detail & Related papers (2023-05-05T16:02:06Z) - A framework for causal segmentation analysis with machine learning in large-scale digital experiments [0.0]
We present an end-to-end methodological framework for causal segment discovery.
Our approach unifies two objectives: (1) the discovery of user segments that stand to benefit from a candidate treatment based on subgroup-specific treatment effects, and (2) the evaluation of causal impacts of dynamically assigning units to a study's treatment arm based on their predicted segment-specific benefit or harm.
arXiv Detail & Related papers (2021-11-01T19:22:27Z) - SAIS: Supervising and Augmenting Intermediate Steps for Document-Level Relation Extraction [51.27558374091491]
We propose to explicitly teach the model to capture relevant contexts and entity types by supervising and augmenting intermediate steps (SAIS) for relation extraction.
Based on a broad spectrum of carefully designed tasks, our proposed SAIS method not only extracts relations of better quality due to more effective supervision, but also retrieves the corresponding supporting evidence more accurately.
arXiv Detail & Related papers (2021-09-24T17:37:35Z)