LLM-Assisted Abstract Screening with OLIVER: Evaluating Calibration and Single-Model vs. Actor-Critic Configurations in Literature Reviews
- URL: http://arxiv.org/abs/2512.20022v1
- Date: Tue, 23 Dec 2025 03:32:43 GMT
- Title: LLM-Assisted Abstract Screening with OLIVER: Evaluating Calibration and Single-Model vs. Actor-Critic Configurations in Literature Reviews
- Authors: Kian Godhwani, David Benrimoh
- Abstract summary: We developed OLIVER (Optimized LLM-based Inclusion and Vetting Engine for Reviews), an open-source pipeline for abstract screening. We evaluated multiple contemporary LLMs across two non-Cochrane systematic reviews, assessing performance at both the full-text screening and final inclusion stages. Actor-critic screening improved discrimination and markedly reduced calibration error in both reviews, yielding higher AUCs.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Introduction: Recent work suggests large language models (LLMs) can accelerate screening, but prior evaluations focus on earlier LLMs, standardized Cochrane reviews, single-model setups, and accuracy as the primary metric, leaving generalizability, configuration effects, and calibration largely unexamined. Methods: We developed OLIVER (Optimized LLM-based Inclusion and Vetting Engine for Reviews), an open-source pipeline for LLM-assisted abstract screening. We evaluated multiple contemporary LLMs across two non-Cochrane systematic reviews, assessing performance at both the full-text screening and final inclusion stages using accuracy, AUC, and calibration metrics. We further tested an actor-critic screening framework combining two lightweight models under three aggregation rules. Results: Across individual models, performance varied widely. In the smaller Review 1 (821 abstracts, 63 final includes), several models achieved high sensitivity for final includes, but at the cost of substantial false positives and poor calibration. In the larger Review 2 (7741 abstracts, 71 final includes), most models were highly specific but struggled to recover true includes, with prompt design influencing recall. Calibration was consistently weak across single-model configurations despite high overall accuracy. Actor-critic screening improved discrimination and markedly reduced calibration error in both reviews, yielding higher AUCs. Discussion: LLMs may eventually accelerate abstract screening, but single-model performance is highly sensitive to review characteristics and prompting, and calibration is limited. An actor-critic framework improves classification quality and confidence reliability while remaining computationally efficient, enabling large-scale screening at low cost.
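The abstract reports calibration alongside accuracy and AUC but does not name the metric; expected calibration error (ECE) is the standard choice for this kind of analysis (and is the metric the LitCab entry below reports). A minimal binned-ECE sketch over screening confidences, with all names illustrative rather than taken from OLIVER:

```python
import numpy as np

def expected_calibration_error(confidences, labels, n_bins=10):
    """Binned ECE: mean |empirical accuracy - stated confidence| over
    equal-width bins, weighted by the fraction of items in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    labels = np.asarray(labels, dtype=int)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        bin_conf = confidences[mask].mean()   # mean stated include-probability
        bin_acc = labels[mask].mean()         # empirical include rate in bin
        ece += mask.mean() * abs(bin_acc - bin_conf)
    return ece

# Example: an overconfident screener yields a high ECE (~0.4 here).
print(expected_calibration_error([0.9, 0.95, 0.85, 0.9], [1, 0, 0, 1]))
```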
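The three aggregation rules used in the actor-critic framework are not defined in the abstract. The sketch below shows one plausible shape for such a setup, with three stand-in rules (union, intersection, mean); the function names, thresholds, and rules are assumptions, not OLIVER's actual implementation:

```python
from typing import Callable, Dict

# Hypothetical aggregation rules over actor/critic include-probabilities.
# The abstract mentions three rules but does not define them; these are
# illustrative stand-ins, not OLIVER's actual rules.
AGGREGATORS: Dict[str, Callable[[float, float], bool]] = {
    "union":        lambda a, c: a >= 0.5 or c >= 0.5,   # sensitive: either flags it
    "intersection": lambda a, c: a >= 0.5 and c >= 0.5,  # specific: both must agree
    "mean":         lambda a, c: (a + c) / 2 >= 0.5,     # averaged confidence
}

def screen_abstract(abstract: str, actor, critic, rule: str = "mean") -> bool:
    """Actor proposes an include-probability; the critic re-scores the
    abstract given the actor's verdict; an aggregation rule makes the call."""
    p_actor = actor(abstract)             # e.g., a lightweight LLM call
    p_critic = critic(abstract, p_actor)  # critic sees the actor's score
    return AGGREGATORS[rule](p_actor, p_critic)

# Toy usage with stubbed models (real use would call LLM APIs):
include = screen_abstract(
    "Randomized trial of ...",
    actor=lambda text: 0.8,
    critic=lambda text, p: 0.4,
    rule="mean",
)  # (0.8 + 0.4) / 2 = 0.6 >= 0.5 -> True
```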
Related papers
- DeepCritic: Deliberate Critique with Large Language Models [77.5516314477878]
We focus on studying and enhancing the math critique ability of Large Language Models (LLMs). Our critique model, built on Qwen2.5-7B-Instruct, significantly outperforms existing LLM critics on various error identification benchmarks.
arXiv Detail & Related papers (2025-05-01T17:03:17Z)
- Self-Evolving Critique Abilities in Large Language Models [59.861013614500024]
This paper explores enhancing the critique abilities of Large Language Models (LLMs). We introduce SCRIT, a framework that trains LLMs with self-generated data to evolve their critique abilities. Our analysis reveals that SCRIT's performance scales positively with data and model size.
arXiv Detail & Related papers (2025-01-10T05:51:52Z)
- Influences on LLM Calibration: A Study of Response Agreement, Loss Functions, and Prompt Styles [4.477423478591491]
Calib-n is a novel framework that trains an auxiliary model for confidence estimation. We find that few-shot prompts are the most effective for auxiliary model-based methods.
arXiv Detail & Related papers (2025-01-07T18:48:42Z)
- LLM2: Let Large Language Models Harness System 2 Reasoning [65.89293674479907]
Large language models (LLMs) have exhibited impressive capabilities across a myriad of tasks, yet they occasionally yield undesirable outputs. We introduce LLM2, a novel framework that combines an LLM with a process-based verifier. The LLM is responsible for generating plausible candidates, while the verifier provides timely process-based feedback to distinguish desirable and undesirable outputs (a minimal sketch of this generate-then-verify pattern appears after the list).
arXiv Detail & Related papers (2024-12-29T06:32:36Z)
- Training Language Models to Critique With Multi-agent Feedback [102.42751835338233]
The MultiCritique pipeline improves the critique ability of LLMs by using multi-agent feedback.
The pipeline aggregates high-quality critiques from multiple agents instead of a single model.
Our fine-tuned 7B model significantly surpasses other advanced 7B-13B open-source models.
arXiv Detail & Related papers (2024-10-20T04:57:45Z)
- Learning to Refine with Fine-Grained Natural Language Feedback [81.70313509881315]
We propose looking at refinement with feedback as a composition of three distinct LLM competencies. A key property of the proposed Detect, Critique, Refine ("DCR") method is that the step-2 critique model can give fine-grained feedback about errors. We show that models of different capabilities benefit from refining with DCR on the task of improving the factual consistency of document-grounded summaries.
arXiv Detail & Related papers (2024-07-02T16:15:01Z)
- Towards Calibrated Robust Fine-Tuning of Vision-Language Models [97.19901765814431]
This work proposes a robust fine-tuning method that improves both OOD accuracy and confidence calibration simultaneously in vision-language models.
We show that both OOD classification and OOD calibration errors have a shared upper bound consisting of two terms of ID data.
Based on this insight, we design a novel framework that conducts fine-tuning with a constrained multimodal contrastive loss enforcing a larger smallest singular value.
arXiv Detail & Related papers (2023-11-03T05:41:25Z)
- LitCab: Lightweight Language Model Calibration over Short- and Long-form Responses [14.77013588561901]
We present LitCab, a lightweight calibration mechanism consisting of a single linear layer that takes the input text representation and predicts a bias term.
For evaluation, we construct CaT, a benchmark consisting of eight text generation tasks, covering responses ranging from short phrases to paragraphs.
We test LitCab with Llama2-7B, where it improves calibration across all tasks, reducing the average ECE score by as much as 30%.
arXiv Detail & Related papers (2023-10-30T00:30:34Z)
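The LitCab entry above describes a single linear layer that maps the input text representation to a bias term added to the model's output. A minimal PyTorch sketch of that idea; the layer shape and the point where the bias is applied are inferred from the abstract alone, not from the LitCab code:

```python
import torch
import torch.nn as nn

class LitCabHead(nn.Module):
    """Single linear layer that predicts a per-token bias term from the
    input text representation; the bias is added to the LM's output logits.
    Training loss and choice of representation follow the paper, not this
    sketch."""
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.bias_head = nn.Linear(hidden_size, vocab_size)

    def forward(self, logits: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        # logits: (batch, vocab_size), hidden: (batch, hidden_size)
        return logits + self.bias_head(hidden)  # calibrated logits
```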
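The LLM2 entry above describes pairing a generator with a process-based verifier. The paper's verifier gives feedback during generation; the sketch below simplifies this to post-hoc candidate scoring, and `generator` and `verifier` are hypothetical callables standing in for actual model calls:

```python
def generate_then_verify(prompt, generator, verifier, n_candidates=4):
    """LLM2-style System 1 / System 2 split: the generator proposes several
    candidates and the verifier scores each; the top-scored one is returned.
    A simplification of the paper's per-step feedback loop."""
    candidates = [generator(prompt) for _ in range(n_candidates)]
    return max(candidates, key=lambda c: verifier(prompt, c))
```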