An Auditable Pipeline for Fuzzy Full-Text Screening in Systematic Reviews: Integrating Contrastive Semantic Highlighting and LLM Judgment
- URL: http://arxiv.org/abs/2508.15822v1
- Date: Sun, 17 Aug 2025 17:41:50 GMT
- Title: An Auditable Pipeline for Fuzzy Full-Text Screening in Systematic Reviews: Integrating Contrastive Semantic Highlighting and LLM Judgment
- Authors: Pouria Mortezaagha, Arya Rahgozar
- Abstract summary: Full-text screening is the major bottleneck of systematic reviews. We present a scalable, auditable pipeline that reframes inclusion/exclusion as a fuzzy decision problem.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Full-text screening is the major bottleneck of systematic reviews (SRs), as decisive evidence is dispersed across long, heterogeneous documents and rarely admits static, binary rules. We present a scalable, auditable pipeline that reframes inclusion/exclusion as a fuzzy decision problem and benchmark it against statistical and crisp baselines in the context of the Population Health Modelling Consensus Reporting Network for noncommunicable diseases (POPCORN). Articles are parsed into overlapping chunks and embedded with a domain-adapted model; for each criterion (Population, Intervention, Outcome, Study Approach), we compute contrastive similarity (inclusion-exclusion cosine) and a vagueness margin, which a Mamdani fuzzy controller maps into graded inclusion degrees with dynamic thresholds in a multi-label setting. A large language model (LLM) judge adjudicates highlighted spans with tertiary labels, confidence scores, and criterion-referenced rationales; when evidence is insufficient, fuzzy membership is attenuated rather than excluded. In a pilot on an all-positive gold set (16 full texts; 3,208 chunks), the fuzzy system achieved recall of 81.3% (Population), 87.5% (Intervention), 87.5% (Outcome), and 75.0% (Study Approach), surpassing statistical (56.3-75.0%) and crisp baselines (43.8-81.3%). Strict "all-criteria" inclusion was reached for 50.0% of articles, compared to 25.0% and 12.5% under the baselines. Cross-model agreement on justifications was 98.3%, human-machine agreement 96.1%, and a pilot review showed 91% inter-rater agreement (kappa = 0.82), with screening time reduced from about 20 minutes to under 1 minute per article at significantly lower cost. These results show that fuzzy logic with contrastive highlighting and LLM adjudication yields high recall, stable rationale, and end-to-end traceability.
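The abstract's per-criterion scoring can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the contrastive score is the inclusion-minus-exclusion cosine described in the abstract, while the Mamdani fuzzy controller is replaced here by a hypothetical linear membership function with made-up thresholds (`lo`, `hi`) and an attenuation rule that pulls vague chunks toward 0.5 rather than excluding them.

```python
import numpy as np

def contrastive_score(chunk_vec, inc_vec, exc_vec):
    """Contrastive similarity: cosine to the inclusion prototype minus
    cosine to the exclusion prototype, plus a vagueness margin."""
    cos = lambda a, b: float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    s_inc, s_exc = cos(chunk_vec, inc_vec), cos(chunk_vec, exc_vec)
    contrast = s_inc - s_exc
    margin = abs(contrast)  # small margin = vague evidence
    return contrast, margin

def fuzzy_inclusion(contrast, margin, lo=-0.1, hi=0.3):
    """Toy stand-in for the paper's Mamdani controller: map the contrast
    to a graded inclusion degree in [0, 1], attenuated toward 0.5
    (undecided) when the vagueness margin is small. Thresholds are
    illustrative assumptions, not values from the paper."""
    degree = float(np.clip((contrast - lo) / (hi - lo), 0.0, 1.0))
    attenuation = float(np.clip(margin / 0.2, 0.0, 1.0))
    return 0.5 + (degree - 0.5) * attenuation
```

A chunk aligned with the inclusion prototype and orthogonal to the exclusion prototype gets a high graded degree; a chunk with a near-zero margin is attenuated toward 0.5 instead of being hard-excluded, matching the abstract's "attenuated rather than excluded" behavior.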
Related papers
- CORE: Comprehensive Ontological Relation Evaluation for Large Language Models [0.9668495520241466]
Large Language Models (LLMs) perform well on many reasoning benchmarks, yet existing evaluations rarely assess their ability to distinguish between meaningful semantic relations and genuine unrelatedness. We introduce CORE (Comprehensive Ontological Relation Evaluation), a dataset of 225K multiple-choice questions spanning 74 disciplines. A human baseline from 1,000+ participants achieves 92.6% accuracy (95.1% on unrelated pairs).
arXiv Detail & Related papers (2026-02-06T07:16:33Z) - ReasoningBomb: A Stealthy Denial-of-Service Attack by Inducing Pathologically Long Reasoning in Large Reasoning Models [67.15960154375131]
Large reasoning models (LRMs) extend large language models with explicit multi-step reasoning traces. This capability introduces a new class of prompt-induced inference-time denial-of-service (PI-DoS) attacks that exploit the high computational cost of reasoning. We present ReasoningBomb, a reinforcement-learning-based PI-DoS framework that is guided by a constant-time surrogate reward.
arXiv Detail & Related papers (2026-01-29T18:53:01Z) - EdgeJury: Cross-Reviewed Small-Model Ensembles for Truthful Question Answering on Serverless Edge Inference [0.0]
We present EdgeJury, a lightweight ensemble framework that improves truthfulness and robustness. On TruthfulQA (MC1), EdgeJury achieves 76.2% accuracy. On a 200-question adversarial EdgeCases set, EdgeJury yields +48.2% relative gains.
arXiv Detail & Related papers (2025-12-29T14:48:40Z) - NewsScope: Schema-Grounded Cross-Domain News Claim Extraction with Open Models [0.15039745292757667]
NewsScope is a cross-domain dataset, benchmark, and fine-tuned model for schema-grounded news claim extraction. The dataset contains 455 articles across politics, health, science/environment, and business. LLaMA 3.1 8B was fine-tuned using LoRA on 315 training examples and evaluated on held-out in-domain (80 articles) and out-of-source (60 articles) test sets.
arXiv Detail & Related papers (2025-12-26T19:17:21Z) - AI Transparency Atlas: Framework, Scoring, and Real-Time Model Card Evaluation Pipeline [2.1787849426740364]
We analyzed documentation from five frontier models (Gemini 3, Grok 4.1, Llama 4, GPT-5, and Claude 4.5) and 100 Hugging Face model cards. We developed a weighted transparency framework with 8 sections and 23 subsections that prioritizes safety-critical disclosures.
arXiv Detail & Related papers (2025-12-13T19:48:44Z) - CLUE: Non-parametric Verification from Experience via Hidden-State Clustering [64.50919789875233]
We show that the correctness of a solution is encoded as a geometrically separable signature within the trajectory of hidden activations. CLUE consistently outperforms LLM-as-a-judge baselines and matches or exceeds modern confidence-based methods in reranking candidates.
arXiv Detail & Related papers (2025-10-02T02:14:33Z) - TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them [58.04324690859212]
The use of Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks. We identify two fundamental types of inconsistencies: Score-Comparison Inconsistency and Pairwise Transitivity Inconsistency. We propose TrustJudge, a probabilistic framework that addresses these limitations through two key innovations.
arXiv Detail & Related papers (2025-09-25T13:04:29Z) - GANDiff FR: Hybrid GAN Diffusion Synthesis for Causal Bias Attribution in Face Recognition [0.0]
We introduce GANDiff FR, the first synthetic framework that precisely controls demographic and environmental factors to measure, explain, and reduce bias with reproducible rigor. We synthesize 10,000 demographically balanced faces across five cohorts validated for realism via automated detection and human review. Benchmarking ArcFace, CosFace, and AdaFace under matched operating points shows AdaFace reduces inter-group TPR disparity by 60%. Despite around 20% computational overhead relative to pure GANs, GANDiff FR yields three times more attribute-conditioned variants.
arXiv Detail & Related papers (2025-08-15T09:05:57Z) - GenFair: Systematic Test Generation for Fairness Fault Detection in Large Language Models [0.12891210250935142]
Large Language Models (LLMs) are increasingly deployed in critical domains, yet they often exhibit biases inherited from training data, leading to fairness concerns. This work focuses on the problem of effectively detecting fairness violations, especially intersectional biases that are often missed by existing template-based and grammar-based testing methods. We propose GenFair, a metamorphic fairness testing framework that generates source test cases using equivalence partitioning, mutation operators, and boundary value analysis.
arXiv Detail & Related papers (2025-06-03T16:00:30Z) - Benchmarking Reasoning Robustness in Large Language Models [76.79744000300363]
We find significant performance degradation on novel or incomplete data. These findings highlight the reliance on recall over rigorous logical inference. This paper introduces a novel benchmark, termed Math-RoB, that exploits hallucinations triggered by missing information to expose reasoning gaps.
arXiv Detail & Related papers (2025-03-06T15:36:06Z) - Agent-Based Uncertainty Awareness Improves Automated Radiology Report Labeling with an Open-Source Large Language Model [1.7064514726335305]
We analyzed 9,683 Hebrew radiology reports from Crohn's disease patients. We incorporated uncertainty-aware prompt ensembles and an agent-based decision model.
arXiv Detail & Related papers (2025-02-02T16:57:03Z) - Streamlining Systematic Reviews: A Novel Application of Large Language Models [1.921297555859566]
Systematic reviews (SRs) are essential for evidence-based guidelines but are often limited by the time-consuming nature of literature screening. We propose and evaluate an in-house system based on Large Language Models (LLMs) for automating both title/abstract and full-text screening.
arXiv Detail & Related papers (2024-12-14T17:08:34Z) - Probabilistic Consensus through Ensemble Validation: A Framework for LLM Reliability [0.0]
Large Language Models (LLMs) have shown significant advances in text generation but often lack the reliability needed for autonomous deployment.
We introduce a novel framework that repurposes ensemble methods for content validation through model consensus.
In tests across 78 complex cases requiring factual accuracy and causal consistency, our framework improved precision from 73.1% to 93.9%.
arXiv Detail & Related papers (2024-11-10T17:32:16Z) - Exploring Response Uncertainty in MLLMs: An Empirical Evaluation under Misleading Scenarios [49.53589774730807]
Multimodal large language models (MLLMs) have recently achieved state-of-the-art performance on tasks ranging from visual question answering to video understanding. We reveal a response uncertainty phenomenon: twelve state-of-the-art open-source MLLMs overturn a previously correct answer in 65% of cases after receiving a single deceptive cue.
arXiv Detail & Related papers (2024-11-05T01:11:28Z) - CausalDiff: Causality-Inspired Disentanglement via Diffusion Model for Adversarial Defense [61.78357530675446]
Humans are hard to fool with subtle manipulations, since we make judgments based only on essential factors. Inspired by this observation, we attempt to model label generation with essential label-causative factors and incorporate label-non-causative factors to assist data generation. For an adversarial example, we aim to discriminate perturbations as non-causative factors and make predictions based only on the label-causative factors.
arXiv Detail & Related papers (2024-10-30T15:06:44Z) - LLMs Can Patch Up Missing Relevance Judgments in Evaluation [56.51461892988846]
We use large language models (LLMs) to automatically label unjudged documents.
We simulate scenarios with varying degrees of holes by randomly dropping relevant documents from the relevance judgment in TREC DL tracks.
Our method achieves a Kendall tau correlation of 0.87 and 0.92 on an average for Vicuna-7B and GPT-3.5 Turbo respectively.
arXiv Detail & Related papers (2024-05-08T00:32:19Z) - Patch-Level Contrasting without Patch Correspondence for Accurate and Dense Contrastive Representation Learning [79.43940012723539]
ADCLR is a self-supervised learning framework for learning accurate and dense vision representation.
Our approach achieves new state-of-the-art performance for contrastive methods.
arXiv Detail & Related papers (2023-06-23T07:38:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.