InformGen: An AI Copilot for Accurate and Compliant Clinical Research Consent Document Generation
- URL: http://arxiv.org/abs/2504.00934v1
- Date: Tue, 01 Apr 2025 16:14:48 GMT
- Title: InformGen: An AI Copilot for Accurate and Compliant Clinical Research Consent Document Generation
- Authors: Zifeng Wang, Junyi Gao, Benjamin Danek, Brandon Theodorou, Ruba Shaik, Shivashankar Thati, Seunghyun Won, Jimeng Sun
- Abstract summary: We present InformGen, an LLM-driven copilot for accurate and compliant drafting of informed consent forms (ICFs). Experimental results demonstrate that InformGen achieves near 100% compliance with 18 core regulatory rules derived from FDA guidelines. When integrated with manual intervention, InformGen attains over 90% factual accuracy, significantly surpassing the vanilla GPT-4o model's 57%-82%.
- Score: 22.52678425661723
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Leveraging large language models (LLMs) to generate high-stakes documents, such as informed consent forms (ICFs), remains a significant challenge due to the extreme need for regulatory compliance and factual accuracy. Here, we present InformGen, an LLM-driven copilot for accurate and compliant ICF drafting by optimized knowledge document parsing and content generation, with humans in the loop. We further construct a benchmark dataset comprising protocols and ICFs from 900 clinical trials. Experimental results demonstrate that InformGen achieves near 100% compliance with 18 core regulatory rules derived from FDA guidelines, outperforming a vanilla GPT-4o model by up to 30%. Additionally, a user study with five annotators shows that InformGen, when integrated with manual intervention, attains over 90% factual accuracy, significantly surpassing the vanilla GPT-4o model's 57%-82%. Crucially, InformGen ensures traceability by providing inline citations to source protocols, enabling easy verification and maintaining the highest standards of factual integrity.
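The rule-based compliance checking described in the abstract can be illustrated with a minimal sketch. The paper's 18 FDA-derived rules are not enumerated here, so the rule names and keyword patterns below are hypothetical stand-ins, not InformGen's actual implementation:

```python
import re

# Hypothetical examples of FDA-derived ICF rules; the paper's real 18 rules
# are not listed in the abstract, so these checks are purely illustrative.
RULES = {
    "states_voluntary_participation": r"\bvoluntary\b",
    "mentions_right_to_withdraw": r"\bwithdraw\b",
    "discloses_risks": r"\brisks?\b",
}

def check_compliance(icf_text: str) -> dict:
    """Return per-rule pass/fail for a drafted ICF section."""
    return {name: bool(re.search(pattern, icf_text, re.IGNORECASE))
            for name, pattern in RULES.items()}

def compliance_rate(icf_text: str) -> float:
    """Fraction of rules the draft satisfies."""
    results = check_compliance(icf_text)
    return sum(results.values()) / len(results)

draft = ("Participation is voluntary and you may withdraw at any time. "
         "The known risks of the study drug are listed below.")
print(compliance_rate(draft))  # 1.0
```

A real system would of course need semantic checks rather than keyword matching, plus the protocol-citation tracking the paper describes; this only shows the shape of a rule-per-requirement audit.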
Related papers
- Case Study: Fine-tuning Small Language Models for Accurate and Private CWE Detection in Python Code [0.0]
Large Language Models (LLMs) have demonstrated significant capabilities in understanding and analyzing code for security vulnerabilities.
This work explores the potential of Small Language Models (SLMs) as a viable alternative for accurate, on-premise vulnerability detection.
arXiv Detail & Related papers (2025-04-23T10:05:27Z) - Uncertainty-aware Long-tailed Weights Model the Utility of Pseudo-labels for Semi-supervised Learning [50.868594148443215]
We propose an Uncertainty-aware Ensemble Structure (UES) to assess the utility of pseudo-labels for unlabeled samples.
UES is lightweight and architecture-agnostic, easily extending to various computer vision tasks, including classification and regression.
arXiv Detail & Related papers (2025-03-13T02:21:04Z) - InfoSEM: A Deep Generative Model with Informative Priors for Gene Regulatory Network Inference [6.17096244556794]
Inferring Gene Regulatory Networks (GRNs) from gene expression data is crucial for understanding biological processes.
We introduce InfoSEM, an unsupervised generative model that leverages textual gene embeddings as informative priors.
We propose a biologically motivated benchmarking framework that better reflects real-world applications such as biomarker discovery.
arXiv Detail & Related papers (2025-03-06T14:32:00Z) - Development and Validation of the Provider Documentation Summarization Quality Instrument for Large Language Models [3.0569643495382173]
The Provider Documentation Summarization Quality Instrument (PDSQI-9) was developed to evaluate LLM-generated clinical summaries.
Validation included Pearson correlation for substantive validity, and factor analysis and Cronbach's alpha for structural validity.
The PDSQI-9 demonstrated strong internal consistency and high inter-rater reliability.
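Of the reliability statistics mentioned, Cronbach's alpha has a simple closed form and can be sketched directly. The score layout below (one row per rated summary, one column per instrument item) is an assumption for illustration:

```python
def cronbach_alpha(scores):
    """Cronbach's alpha = k/(k-1) * (1 - sum(item variances) / variance of totals).

    scores: one row per rated summary, one column per instrument item
    (this layout is assumed for illustration, not taken from the paper).
    """
    k = len(scores[0])  # number of items

    def var(xs):  # sample variance (n - 1 denominator)
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    item_vars = [var([row[j] for row in scores]) for j in range(k)]
    total_var = var([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Perfectly consistent items yield alpha = 1.0
print(cronbach_alpha([[1, 1], [2, 2], [3, 3]]))  # 1.0
```

In practice one would use a statistics library, but the formula itself is this small.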
arXiv Detail & Related papers (2025-01-15T17:47:57Z) - Zero-Shot ATC Coding with Large Language Models for Clinical Assessments [40.72273945475456]
Manual assignment of Anatomical Therapeutic Chemical (ATC) codes to prescription records is a significant bottleneck. We develop a practical approach using locally deployable large language models (LLMs). We evaluate our approach using GPT-4o as an accuracy ceiling and focus development on open-source Llama models suitable for privacy-sensitive deployment.
arXiv Detail & Related papers (2024-12-10T18:43:02Z) - Automated Proof Generation for Rust Code via Self-Evolution [69.25795662658356]
We introduce SAFE, a novel framework that overcomes the lack of human-written proof to enable automated proof generation of Rust code.
We demonstrate superior efficiency and precision compared to GPT-4o.
This advancement leads to a significant improvement in performance, achieving a 70.50% accuracy rate in a benchmark crafted by human experts.
arXiv Detail & Related papers (2024-10-21T08:15:45Z) - DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data [65.5290035371111]
We introduce an approach to generate extensive Lean 4 proof data derived from high-school and undergraduate-level mathematical competition problems.
We fine-tune the DeepSeekMath 7B model on this synthetic dataset, which comprises 8 million formal statements with proofs.
Our model successfully proved 5 out of 148 problems in the Lean 4 Formalized International Mathematical Olympiad (FIMO) benchmark, while GPT-4 failed to prove any.
arXiv Detail & Related papers (2024-05-23T09:03:42Z) - Use of a Structured Knowledge Base Enhances Metadata Curation by Large Language Models [2.186740861187042]
Metadata play a crucial role in ensuring the findability, accessibility, interoperability, and reusability of datasets. This paper investigates the potential of large language models (LLMs) to improve adherence to metadata standards. We conducted experiments on 200 random data records describing human samples relating to lung cancer from the NCBI BioSample repository.
arXiv Detail & Related papers (2024-04-08T22:29:53Z) - Benchmarking and Improving Generator-Validator Consistency of Language Models [82.73914625520686]
Inconsistency between generating and validating an answer is prevalent in language models (LMs).
Even GPT-4, a state-of-the-art LM, is GV-consistent only 76% of the time.
We find that this approach improves GV-consistency of Alpaca-30B from 60% to 93%.
arXiv Detail & Related papers (2023-10-03T07:23:22Z) - Fact-Checking Generative AI: Ontology-Driven Biological Graphs for Disease-Gene Link Verification [45.65374554914359]
We aim to fact-check the knowledge embedded in biological graphs constructed from ChatGPT-generated content.
We adopted a biological-networks approach that enables systematic interrogation of the entities ChatGPT links together.
This study demonstrated high accuracy of the aggregate disease-gene link relationships found in ChatGPT-generated texts.
arXiv Detail & Related papers (2023-08-07T22:13:30Z) - Investigating Crowdsourcing Protocols for Evaluating the Factual Consistency of Summaries [59.27273928454995]
Current pre-trained models applied to summarization are prone to factual inconsistencies which misrepresent the source text or introduce extraneous information.
We create a crowdsourcing evaluation framework for factual consistency using the rating-based Likert scale and ranking-based Best-Worst Scaling protocols.
We find that ranking-based protocols offer a more reliable measure of summary quality across datasets, while the reliability of Likert ratings depends on the target dataset and the evaluation design.
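Best-Worst Scaling scores are typically aggregated by simple counting; a minimal sketch of that aggregation follows (the tuple layout and summary names are illustrative, not from the paper):

```python
from collections import defaultdict

def best_worst_scores(judgments):
    """Aggregate Best-Worst Scaling judgments with the standard count score:
    (times chosen best - times chosen worst) / times shown.

    judgments: iterable of (shown_items, best_pick, worst_pick) tuples,
    one per annotator judgment.
    """
    best, worst, shown = defaultdict(int), defaultdict(int), defaultdict(int)
    for items, b, w in judgments:
        for item in items:
            shown[item] += 1
        best[b] += 1
        worst[w] += 1
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}

# Two annotators each pick the best and worst of three candidate summaries
judgments = [
    (("A", "B", "C"), "A", "C"),
    (("A", "B", "C"), "A", "B"),
]
print(best_worst_scores(judgments))  # {'A': 1.0, 'B': -0.5, 'C': -0.5}
```

Scores in [-1, 1] then rank the summaries; this counting scheme is what makes ranking-based protocols cheap to aggregate compared to fitting a rating model.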
arXiv Detail & Related papers (2021-09-19T19:05:00Z) - UNITE: Uncertainty-based Health Risk Prediction Leveraging Multi-sourced Data [81.00385374948125]
We present UNcertaInTy-based hEalth risk prediction (UNITE) model.
UNITE provides accurate disease risk prediction and uncertainty estimation leveraging multi-sourced health data.
We evaluate UNITE on real-world disease risk prediction tasks: nonalcoholic fatty liver disease (NASH) and Alzheimer's disease (AD).
UNITE achieves up to 0.841 in F1 score for AD detection and up to 0.609 in PR-AUC for NASH detection, outperforming various state-of-the-art baselines by up to 19%.
arXiv Detail & Related papers (2020-10-22T02:28:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers or the generated summaries and is not responsible for any consequences of their use.