A Benchmark for End-to-End Zero-Shot Biomedical Relation Extraction with LLMs: Experiments with OpenAI Models
- URL: http://arxiv.org/abs/2504.04083v2
- Date: Thu, 18 Sep 2025 23:51:48 GMT
- Title: A Benchmark for End-to-End Zero-Shot Biomedical Relation Extraction with LLMs: Experiments with OpenAI Models
- Authors: Aviv Brokman, Xuguang Ai, Yuhang Jiang, Shashank Gupta, Ramakanth Kavuluru
- Abstract summary: We study patterns in the performance of OpenAI LLMs across a diverse sampling of biomedical relation extraction tasks. Our work is the first to study and compare the performance of GPT-4, o1, and GPT-OSS on the end-to-end zero-shot biomedical RE task.
- Score: 8.888403936226185
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Objective: Zero-shot methodology promises to cut the dataset-annotation and domain-expertise costs of applying NLP. Generative large language models trained to align with human goals have achieved high zero-shot performance across a wide variety of tasks. It remains unclear, however, how well these models perform on biomedical relation extraction (RE). To address this knowledge gap, we explore patterns in the performance of OpenAI LLMs across a diverse sampling of RE tasks. Methods: We use OpenAI's GPT-4-turbo and its reasoning models o1 and GPT-OSS to conduct end-to-end RE experiments on seven datasets. We use the JSON generation capabilities of GPT models to produce structured output in two ways: (1) by defining an explicit schema describing the structure of relations, and (2) by using a setting that infers the structure from the prompt language. Results: Our work is the first to study and compare the performance of GPT-4, o1, and GPT-OSS on the end-to-end zero-shot biomedical RE task across a broad array of datasets. We found zero-shot performance to approach that of fine-tuned methods. The limitations of this approach are that it performs poorly on instances containing many relations and errs on the boundaries of textual mentions. Conclusion: LLMs exhibit promising zero-shot capabilities in complex biomedical RE tasks, offering competitive performance with reduced dataset-curation and NLP-modeling costs, but with increased ongoing compute costs. Addressing the limitations we identify could further boost reliability. The code, data, and prompts for all our experiments are publicly available for additional benchmarking by the community: https://github.com/bionlproc/ZeroShotRE
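For concreteness, the two structured-output settings described in the Methods can be sketched with the OpenAI Python SDK as below. This is a minimal illustration under stated assumptions: the drug-disease schema, prompt wording, and model names are placeholders, not the paper's actual prompts or task definitions (those are in the linked repository).

```python
# Sketch of the two structured-output settings: (1) an explicit JSON Schema,
# and (2) JSON mode, where the structure is inferred from the prompt alone.
# Schema, prompt, and model names are illustrative, not the authors' exact artifacts.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SENTENCE = "Tamoxifen is used to treat estrogen-receptor-positive breast cancer."
PROMPT = (
    "Extract all drug-disease treatment relations from the sentence below. "
    "Return JSON with a 'relations' list of {drug, disease} objects.\n\n"
    f"Sentence: {SENTENCE}"
)

# Setting 1: an explicit schema constrains the shape of the extracted relations.
schema = {
    "type": "object",
    "properties": {
        "relations": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "drug": {"type": "string"},
                    "disease": {"type": "string"},
                },
                "required": ["drug", "disease"],
                "additionalProperties": False,
            },
        }
    },
    "required": ["relations"],
    "additionalProperties": False,
}
resp = client.chat.completions.create(
    model="gpt-4o",  # schema-constrained output needs a model that supports json_schema
    messages=[{"role": "user", "content": PROMPT}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "relations", "schema": schema, "strict": True},
    },
)
print(json.loads(resp.choices[0].message.content))

# Setting 2: JSON mode; the model infers the relation structure from the prompt text.
resp = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": PROMPT}],
    response_format={"type": "json_object"},
)
print(json.loads(resp.choices[0].message.content))
```

In the schema-constrained setting the output is guaranteed to parse and match the declared structure; in JSON mode the structure must be carried entirely by the prompt language, mirroring the paper's two settings.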
Related papers
- Zero-Shot Document-Level Biomedical Relation Extraction via Scenario-based Prompt Design in Two-Stage with LLM [7.808231572590279]
We propose a novel approach that achieves the same results from unannotated full documents using general large language models (LLMs) with lower hardware and labor costs. Our approach combines two major stages: named entity recognition (NER) and relation extraction (RE). To enhance prompt effectiveness, we propose a five-part template structure and scenario-based prompt design principles.
arXiv Detail & Related papers (2025-05-02T07:33:20Z) - Towards Robust Universal Information Extraction: Benchmark, Evaluation, and Solution [66.11004226578771]
Existing robust benchmark datasets have two key limitations. They generate only a limited range of perturbations for a single Information Extraction (IE) task. Considering the powerful generation capabilities of Large Language Models (LLMs), we introduce a new benchmark dataset for Robust UIE, called RUIE-Bench. We show that training with only 15% of the data leads to an average 7.5% relative performance improvement across three IE tasks.
arXiv Detail & Related papers (2025-03-05T05:39:29Z) - SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models [74.40683913645731]
Zero-shot multi-label recognition (MLR) with Vision-Language Models (VLMs) faces significant challenges without training data, model tuning, or architectural modifications. Our work proposes a novel solution treating VLMs as black boxes, leveraging scores without training data or ground truth. Analysis of these prompt scores reveals VLM biases and 'AND'/'OR' signal ambiguities, notably that maximum scores are surprisingly suboptimal compared to second-highest scores.
arXiv Detail & Related papers (2025-02-24T07:15:05Z) - DataSciBench: An LLM Agent Benchmark for Data Science [33.3811507234528]
DataSciBench is a benchmark for evaluating Large Language Model (LLM) capabilities in data science. We develop a semi-automated pipeline for generating ground truth (GT) and validating evaluation metrics. We propose an innovative Task-Function-Code framework to assess each code execution outcome.
arXiv Detail & Related papers (2025-02-19T17:31:51Z) - Improving Equity in Health Modeling with GPT4-Turbo Generated Synthetic Data: A Comparative Study [1.85743121594882]
Demographic groups are often represented at different rates in medical datasets. These differences can create bias in machine learning algorithms, with higher levels of performance for better-represented groups. One promising solution is to generate synthetic data to mitigate the potential adverse effects of non-representative datasets.
arXiv Detail & Related papers (2024-12-20T20:49:17Z) - SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z) - DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z) - Improving Entity Recognition Using Ensembles of Deep Learning and Fine-tuned Large Language Models: A Case Study on Adverse Event Extraction from Multiple Sources [13.750202656564907]
Adverse event (AE) extraction is crucial for monitoring and analyzing the safety profiles of immunizations.
This study aims to evaluate the effectiveness of large language models (LLMs) and traditional deep learning models in AE extraction.
arXiv Detail & Related papers (2024-06-26T03:56:21Z) - ExaRanker-Open: Synthetic Explanation for IR using Open-Source LLMs [60.81649785463651]
We introduce ExaRanker-Open, where we adapt and explore the use of open-source language models to generate explanations.
Our findings reveal that incorporating explanations consistently enhances neural rankers, with benefits escalating as the LLM size increases.
arXiv Detail & Related papers (2024-02-09T11:23:14Z) - Relation Extraction in underexplored biomedical domains: A diversity-optimised sampling and synthetic data generation approach [0.0]
Sparsity of labelled data is an obstacle to the development of Relation Extraction models.
We create the first curated evaluation dataset and extract literature items from the LOTUS database to build training sets.
We evaluate the performance of standard fine-tuning as a generative task and few-shot learning with open Large Language Models.
arXiv Detail & Related papers (2023-11-10T19:36:00Z) - BioREx: Improving Biomedical Relation Extraction by Leveraging Heterogeneous Datasets [7.7587371896752595]
Biomedical relation extraction (RE) is a central task in biomedical natural language processing (NLP) research.
We present a novel framework for systematically addressing the data heterogeneity of individual datasets and combining them into a large dataset.
Our evaluation shows that BioREx achieves significantly higher performance than the benchmark system trained on individual datasets.
arXiv Detail & Related papers (2023-06-19T22:48:18Z) - TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations.
We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z) - CodeGen2: Lessons for Training LLMs on Programming and Natural Languages [116.74407069443895]
We unify encoder and decoder-based models into a single prefix-LM.
For learning methods, we explore the claim of a "free lunch" hypothesis.
For data distributions, the effect of a mixture distribution and multi-epoch training of programming and natural languages on model performance is explored.
arXiv Detail & Related papers (2023-05-03T17:55:25Z) - An Iterative Optimizing Framework for Radiology Report Summarization with ChatGPT [80.33783969507458]
The 'Impression' section of a radiology report is a critical basis for communication between radiologists and other physicians.
Recent studies have achieved promising results in automatic impression generation using large-scale medical text data.
These models often require substantial amounts of medical text data and have poor generalization performance.
arXiv Detail & Related papers (2023-04-17T17:13:42Z) - Does Synthetic Data Generation of LLMs Help Clinical Text Mining? [51.205078179427645]
We investigate the potential of OpenAI's ChatGPT to aid in clinical text mining.
We propose a new training paradigm that involves generating a vast quantity of high-quality synthetic data.
Our method has resulted in significant improvements in the performance of downstream tasks.
arXiv Detail & Related papers (2023-03-08T03:56:31Z) - ZeroGen: Efficient Zero-shot Learning via Dataset Generation [28.454620513642034]
We study a flexible and efficient zero-shot learning method, ZeroGen.
Given a zero-shot task, we first generate a dataset from scratch using PLMs in an unsupervised manner.
Experiments and analysis on different NLP tasks, namely, text classification, question answering, and natural language inference, show the effectiveness of ZeroGen.
arXiv Detail & Related papers (2022-02-16T08:18:02Z)