BioCoref: Benchmarking Biomedical Coreference Resolution with LLMs
- URL: http://arxiv.org/abs/2510.25087v1
- Date: Wed, 29 Oct 2025 01:51:00 GMT
- Title: BioCoref: Benchmarking Biomedical Coreference Resolution with LLMs
- Authors: Nourah M Salem, Elizabeth White, Michael Bada, Lawrence Hunter
- Abstract summary: We present a comprehensive evaluation of generative large language models (LLMs) for coreference resolution in biomedical texts. We benchmark these approaches against a discriminative span-based encoder, SpanBERT, to compare the efficacy of generative versus discriminative methods. Notably, the LLaMA 8B and 17B models show superior precision and F1 scores under entity-augmented prompting.
- Score: 2.770730728142587
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Coreference resolution in biomedical texts presents unique challenges due to complex domain-specific terminology, high ambiguity in mention forms, and long-distance dependencies between coreferring expressions. In this work, we present a comprehensive evaluation of generative large language models (LLMs) for coreference resolution in the biomedical domain. Using the CRAFT corpus as our benchmark, we assess the LLMs' performance with four prompting experiments that vary in their use of local context, contextual enrichment, and domain-specific cues such as abbreviations and entity dictionaries. We benchmark these approaches against a discriminative span-based encoder, SpanBERT, to compare the efficacy of generative versus discriminative methods. Our results demonstrate that while LLMs exhibit strong surface-level coreference capabilities, especially when supplemented with domain-grounding prompts, their performance remains sensitive to long-range context and mention ambiguity. Notably, the LLaMA 8B and 17B models show superior precision and F1 scores under entity-augmented prompting, highlighting the potential of lightweight prompt engineering for enhancing LLM utility in biomedical NLP tasks.
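As a concrete illustration, here is a minimal sketch of what entity-augmented prompting for coreference resolution might look like. The prompt wording, the entity-dictionary format, and the model name are illustrative assumptions, not the paper's exact templates or evaluated models.

```python
# Minimal sketch: ground the LLM with an entity dictionary before asking it
# to resolve coreference. Prompt wording is assumed, not the paper's template.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any chat-style LLM would do

def coref_prompt(passage: str, entities: dict[str, str]) -> str:
    """Build a coreference prompt augmented with known entities/abbreviations."""
    entity_lines = "\n".join(f"- {m}: {d}" for m, d in entities.items())
    return (
        "Resolve coreference in the biomedical passage below.\n"
        f"Known entities and abbreviations:\n{entity_lines}\n\n"
        f"Passage:\n{passage}\n\n"
        "List each coreference chain as a group of mentions, one group per line."
    )

passage = ("BMP-4 signaling regulates dorsal patterning. The protein is "
           "expressed early, and it binds its receptor.")
entities = {"BMP-4": "bone morphogenetic protein 4 (protein)"}

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # stand-in; the paper evaluates LLaMA-family models
    messages=[{"role": "user", "content": coref_prompt(passage, entities)}],
)
print(resp.choices[0].message.content)
```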
Related papers
- Applications of Small Language Models in Medical Imaging Classification with a Focus on Prompt Strategies [9.1953139634128]
This study investigates the performance of small language models (SLMs) in a medical imaging classification task. Using the NIH Chest X-ray dataset, we evaluate multiple SLMs on the task of classifying chest X-ray positions. Our results show that certain SLMs achieve competitive accuracy with well-crafted prompts.
arXiv Detail & Related papers (2025-08-18T21:48:45Z) - Specialised or Generic? Tokenization Choices for Radiology Language Models [2.081299660192454]
The vocabulary used by a language model (LM) plays a key role in text generation quality. We compare general, medical, and domain-specific tokenizers on the task of radiology report summarisation across three imaging modalities. Our findings demonstrate that medical and domain-specific vocabularies outperform widely used natural-language alternatives when models are trained from scratch.
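To make the tokenizer comparison concrete, the sketch below contrasts how a general-domain and a biomedical vocabulary split the same radiology sentence; the two model names are illustrative stand-ins, not the paper's exact tokenizers.

```python
# Compare how different vocabularies tokenize the same radiology sentence.
from transformers import AutoTokenizer

sentence = "No focal consolidation, pneumothorax, or pleural effusion."
for name in ["bert-base-uncased",                  # general-domain vocabulary
             "dmis-lab/biobert-base-cased-v1.2"]:  # biomedical vocabulary
    tok = AutoTokenizer.from_pretrained(name)
    pieces = tok.tokenize(sentence)
    print(f"{name}: {len(pieces)} tokens -> {pieces}")
```

A vocabulary that keeps terms like "pneumothorax" closer to intact leaves the model fewer fragments to reassemble during generation.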
arXiv Detail & Related papers (2025-08-13T17:13:56Z) - Leveraging Large Language Models for Rare Disease Named Entity Recognition [7.7603117695645265]
Named Entity Recognition (NER) in the rare disease domain poses unique challenges due to limited labeled data, semantic ambiguity between entity types, and long-tail distributions. In this study, we evaluate the capabilities of GPT-4o for rare disease NER under low-resource settings.
arXiv Detail & Related papers (2025-08-12T20:16:31Z) - Disambiguation in Conversational Question Answering in the Era of LLMs and Agents: A Survey [54.90240495777929]
Ambiguity remains a fundamental challenge in Natural Language Processing (NLP). With the advent of Large Language Models (LLMs), addressing ambiguity has become even more critical due to their expanded capabilities and applications. This paper explores the definition, forms, and implications of ambiguity for language-driven systems.
arXiv Detail & Related papers (2025-05-18T20:53:41Z) - Advancing AI Research Assistants with Expert-Involved Learning [84.30323604785646]
Large language models (LLMs) and large multimodal models (LMMs) promise to accelerate biomedical discovery, yet their reliability remains unclear. We introduce ARIEL (AI Research Assistant for Expert-in-the-Loop Learning), an open-source evaluation and optimization framework. We find that state-of-the-art models generate fluent but incomplete summaries, whereas LMMs struggle with detailed visual reasoning.
arXiv Detail & Related papers (2025-05-03T14:21:48Z) - Mitigating Hallucinations of Large Language Models in Medical Information Extraction via Contrastive Decoding [92.32881381717594]
We introduce ALternate Contrastive Decoding (ALCD) to mitigate hallucination in medical information extraction tasks.
ALCD demonstrates significant improvements over conventional decoding methods in reducing hallucinated outputs.
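For background, the generic contrastive-decoding step that ALCD builds on can be sketched in a few lines; this is not ALCD's exact alternating, sub-task-specific scheme, only the core logit-contrast idea.

```python
# Generic contrastive decoding at one step: favor tokens the stronger model
# likes more than the weaker one does. ALCD's alternating scheme is not
# reproduced here; this shows only the underlying contrast operation.
import torch

def contrastive_next_token(expert_logits: torch.Tensor,
                           amateur_logits: torch.Tensor,
                           alpha: float = 0.5) -> int:
    scores = expert_logits - alpha * amateur_logits
    return int(torch.argmax(scores))

vocab_size = 32000
expert = torch.randn(vocab_size)   # stand-ins for real model outputs
amateur = torch.randn(vocab_size)
print(contrastive_next_token(expert, amateur))
```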
arXiv Detail & Related papers (2024-10-21T07:19:19Z) - Zero-shot Causal Graph Extrapolation from Text via LLMs [50.596179963913045]
We evaluate the ability of large language models (LLMs) to infer causal relations from natural language.
LLMs show competitive performance in a benchmark of pairwise relations without needing (explicit) training samples.
We extend our approach to extrapolating causal graphs through iterated pairwise queries.
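A minimal sketch of the iterated pairwise-query idea follows; ask_llm is a hypothetical stand-in for a real model call, and the prompt and answer conventions are assumptions, not the paper's protocol.

```python
# Build a causal graph by querying an LLM once per variable pair.
from itertools import combinations

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a real LLM client here")

def causal_graph(variables: list[str]) -> list[tuple[str, str]]:
    edges = []
    for a, b in combinations(variables, 2):
        answer = ask_llm(
            f"Does {a} causally influence {b}, or the reverse? "
            "Answer exactly one of: 'a->b', 'b->a', 'none'."
        )
        if answer == "a->b":
            edges.append((a, b))
        elif answer == "b->a":
            edges.append((b, a))
    return edges  # directed edges recovered from pairwise judgments
```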
arXiv Detail & Related papers (2023-12-22T13:14:38Z) - Inspire the Large Language Model by External Knowledge on BioMedical Named Entity Recognition [3.427366431933441]
Large language models (LLMs) have demonstrated dominant performance in many NLP tasks, especially on generative tasks.
We leverage LLMs by decomposing the biomedical NER task into entity span extraction and entity type determination.
Experimental results show a significant improvement from our two-step BioNER approach compared to the previous few-shot LLM baseline.
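The two-step decomposition can be sketched as two chained prompts; ask_llm is a hypothetical stand-in for a real model call, and the prompt wording is an illustrative assumption, not the paper's templates.

```python
# Two-step BioNER sketch: (1) extract candidate spans, (2) type each span.
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a real LLM client here")

def two_step_bioner(text: str, types: list[str]) -> list[tuple[str, str]]:
    spans = ask_llm(
        f"Extract all biomedical entity mentions from the text below, "
        f"one per line.\n\n{text}"
    ).splitlines()
    labeled = []
    for span in spans:
        etype = ask_llm(
            f"Text: {text}\n"
            f"Which type is '{span.strip()}'? Choose one of: {', '.join(types)}."
        )
        labeled.append((span.strip(), etype.strip()))
    return labeled
```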
arXiv Detail & Related papers (2023-09-21T17:39:53Z) - Towards Effective Disambiguation for Machine Translation with Large Language Models [65.80775710657672]
We study the capabilities of large language models to translate "ambiguous sentences."
Experiments show that our methods can match or outperform state-of-the-art systems such as DeepL and NLLB in four out of five language directions.
arXiv Detail & Related papers (2023-09-20T22:22:52Z) - Self-Verification Improves Few-Shot Clinical Information Extraction [73.6905567014859]
Large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning.
They still struggle with accuracy and interpretability, especially in mission-critical domains such as healthcare.
Here, we explore a general mitigation framework using self-verification, which leverages the LLM to provide provenance for its own extraction and check its own outputs.
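A minimal sketch of the extract-then-verify pattern follows, assuming a hypothetical ask_llm helper; the prompts and the UNSUPPORTED convention are illustrative, not the framework's exact interface.

```python
# Self-verification sketch: extract items, then ask the model to ground each
# one in the source note and drop anything it cannot support with a quote.
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a real LLM client here")

def extract_with_verification(note: str, field: str) -> list[str]:
    items = ask_llm(
        f"From this clinical note, list every {field}, one per line:\n{note}"
    ).splitlines()
    verified = []
    for item in items:
        evidence = ask_llm(
            f"Note:\n{note}\n\nQuote the sentence showing that "
            f"'{item.strip()}' is a {field}, or reply exactly 'UNSUPPORTED'."
        )
        if evidence.strip() != "UNSUPPORTED":
            verified.append(item.strip())  # keep only provenance-backed items
    return verified
```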
arXiv Detail & Related papers (2023-05-30T22:05:11Z) - Detecting Idiomatic Multiword Expressions in Clinical Terminology using Definition-Based Representation Learning [12.30055843580139]
We develop an effective tool for scoring the idiomaticity of biomedical MWEs based on the degree of similarity between the semantic representations of those MWEs and a weighted average of the representations of their constituents.
Our results show that the BioLORD model has a strong ability to identify idiomatic MWEs, an ability not replicated in other models.
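The scoring idea can be sketched with off-the-shelf sentence embeddings; the encoder name and the uniform constituent weights below are simplifying assumptions, whereas the paper builds on BioLORD representations with a weighted average.

```python
# Idiomaticity sketch: an MWE whose embedding drifts far from the average of
# its constituents' embeddings is likely non-compositional (idiomatic).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in encoder

def idiomaticity(mwe: str) -> float:
    whole = model.encode(mwe)
    avg = np.mean(model.encode(mwe.split()), axis=0)  # uniform weights here
    cosine = float(np.dot(whole, avg) /
                   (np.linalg.norm(whole) * np.linalg.norm(avg)))
    return 1.0 - cosine  # higher = less compositional

print(idiomaticity("charley horse"))  # idiomatic expression
print(idiomaticity("chest pain"))     # fairly compositional term
```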
arXiv Detail & Related papers (2023-05-11T13:42:58Z) - Benchmarking large language models for biomedical natural language processing applications and recommendations [22.668383945059762]
Large Language Models (LLMs) have shown promise in general domains. We compare their zero-shot, few-shot, and fine-tuning performance with traditional fine-tuning of BERT or BART models. We find issues like missing information and hallucinations in LLM outputs.
arXiv Detail & Related papers (2023-05-10T13:40:06Z)