Edinburgh Clinical NLP at SemEval-2024 Task 2: Fine-tune your model unless you have access to GPT-4
- URL: http://arxiv.org/abs/2404.00484v1
- Date: Sat, 30 Mar 2024 22:27:21 GMT
- Title: Edinburgh Clinical NLP at SemEval-2024 Task 2: Fine-tune your model unless you have access to GPT-4
- Authors: Aryo Pradipta Gema, Giwon Hong, Pasquale Minervini, Luke Daines, Beatrice Alex
- Abstract summary: We evaluate various Large Language Models (LLMs) with multiple strategies, including Chain-of-Thought, In-Context Learning, and Parameter-Efficient Fine-Tuning (PEFT).
We found that merging the two PEFT adapters improves the F1 score (+0.0346) and consistency (+0.152) of the LLMs.
Averaging the three metrics, GPT-4 ranks joint-first in the competition with 0.8328.
- Score: 10.01547158445743
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The NLI4CT task assesses Natural Language Inference systems in predicting whether hypotheses entail or contradict evidence from Clinical Trial Reports. In this study, we evaluate various Large Language Models (LLMs) with multiple strategies, including Chain-of-Thought, In-Context Learning, and Parameter-Efficient Fine-Tuning (PEFT). We propose a PEFT method to improve the consistency of LLMs by merging adapters that were fine-tuned separately using triplet and language modelling objectives. We found that merging the two PEFT adapters improves the F1 score (+0.0346) and consistency (+0.152) of the LLMs. However, our novel methods did not produce more accurate results than GPT-4 in terms of faithfulness and consistency. Averaging the three metrics, GPT-4 ranks joint-first in the competition with 0.8328. Finally, our contamination analysis with GPT-4 indicates that there was no test data leakage.
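The adapter-merging method described in the abstract can be sketched compactly. Below is a minimal, hypothetical example using the Hugging Face peft library's weighted adapter merging; the base model, adapter paths, adapter names, and equal weights are illustrative assumptions, not the authors' released configuration.

```python
# Hypothetical sketch: merging two LoRA adapters, one fine-tuned with a
# triplet objective and one with a language-modelling objective. Paths,
# names, and weights are illustrative, not from the paper's code release.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Load the two separately fine-tuned adapters under distinct names.
model = PeftModel.from_pretrained(base, "adapters/triplet", adapter_name="triplet")
model.load_adapter("adapters/lm", adapter_name="lm")

# Create a merged adapter as a weighted combination of the two.
model.add_weighted_adapter(
    adapters=["triplet", "lm"],
    weights=[0.5, 0.5],
    adapter_name="merged",
    combination_type="linear",
)
model.set_adapter("merged")  # run inference with the merged adapter
```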
Related papers
- MedSlice: Fine-Tuned Large Language Models for Secure Clinical Note Sectioning [2.4060718165478376]
Fine-tuned open-source LLMs can surpass proprietary models in clinical note sectioning.
This study focuses on three sections: History of Present Illness, Interval History, and Assessment and Plan.
arXiv Detail & Related papers (2025-01-23T21:32:09Z)
- LLM Robustness Against Misinformation in Biomedical Question Answering [50.98256373698759]
The retrieval-augmented generation (RAG) approach is used to reduce the confabulation of large language models (LLMs) for question answering.
We evaluate the effectiveness and robustness of four LLMs against misinformation in answering biomedical questions.
arXiv Detail & Related papers (2024-10-27T16:23:26Z)
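The RAG pattern referenced in this entry reduces to: retrieve the passages most similar to the question and condition the model on them. A self-contained toy sketch follows; the corpus, bag-of-words scoring, and prompt wording are illustrative assumptions, not the paper's setup.

```python
# Toy sketch of retrieval-augmented generation (RAG): retrieve the passages
# most similar to the question and prepend them to the prompt so the LLM
# answers from evidence rather than parametric memory. All data illustrative.
from collections import Counter
import math

CORPUS = [
    "Metformin is a first-line treatment for type 2 diabetes.",
    "Rituximab is used in diffuse large B-cell lymphoma regimens.",
    "Statins lower LDL cholesterol and cardiovascular risk.",
]

def bow(text: str) -> Counter:
    """Bag-of-words term counts as a stand-in for a learned embedding."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, k: int = 2) -> list[str]:
    q = bow(question)
    return sorted(CORPUS, key=lambda p: cosine(q, bow(p)), reverse=True)[:k]

def build_prompt(question: str) -> str:
    context = "\n".join(retrieve(question))
    return f"Answer using only the context below.\nContext:\n{context}\nQuestion: {question}\nAnswer:"

print(build_prompt("Which drug is used for diffuse large B-cell lymphoma?"))
```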
- CACER: Clinical Concept Annotations for Cancer Events and Relations [22.866006682711284]
We present Clinical Concept Annotations for Cancer Events and Relations (CACER), a novel corpus with fine-grained annotations for over 48,000 medical problems and drug events.
We develop and evaluate transformer-based information extraction models using fine-tuning and in-context learning.
arXiv Detail & Related papers (2024-09-05T20:42:35Z)
- Relation Extraction Using Large Language Models: A Case Study on Acupuncture Point Locations [12.632106431145047]
Generative Pre-trained Transformers (GPT) present a significant opportunity for extracting relations related to acupoint locations.
This study compares the performance of GPT with traditional deep learning models, Long Short-Term Memory (LSTM) and Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT).
Fine-tuned GPT-3.5 consistently outperformed other models in F1 scores across all relation types.
arXiv Detail & Related papers (2024-04-08T11:33:00Z)
- How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts [54.07541591018305]
We present MAD-Bench, a benchmark that contains 1000 test samples divided into 5 categories, such as non-existent objects, count of objects, and spatial relationship.
We provide a comprehensive analysis of popular MLLMs, ranging from GPT-4v, Reka, Gemini-Pro, to open-sourced models, such as LLaVA-NeXT and MiniCPM-Llama3.
While GPT-4o achieves 82.82% accuracy on MAD-Bench, the accuracy of any other model in our experiments ranges from 9% to 50%.
arXiv Detail & Related papers (2024-02-20T18:31:27Z)
- Improving accuracy of GPT-3/4 results on biomedical data using a retrieval-augmented language model [0.0]
Large language models (LLMs) have made significant advancements in natural language processing (NLP).
Training LLMs on focused corpora poses computational challenges.
An alternative approach is to use a retrieval-augmentation (RetA) method tested in a specific domain.
OpenAI's GPT-3, GPT-4, Bing's Prometheus, and a custom RetA model were compared using 19 questions on diffuse large B-cell lymphoma (DLBCL) disease.
arXiv Detail & Related papers (2023-05-26T17:33:05Z)
- Improving Large Language Models for Clinical Named Entity Recognition via Prompt Engineering [20.534197056683695]
This study quantifies the capabilities of GPT-3.5 and GPT-4 for clinical named entity recognition (NER) tasks.
We developed a task-specific prompt framework that includes baseline prompts, annotation guideline-based prompts, error analysis-based instructions, and annotated samples.
We assessed each prompt's effectiveness and compared the models to BioClinicalBERT.
arXiv Detail & Related papers (2023-03-29T02:46:18Z)
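The layered prompt framework in the entry above (baseline task text, guideline-based instructions, error-analysis notes, annotated samples) maps naturally onto a template. A hypothetical sketch follows; the section wording and the worked example are illustrative, not the paper's released prompts.

```python
# Hypothetical sketch of a layered clinical-NER prompt in the spirit of the
# framework above; wording and the example are illustrative, not the paper's.
BASELINE = "Extract all clinical named entities (problem, treatment, test) from the note."
GUIDELINES = "Annotation guidelines: tag full noun phrases; include negated problems."
ERROR_NOTES = "Common errors to avoid: do not tag body parts alone; do not split drug names."
EXAMPLES = [
    ("Patient denies chest pain.", "problem: chest pain"),
]

def build_prompt(note: str) -> str:
    """Compose baseline task text, guidelines, error-analysis notes, and demos."""
    demos = "\n".join(f"Note: {t}\nEntities: {e}" for t, e in EXAMPLES)
    return f"{BASELINE}\n{GUIDELINES}\n{ERROR_NOTES}\n{demos}\nNote: {note}\nEntities:"

print(build_prompt("Metformin was started for type 2 diabetes."))
```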
- How Does In-Context Learning Help Prompt Tuning? [55.78535874154915]
Fine-tuning large language models is becoming ever more impractical due to their rapidly growing scale.
This motivates the use of parameter-efficient adaptation methods such as prompt tuning (PT), which adds a small number of tunable embeddings to an otherwise frozen model.
Recently, Singhal et al. (2022) propose "instruction prompt tuning" (IPT), which combines PT with ICL by concatenating a natural language demonstration with learned prompt embeddings.
arXiv Detail & Related papers (2023-02-22T17:45:12Z)
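Prompt tuning as described in the entry above has a compact expression in the Hugging Face peft library: a small set of trainable soft-prompt embeddings is prepended to the inputs of an otherwise frozen model. A minimal sketch, assuming GPT-2 as an illustrative base model; under IPT, an in-context demonstration would additionally be concatenated with the learned soft prompt at input time.

```python
# Hypothetical sketch of prompt tuning (PT) with Hugging Face PEFT: a small
# number of trainable virtual-token embeddings are prepended to the inputs of
# an otherwise frozen model. The base model and token count are illustrative.
from transformers import AutoModelForCausalLM
from peft import PromptTuningConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")

config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,  # the "small number of tunable embeddings"
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the virtual-token embeddings train
```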
- Evaluating Psychological Safety of Large Language Models [72.88260608425949]
We designed unbiased prompts to evaluate the psychological safety of large language models (LLMs).
We tested five different LLMs by using two personality tests: Short Dark Triad (SD-3) and Big Five Inventory (BFI).
Despite being instruction fine-tuned with safety metrics to reduce toxicity, InstructGPT, GPT-3.5, and GPT-4 still showed dark personality patterns.
Fine-tuning Llama-2-chat-7B with responses from BFI using direct preference optimization could effectively reduce the psychological toxicity of the model.
arXiv Detail & Related papers (2022-12-20T18:45:07Z)
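The DPO fine-tuning mentioned in the entry above can be sketched with the trl library. The sketch below is hypothetical: the preference pairs are invented for illustration, and exact argument names (e.g. processing_class vs. tokenizer) vary across trl versions.

```python
# Hypothetical sketch of direct preference optimization (DPO) with trl.
# Model name, data, and hyperparameters are illustrative; argument names
# differ across trl versions.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference pairs: a "chosen" (low-toxicity) and "rejected" response per
# prompt; these examples are invented for illustration.
train_dataset = Dataset.from_dict({
    "prompt": ["I see myself as someone who is sometimes rude to others."],
    "chosen": ["Disagree strongly."],
    "rejected": ["Agree strongly."],
})

trainer = DPOTrainer(
    model=model,  # trl builds a frozen reference copy when ref_model is omitted
    args=DPOConfig(output_dir="dpo-llama2", beta=0.1),
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```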
- Prompting GPT-3 To Be Reliable [117.23966502293796]
This work decomposes reliability into four facets: generalizability, fairness, calibration, and factuality.
We find that GPT-3 outperforms smaller-scale supervised models by large margins on all these facets.
arXiv Detail & Related papers (2022-10-17T14:52:39Z)
- News Summarization and Evaluation in the Era of GPT-3 [73.48220043216087]
We study how GPT-3 compares against fine-tuned models trained on large summarization datasets.
We show that not only do humans overwhelmingly prefer GPT-3 summaries, prompted using only a task description, but these also do not suffer from common dataset-specific issues such as poor factuality.
arXiv Detail & Related papers (2022-09-26T01:04:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.