Related papers: Model Tuning or Prompt Tuning? A Study of Large Language Models for Clinical Concept and Relation Extraction

URL: http://arxiv.org/abs/2310.06239v1
Date: Tue, 10 Oct 2023 01:27:08 GMT
Title: Model Tuning or Prompt Tuning? A Study of Large Language Models for Clinical Concept and Relation Extraction
Authors: Cheng Peng, Xi Yang, Kaleb E Smith, Zehao Yu, Aokun Chen, Jiang Bian, Yonghui Wu
Abstract summary: We develop soft prompt-based learning algorithms for large language models (LLMs). We compare 4 training strategies: fine-tuning without prompts, hard-prompt with unfrozen LLMs, soft-prompt with unfrozen LLMs, and soft-prompt with frozen LLMs. We evaluate the transfer learning ability of the prompt-based learning algorithms in a cross-institution setting.
Score: 26.504643007899592
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Objective: To develop soft prompt-based learning algorithms for large language models (LLMs) and to examine the shape of prompts, prompt tuning using frozen/unfrozen LLMs, transfer learning, and few-shot learning abilities. Methods: We developed a soft prompt-based LLM model and compared 4 training strategies: (1) fine-tuning without prompts; (2) hard-prompt with unfrozen LLMs; (3) soft-prompt with unfrozen LLMs; and (4) soft-prompt with frozen LLMs. We evaluated 7 pretrained LLMs using the 4 training strategies for clinical concept and relation extraction on two benchmark datasets. We evaluated the transfer learning ability of the prompt-based learning algorithms in a cross-institution setting. We also assessed the few-shot learning ability. Results and Conclusion: When LLMs are unfrozen, GatorTron-3.9B with soft prompting achieves the best strict F1-scores of 0.9118 and 0.8604 for concept extraction, outperforming the traditional fine-tuning and hard prompt-based models by 0.6~3.1% and 1.2~2.9%, respectively; GatorTron-345M with soft prompting achieves the best F1-scores of 0.8332 and 0.7488 for end-to-end relation extraction, outperforming the other two models by 0.2~2% and 0.6~11.7%, respectively. When LLMs are frozen, small LLMs (i.e., 345 million parameters) fall well short of unfrozen models; scaling LLMs up to billions of parameters makes frozen LLMs competitive with unfrozen LLMs. For cross-institution evaluation, soft prompting with a frozen GatorTron-8.9B model achieved the best performance. This study demonstrates that (1) machines can learn soft prompts better than humans, (2) frozen LLMs have better few-shot learning ability and transfer learning ability to facilitate multi-institution applications, and (3) frozen LLMs require large models.
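
The four training strategies in the abstract differ mainly in which parameters are updated. The sketch below is an illustration only, not the authors' implementation: it counts trainable parameters per strategy under the assumption that a soft prompt is a single matrix of prompt_len x hidden_size learned vectors and that task-specific heads are ignored. The names Strategy and trainable_parameters, the prompt length of 32, and the hidden size of 1024 are hypothetical choices; only the 345 million parameter model size comes from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    """One of the four training strategies named in the abstract."""
    name: str
    uses_soft_prompt: bool  # True if trainable prompt vectors are added
    llm_frozen: bool        # True if the LLM weights are not updated


STRATEGIES = [
    Strategy("fine-tuning without prompts", uses_soft_prompt=False, llm_frozen=False),
    Strategy("hard-prompt with unfrozen LLM", uses_soft_prompt=False, llm_frozen=False),
    Strategy("soft-prompt with unfrozen LLM", uses_soft_prompt=True, llm_frozen=False),
    Strategy("soft-prompt with frozen LLM", uses_soft_prompt=True, llm_frozen=True),
]


def trainable_parameters(strategy, llm_params, prompt_len, hidden_size):
    """Count trainable parameters, ignoring any task-specific output head.

    Assumption: the soft prompt is one (prompt_len x hidden_size) matrix.
    A hard prompt is ordinary text, so it adds no trainable parameters.
    """
    prompt_params = prompt_len * hidden_size if strategy.uses_soft_prompt else 0
    llm_trainable = 0 if strategy.llm_frozen else llm_params
    return llm_trainable + prompt_params


if __name__ == "__main__":
    for s in STRATEGIES:
        # prompt_len and hidden_size are hypothetical illustration values.
        n = trainable_parameters(s, llm_params=345_000_000, prompt_len=32, hidden_size=1024)
        print(f"{s.name}: {n:,} trainable parameters")
```

Under these assumptions, the frozen strategy updates only the prompt matrix, which is orders of magnitude smaller than the model itself. That is consistent with the abstract's observation that frozen LLMs need to be large to stay competitive.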

Related papers

An Empirical Study of Many-to-Many Summarization with Large Language Models [82.10000188179168]
Large language models (LLMs) have shown strong multi-lingual abilities, giving them the potential to perform many-to-many summarization (M2MS) in real applications. This work presents a systematic empirical study on LLMs' M2MS ability.
arXiv Detail & Related papers (2025-05-19T11:18:54Z)
Bridging the LLM Accessibility Divide? Performance, Fairness, and Cost of Closed versus Open LLMs for Automated Essay Scoring [18.33969226071914]
We compare nine leading large language models (LLMs) across text assessment and generation tasks related to automated essay scoring. Our findings reveal that for few-shot learning-based assessment of human-generated essays, open LLMs such as Llama 3 and Qwen2.5 perform comparably to GPT-4 in terms of predictive performance. For generative tasks, we find that essays generated by top open LLMs are comparable to those from closed LLMs in terms of their semantic composition/embeddings and ML-assessed scores.
arXiv Detail & Related papers (2025-03-14T19:34:40Z)
LLM2: Let Large Language Models Harness System 2 Reasoning [65.89293674479907]
Large language models (LLMs) have exhibited impressive capabilities across a myriad of tasks, yet they occasionally yield undesirable outputs. We introduce LLM2, a novel framework that combines an LLM with a process-based verifier. The LLM is responsible for generating plausible candidates, while the verifier provides timely process-based feedback to distinguish desirable and undesirable outputs.
arXiv Detail & Related papers (2024-12-29T06:32:36Z)
Prompting Large Language Models for Clinical Temporal Relation Extraction [5.403858596195122]
This study utilizes four large language models (LLMs) for clinical temporal relation extraction (CTRE). We developed full (FFT) and parameter-efficient (PEFT) fine-tuning strategies and evaluated these strategies on the 2012 i2b2 CTRE task.
arXiv Detail & Related papers (2024-12-04T18:35:28Z)
SEUF: Is Unlearning One Expert Enough for Mixture-of-Experts LLMs? [35.237427998489785]
We propose a novel Selected-Expert Unlearning Framework (SEUF) for Mixture-of-Experts (MoE) LLMs. Through expert attribution, unlearning is concentrated on the most actively engaged experts for the specified knowledge. SEUF is compatible with various standard unlearning algorithms.
arXiv Detail & Related papers (2024-11-27T22:46:08Z)
LLM-Neo: Parameter Efficient Knowledge Distillation for Large Language Models [45.99790250483618]
We propose a novel framework that efficiently transfers knowledge from a large language model to a compact student. Inspired by this observation, we explore the strategy that combines LoRA and KD to enhance the efficiency of knowledge transfer.
arXiv Detail & Related papers (2024-11-11T10:07:51Z)
LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints [86.59857711385833]
We introduce RealInstruct, the first benchmark designed to evaluate LLMs' ability to follow real-world multi-constrained instructions. To address the performance gap between open-source and proprietary models, we propose the Decompose, Critique and Refine (DeCRIM) self-correction pipeline. Our results show that DeCRIM improves Mistral's performance by 7.3% on RealInstruct and 8.0% on IFEval even with weak feedback.
arXiv Detail & Related papers (2024-10-09T01:25:10Z)
MEOW: MEMOry Supervised LLM Unlearning Via Inverted Facts [29.593170782882563]
Large Language Models (LLMs) can memorize sensitive information, raising concerns about potential misuse. Previous practices face three key challenges: utility, efficiency, and robustness. We propose MEOW, a gradient descent-based unlearning method.
arXiv Detail & Related papers (2024-09-18T09:55:48Z)
Self-Instructed Derived Prompt Generation Meets In-Context Learning: Unlocking New Potential of Black-Box LLMs [30.333277284839053]
Large language models (LLMs) have shown success in generating high-quality responses. Existing methods to enhance response quality often involve a prompt refinement model. We introduce a self-instructed in-context learning framework that empowers LLMs to deliver more effective responses.
arXiv Detail & Related papers (2024-09-03T02:42:39Z)
LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation [41.05687297326706]
LLaVA-MoD is a framework designed to enable the efficient training of small-scale Multimodal Language Models. We optimize the network structure of s-MLLM by integrating a sparse Mixture of Experts architecture into the language model. We also propose a progressive knowledge transfer strategy to ensure comprehensive knowledge migration.
arXiv Detail & Related papers (2024-08-28T15:52:23Z)
Extend Model Merging from Fine-Tuned to Pre-Trained Large Language Models via Weight Disentanglement [72.97553348776425]
We make a pioneering effort to broaden the applicability of merging techniques from FT to PT LLMs. We introduce an approach based on WeIght DisENtanglement (WIDEN) to effectively extend the merging scope. We merge Qwen1.5-Chat (an FT LLM with instruction-following skills) with Sailor (a PT LLM with multilingual abilities) across 7B and 14B model scales.
arXiv Detail & Related papers (2024-08-06T10:46:46Z)
Q-Sparse: All Large Language Models can be Fully Sparsely-Activated [93.45300714803429]
We introduce Q-Sparse, a simple yet effective approach to training sparsely-activated large language models (LLMs). Q-Sparse enables full sparsity of activations in LLMs, which can bring significant efficiency gains in inference. We also introduce Block Q-Sparse for batch training and inference.
arXiv Detail & Related papers (2024-07-15T17:59:29Z)
Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models [79.46938238953916]
Fine-tuning large language models (LLMs) to diverse applications is crucial to meet complex demands. Recent studies suggest decomposing a fine-tuned LLM into a base model and corresponding delta weights, which are then compressed using low-rank or low-bit approaches to reduce costs. In this work, we observe that existing low-rank and low-bit compression methods can significantly harm the model performance for task-specific fine-tuned LLMs.
arXiv Detail & Related papers (2024-06-13T07:57:27Z)
LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation. We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset. Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes [91.58845026796149]
We introduce Distilling step-by-step, a new mechanism that trains small models that outperform large language models. We present three findings across 4 NLP benchmarks.
arXiv Detail & Related papers (2023-05-03T17:50:56Z)

This list is automatically generated from the titles and abstracts of the papers on this site.