Evaluation of LLMs in Medical Text Summarization: The Role of Vocabulary Adaptation in High OOV Settings
- URL: http://arxiv.org/abs/2505.21242v1
- Date: Tue, 27 May 2025 14:23:03 GMT
- Title: Evaluation of LLMs in Medical Text Summarization: The Role of Vocabulary Adaptation in High OOV Settings
- Authors: Gunjan Balde, Soumyadeep Roy, Mainack Mondal, Niloy Ganguly
- Abstract summary: Large Language Models (LLMs) recently achieved great success in medical text summarization by simply using in-context learning. We show that LLMs suffer a significant performance drop for data points with a high concentration of out-of-vocabulary words or with high novelty. Vocabulary adaptation is an intuitive solution to this vocabulary mismatch issue.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have recently achieved great success in medical text summarization by simply using in-context learning. However, these recent efforts do not perform fine-grained evaluations under difficult settings where LLMs might fail; they typically report performance scores over the entire dataset. Through our benchmarking study, we show that LLMs suffer a significant performance drop on data points with a high concentration of out-of-vocabulary (OOV) words or with high novelty. Vocabulary adaptation is an intuitive solution to this vocabulary mismatch issue, in which the LLM vocabulary is updated with certain expert-domain (here, medical) words or subwords. An interesting finding from our study is that Llama-3.1, even with a vocabulary of around 128K tokens, still faces an over-fragmentation issue with medical words. To that end, we show that vocabulary adaptation helps improve LLM summarization performance even in difficult settings. Through extensive experimentation with multiple vocabulary adaptation strategies, two continual pretraining strategies, and three benchmark medical summarization datasets, we gain valuable insights into the role of vocabulary adaptation strategies in customizing LLMs to the medical domain. We also performed a human evaluation study with medical experts, who found that vocabulary adaptation results in more relevant and faithful summaries. Our codebase is made publicly available at https://github.com/gb-kgp/LLM-MedicalSummarization-Benchmark.
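The over-fragmentation issue described above can be illustrated with a toy example. The sketch below is not the paper's code; it uses a simplified greedy longest-match subword tokenizer (the vocabularies are made up for illustration) to show how a general-purpose vocabulary splits a medical term into many pieces, and how adding one domain token, the essence of vocabulary adaptation, shortens the segmentation.

```python
def tokenize(word, vocab):
    """Greedily match the longest vocabulary entry at each position."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest match first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # fall back to a single character
            i += 1
    return pieces

# A toy general-purpose vocabulary: the medical term fragments badly.
general_vocab = {"hyper", "ten", "sion", "emia", "lip", "id"}
print(tokenize("hyperlipidemia", general_vocab))   # ['hyper', 'lip', 'id', 'emia']

# After vocabulary adaptation, the term survives as a single token.
adapted_vocab = general_vocab | {"hyperlipidemia"}
print(tokenize("hyperlipidemia", adapted_vocab))   # ['hyperlipidemia']
```

Fewer, more meaningful tokens give the model whole medical concepts instead of arbitrary fragments, which is the intuition behind the summarization gains reported in the abstract.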
Related papers
- Exploiting the English Vocabulary Profile for L2 word-level vocabulary assessment with LLMs [2.201161230389126]
This paper introduces a novel approach to enable fine-grained vocabulary evaluation. The scheme combines large language models (LLMs) with the English Vocabulary Profile (EVP). The EVP is a standard lexical resource that enables in-context vocabulary use to be linked with proficiency level.
arXiv Detail & Related papers (2025-06-03T11:23:57Z)
- How Can We Effectively Expand the Vocabulary of LLMs with 0.01GB of Target Language Text? [38.1823640848362]
Large language models (LLMs) have shown remarkable capabilities in many languages beyond English.
LLMs require more inference steps when generating non-English text due to their reliance on English-centric tokenizers and vocabulary.
Vocabulary expansion with target language tokens is a widely used cross-lingual vocabulary adaptation approach to remedy this issue.
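The expansion step mentioned above can be sketched in a few lines. This is a minimal illustration, not the cited paper's implementation: the new token's embedding row is initialized as the mean of the embeddings of the subword pieces it replaces, a common heuristic that is an assumption here, and the tiny vocabulary and 4-dimensional embeddings are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"med": 0, "ic": 1, "ine": 2}          # toy subword vocabulary
emb = rng.normal(size=(len(vocab), 4))          # toy embedding table, one row per token

def expand_vocab(vocab, emb, new_token, pieces):
    """Append new_token; initialize its row as the mean of its pieces' rows."""
    rows = np.stack([emb[vocab[p]] for p in pieces])
    vocab = {**vocab, new_token: len(vocab)}
    emb = np.vstack([emb, rows.mean(axis=0)])
    return vocab, emb

vocab, emb = expand_vocab(vocab, emb, "medicine", ["med", "ic", "ine"])
print(emb.shape)  # one extra embedding row for the new whole-word token
```

After expansion, text containing the new token needs one embedding lookup instead of three, which is why vocabulary expansion reduces inference steps for non-English (or domain-specific) text.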
arXiv Detail & Related papers (2024-06-17T12:42:34Z)
- MEDVOC: Vocabulary Adaptation for Fine-tuning Pre-trained Language Models on Medical Text Summarization [26.442558912559658]
This work presents a dynamic vocabulary adaptation strategy, MEDVOC, for fine-tuning pre-trained language models (PLMs).
In contrast to existing domain adaptation approaches in summarization, MEDVOC treats vocabulary as an optimizable parameter.
Our human evaluation shows MEDVOC generates more faithful medical summaries.
arXiv Detail & Related papers (2024-05-07T10:00:00Z)
- PhonologyBench: Evaluating Phonological Skills of Large Language Models [57.80997670335227]
Phonology, the study of speech's structure and pronunciation rules, is a critical yet often overlooked component in Large Language Model (LLM) research.
We present PhonologyBench, a novel benchmark consisting of three diagnostic tasks designed to explicitly test the phonological skills of LLMs.
We observe a significant gap of 17% and 45% on Rhyme Word Generation and Syllable counting, respectively, when compared to humans.
arXiv Detail & Related papers (2024-04-03T04:53:14Z)
- FactPICO: Factuality Evaluation for Plain Language Summarization of Medical Evidence [46.71469172542448]
This paper presents FactPICO, a factuality benchmark for plain language summarization of medical texts.
It consists of 345 plain language summaries of abstracts generated from three randomized controlled trials (RCTs).
We assess the factuality of critical elements of RCTs in those summaries, as well as the reported findings concerning these.
arXiv Detail & Related papers (2024-02-18T04:45:01Z)
- When LLMs Meet Cunning Texts: A Fallacy Understanding Benchmark for Large Language Models [59.84769254832941]
We propose a FaLlacy Understanding Benchmark (FLUB) containing cunning texts that are easy for humans to understand but difficult for models to grasp.
Specifically, the cunning texts that FLUB focuses on mainly consist of the tricky, humorous, and misleading texts collected from the real internet environment.
Based on FLUB, we investigate the performance of multiple representative and advanced LLMs.
arXiv Detail & Related papers (2024-02-16T22:12:53Z)
- Salute the Classic: Revisiting Challenges of Machine Translation in the Age of Large Language Models [91.6543868677356]
The evolution of Neural Machine Translation has been influenced by six core challenges.
These challenges include domain mismatch, amount of parallel data, rare word prediction, translation of long sentences, attention model as word alignment, and sub-optimal beam search.
This study revisits these challenges, offering insights into their ongoing relevance in the context of advanced Large Language Models.
arXiv Detail & Related papers (2024-01-16T13:30:09Z)
- Adapting Large Language Models for Document-Level Machine Translation [46.370862171452444]
Large language models (LLMs) have significantly advanced various natural language processing (NLP) tasks.
Recent research indicates that moderately-sized LLMs often outperform larger ones after task-specific fine-tuning.
This study focuses on adapting LLMs for document-level machine translation (DocMT) for specific language pairs.
arXiv Detail & Related papers (2024-01-12T09:29:13Z)
- The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics [74.99898531299148]
This research examines vocabulary trimming (VT) inspired by restricting embedding entries to the language of interest to bolster time and memory efficiency.
We apply two approaches to trimming the full vocabulary, Unicode-based script filtering and corpus-based selection, across different language families and sizes.
It is found that VT reduces the memory usage of small models by nearly 50% and has an upper bound of 25% improvement in generation speed.
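The Unicode-based script filtering mentioned above can be sketched as follows. This is our illustrative reading of the approach, not the paper's code: a vocabulary entry is kept only if every alphabetic character in it belongs to the target script, identified here via the standard library's Unicode character names (the toy vocabulary is made up).

```python
import unicodedata

def keep_for_script(token, script_prefix="LATIN"):
    """True if every alphabetic character in token belongs to the target script."""
    for ch in token:
        if ch.isalpha():
            # Unicode character names start with the script, e.g. "LATIN SMALL LETTER A"
            name = unicodedata.name(ch, "")
            if not name.startswith(script_prefix):
                return False
    return True

vocab = ["hello", "мир", "世界", "data", "##ing"]
trimmed = [t for t in vocab if keep_for_script(t, "LATIN")]
print(trimmed)  # Cyrillic and CJK entries are dropped; subword markers survive
```

Since the embedding table has one row per vocabulary entry, dropping out-of-script entries shrinks it proportionally, which is where the memory savings reported above come from.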
arXiv Detail & Related papers (2023-11-16T09:35:50Z)
- CohortGPT: An Enhanced GPT for Participant Recruitment in Clinical Study [17.96401880059829]
Large Language Models (LLMs) such as ChatGPT have achieved tremendous success in various downstream tasks.
We propose to use a knowledge graph as auxiliary information to guide the LLMs in making predictions.
Our few-shot learning method achieves satisfactory performance compared with fine-tuning strategies.
arXiv Detail & Related papers (2023-07-21T04:43:00Z)
- Always Keep your Target in Mind: Studying Semantics and Improving Performance of Neural Lexical Substitution [124.99894592871385]
We present a large-scale comparative study of lexical substitution methods employing both old and most recent language models.
We show that the already competitive results achieved by SOTA LMs/MLMs can be substantially improved further if information about the target word is injected properly.
arXiv Detail & Related papers (2022-06-07T16:16:19Z)
- An Automated Method to Enrich Consumer Health Vocabularies Using GloVe Word Embeddings and An Auxiliary Lexical Resource [0.0]
A layman may have difficulty communicating with a professional due to not understanding specialized terms common to the domain.
Several professional vocabularies have been created to map laymen medical terms to professional medical terms and vice versa.
We present an automatic method to enrich laymen's vocabularies that has the benefit of being applicable to vocabularies in any domain.
arXiv Detail & Related papers (2021-05-18T20:16:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.