Vocabulary Transfer for Medical Texts
- URL: http://arxiv.org/abs/2208.02554v1
- Date: Thu, 4 Aug 2022 09:53:22 GMT
- Title: Vocabulary Transfer for Medical Texts
- Authors: Vladislav D. Mosin, Ivan P. Yamshchikov
- Abstract summary: Vocabulary transfer is a subtask in which language models are fine-tuned with a corpus-specific tokenization instead of the default one.
We demonstrate that vocabulary transfer is especially beneficial for medical text processing.
- Score: 7.195824023358536
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vocabulary transfer is a transfer learning subtask in which language
models are fine-tuned with a corpus-specific tokenization instead of the default
one used during pretraining. This usually improves the resulting
performance of the model, and in the paper, we demonstrate that vocabulary
transfer is especially beneficial for medical text processing. Using three
different medical natural language processing datasets, we show vocabulary
transfer to provide up to ten extra percentage points for the downstream
classifier accuracy.
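The core mechanics of vocabulary transfer can be illustrated with a toy sketch: when a model switches to a new, corpus-specific vocabulary, each new token's embedding can be initialized from the old embeddings of the old-vocabulary pieces that spell it. The greedy segmenter and all names below are illustrative assumptions, not the paper's code.

```python
# Sketch of a common vocabulary-transfer initialization heuristic:
# embed each new token as the mean of its old-vocabulary pieces' embeddings.

def segment(token, vocab):
    """Greedy longest-match segmentation of `token` into pieces from `vocab`."""
    pieces, i = [], 0
    while i < len(token):
        for j in range(len(token), i, -1):
            if token[i:j] in vocab:
                pieces.append(token[i:j])
                i = j
                break
        else:
            pieces.append(token[i])  # fall back to a single character
            i += 1
    return pieces

def transfer_embeddings(old_emb, new_vocab, dim):
    """Build embeddings for `new_vocab` by averaging old-piece embeddings."""
    new_emb = {}
    for tok in new_vocab:
        if tok in old_emb:            # token survives unchanged: copy it over
            new_emb[tok] = old_emb[tok]
            continue
        pieces = [p for p in segment(tok, old_emb) if p in old_emb]
        if pieces:                    # average the component vectors
            new_emb[tok] = [sum(old_emb[p][k] for p in pieces) / len(pieces)
                            for k in range(dim)]
        else:                         # nothing to transfer: start from zeros
            new_emb[tok] = [0.0] * dim
    return new_emb
```

For example, a new domain token "medic" whose old-tokenizer pieces are "me" and "dic" would start from the average of those two pretrained vectors, after which the model is fine-tuned as usual.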
Related papers
- Dementia Insights: A Context-Based MultiModal Approach [0.3749861135832073]
Early detection is crucial for timely interventions that may slow disease progression.
Large pre-trained models (LPMs) for text and audio have shown promise in identifying cognitive impairments.
This study proposes a context-based multimodal method, integrating both text and audio data using the best-performing LPMs.
arXiv Detail & Related papers (2025-03-03T06:46:26Z)
- Latent Paraphrasing: Perturbation on Layers Improves Knowledge Injection in Language Models [54.385486006684495]
LaPael is a latent-level paraphrasing method that applies input-dependent noise to the early layers of large language models.
Our experiments on question-answering benchmarks demonstrate that LaPael improves knowledge injection over standard fine-tuning and existing noise-based approaches.
arXiv Detail & Related papers (2024-11-01T15:47:05Z)
- Uncertainty-aware Medical Diagnostic Phrase Identification and Grounding [72.18719355481052]
We introduce a novel task called Medical Report Grounding (MRG).
MRG aims to directly identify diagnostic phrases and their corresponding grounding boxes from medical reports in an end-to-end manner.
We propose uMedGround, a robust and reliable framework that leverages a multimodal large language model to predict diagnostic phrases.
arXiv Detail & Related papers (2024-04-10T07:41:35Z)
- An Analysis of BPE Vocabulary Trimming in Neural Machine Translation [56.383793805299234]
Vocabulary trimming is a post-processing step that replaces rare subwords with their component subwords.
We show that vocabulary trimming fails to improve performance and is even prone to incurring heavy degradation.
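The trimming operation itself is simple to sketch: subwords below a frequency threshold are dropped, and text that used them is re-segmented with the surviving pieces. The greedy re-segmentation and all names here are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of BPE vocabulary trimming: drop low-frequency subwords,
# then re-encode words using only the surviving (smaller) pieces.

def trim_vocab(vocab_freq, min_freq, alphabet):
    """Keep subwords at or above `min_freq`; single characters always survive."""
    kept = {s for s, f in vocab_freq.items() if f >= min_freq or len(s) == 1}
    return kept | set(alphabet)

def encode(word, vocab):
    """Greedy longest-match encoding of `word` with subwords from `vocab`."""
    out, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                out.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"no piece covers {word[i]!r}")
    return out
```

For instance, if the subword "lower" is rare and gets trimmed while "low" and "er" survive, the word "lower" is thereafter encoded as the two-piece sequence ["low", "er"].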
arXiv Detail & Related papers (2024-03-30T15:29:49Z)
- Enhancing Medical Specialty Assignment to Patients using NLP Techniques [0.0]
We propose an alternative approach that achieves superior performance while being computationally efficient.
Specifically, we utilize keywords to train a deep learning architecture that outperforms a language model pretrained on a large corpus of text.
Our results demonstrate that utilizing keywords for text classification significantly improves classification performance.
arXiv Detail & Related papers (2023-12-09T14:13:45Z)
- Towards preserving word order importance through Forced Invalidation [80.33036864442182]
We show that pre-trained language models are insensitive to word order.
We propose Forced Invalidation to help preserve the importance of word order.
Our experiments demonstrate that Forced Invalidation significantly improves the sensitivity of the models to word order.
arXiv Detail & Related papers (2023-04-11T13:42:10Z)
- Does Synthetic Data Generation of LLMs Help Clinical Text Mining? [51.205078179427645]
We investigate the potential of OpenAI's ChatGPT to aid in clinical text mining.
We propose a new training paradigm that involves generating a vast quantity of high-quality synthetic data.
Our method has resulted in significant improvements in the performance of downstream tasks.
arXiv Detail & Related papers (2023-03-08T03:56:31Z)
- Leveraging knowledge graphs to update scientific word embeddings using latent semantic imputation [0.0]
We show how LSI (latent semantic imputation) can impute embeddings for domain-specific words from up-to-date knowledge graphs.
We show that LSI can produce reliable embedding vectors for rare and OOV terms in the biomedical domain.
arXiv Detail & Related papers (2022-10-27T12:15:26Z)
- Fine-Tuning Large Neural Language Models for Biomedical Natural Language Processing [55.52858954615655]
We conduct a systematic study on fine-tuning stability in biomedical NLP.
We show that fine-tuning performance may be sensitive to pretraining settings, especially in low-resource domains.
We show that these techniques can substantially improve fine-tuning performance for low-resource biomedical NLP applications.
arXiv Detail & Related papers (2021-12-15T04:20:35Z)
- AVocaDo: Strategy for Adapting Vocabulary to Downstream Domain [17.115865763783336]
We propose to consider the vocabulary as an optimizable parameter, allowing us to update the vocabulary by expanding it with domain-specific vocabulary.
We preserve the embeddings of the added words from overfitting to downstream data by utilizing knowledge learned from a pretrained language model with a regularization term.
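The idea of anchoring added words to pretrained knowledge can be sketched as a regularization term. AVocaDo itself regularizes with contextualized representations from the pretrained model; as a simplified stand-in, the sketch below penalizes the squared distance between each added word's adapted embedding and the mean of its pretrained subword embeddings. All names and the L2 form are illustrative assumptions.

```python
# Hedged sketch: regularize added domain-word embeddings toward a reference
# vector derived from the pretrained model, so they do not overfit downstream
# data. (A simplified L2 stand-in for AVocaDo's contextual regularization.)

def anchor_embedding(word_pieces, pretrained_emb):
    """Reference vector for an added word: mean of its pretrained piece vectors."""
    vecs = [pretrained_emb[p] for p in word_pieces]
    dim = len(vecs[0])
    return [sum(v[k] for v in vecs) / len(vecs) for k in range(dim)]

def regularization(added, anchors, lam):
    """lam * total squared distance between adapted embeddings and their anchors."""
    total = 0.0
    for word, vec in added.items():
        ref = anchors[word]
        total += sum((a - b) ** 2 for a, b in zip(vec, ref))
    return lam * total
```

During fine-tuning this term would be added to the task loss, so an added word like "ecg" stays close to what the pretrained model encoded for its pieces while still adapting to the downstream corpus.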
arXiv Detail & Related papers (2021-10-26T06:26:01Z)
- Recognising Biomedical Names: Challenges and Solutions [9.51284672475743]
We propose a transition-based NER model which can recognise discontinuous mentions.
We also develop a cost-effective approach that nominates the suitable pre-training data.
Our contributions have obvious practical implications, especially when new biomedical applications are needed.
arXiv Detail & Related papers (2021-06-23T08:20:13Z)
- Integration of Domain Knowledge using Medical Knowledge Graph Deep Learning for Cancer Phenotyping [6.077023952306772]
We propose a method to integrate external knowledge from medical terminology into the context captured by word embeddings.
We evaluate the proposed approach using a Multitask Convolutional Neural Network (MT-CNN) to extract six cancer characteristics from a dataset of 900K cancer pathology reports.
arXiv Detail & Related papers (2021-01-05T03:59:43Z)
- Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing [73.37262264915739]
We show that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains.
Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks.
arXiv Detail & Related papers (2020-07-31T00:04:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.