Infusing clinical knowledge into tokenisers for language models
- URL: http://arxiv.org/abs/2406.14312v1
- Date: Thu, 20 Jun 2024 13:43:03 GMT
- Title: Infusing clinical knowledge into tokenisers for language models
- Authors: Abul Hasan, Jinge Wu, Quang Ngoc Nguyen, Salomé Andres, Imane Guellil, Huayu Zhang, Arlene Casey, Beatrice Alex, Bruce Guthrie, Honghan Wu
- Abstract summary: This study introduces a novel knowledge-enhanced tokenisation mechanism, K-Tokeniser, for clinical text processing.
At the initialisation stage, K-Tokeniser populates global representations of tokens based on the semantic types of domain concepts.
To avoid pretraining with the new tokeniser, an embedding initialisation approach is proposed to generate representations for new tokens.
- Score: 1.9921590146992474
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study introduces a novel knowledge-enhanced tokenisation mechanism, K-Tokeniser, for clinical text processing. Technically, at the initialisation stage, K-Tokeniser populates global representations of tokens based on the semantic types of domain concepts (such as drugs or diseases), drawn either from a domain ontology such as the Unified Medical Language System or from the training data of the task-related corpus. At the training or inference stage, sentence-level localised context is used to choose the optimal global token representation, realising semantic-based tokenisation. To avoid pretraining with the new tokeniser, an embedding initialisation approach is proposed to generate representations for new tokens. Using three transformer-based language models, a comprehensive set of experiments is conducted on four real-world datasets, evaluating K-Tokeniser across a wide range of clinical text analytics tasks: clinical concept and relation extraction, automated clinical coding, clinical phenotype identification, and clinical research article classification. Overall, our models demonstrate consistent improvements over their counterparts on all tasks. In particular, substantial improvements are observed in the automated clinical coding task, with a 13% increase in Micro $F_1$ score. Furthermore, K-Tokeniser also markedly accelerates convergence of language models: with K-Tokeniser, the models require only 50% of the training data to reach the best performance of the baseline tokeniser trained on all the data in the concept extraction task, and less than 20% of the data for the automated coding task. Notably, all these improvements require no pre-training, making the approach generalisable.
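The abstract describes the mechanism at a high level only, so the following is a minimal, hypothetical Python sketch of its two ideas: keeping multiple candidate segmentations per word grouped by semantic type and picking one with a sentence-level context score, and initialising embeddings for new tokens from their subword pieces. The candidate table, scoring heuristic, and mean-pooling initialisation are illustrative assumptions, not the paper's exact method.
```python
# Hypothetical sketch of a K-Tokeniser-style mechanism; the candidate table,
# scoring heuristic, and mean-pooling initialisation are illustrative only.
from typing import Dict, List

# Global token representations keyed by semantic type (in the paper these come
# from UMLS semantic types or the task corpus; here a toy table).
CANDIDATES: Dict[str, Dict[str, List[str]]] = {
    "paracetamol": {
        "baseline": ["para", "##ce", "##tam", "##ol"],  # generic subwords
        "drug": ["paracetamol"],                        # domain-level token
    },
}

def context_score(tokens: List[str], sentence: str) -> float:
    """Toy stand-in for the sentence-level context signal used to pick the
    optimal global representation (the paper's criterion differs)."""
    clinical = any(cue in sentence.lower() for cue in ("mg", "dose", "patient"))
    return 1.0 / len(tokens) + (0.5 if clinical else 0.0)

def k_tokenise(word: str, sentence: str) -> List[str]:
    options = CANDIDATES.get(word.lower())
    if not options:
        return [word]  # fall back to the word itself
    return max(options.values(), key=lambda toks: context_score(toks, sentence))

def init_new_token_embedding(subwords: List[str],
                             table: Dict[str, List[float]]) -> List[float]:
    """One plausible reading of the embedding initialisation step: a new
    token's vector is the mean of its subword vectors (an assumption)."""
    vecs = [table[s] for s in subwords if s in table]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

print(k_tokenise("paracetamol", "Patient given 500 mg paracetamol."))
# -> ['paracetamol']
print(init_new_token_embedding(["para", "##ce"],
                               {"para": [0.1, 0.2], "##ce": [0.3, 0.4]}))
# -> approximately [0.2, 0.3]
```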
Related papers
- SEP: Self-Enhanced Prompt Tuning for Visual-Language Model [68.68025991850115]
We introduce a novel approach named Self-Enhanced Prompt Tuning (SEP).
SEP explicitly incorporates discriminative prior knowledge to enhance both textual-level and visual-level embeddings.
Comprehensive evaluations across various benchmarks and tasks confirm SEP's efficacy in prompt tuning.
arXiv Detail & Related papers (2024-05-24T13:35:56Z)
- Task-Adaptive Tokenization: Enhancing Long-Form Text Generation Efficacy in Mental Health and Beyond [66.07002187192448]
We propose task-adaptive tokenization as a way to adapt the generation pipeline to the specifics of a downstream task.
We introduce a strategy for building a specialized vocabulary and introduce a vocabulary merging protocol.
We find that our task-adaptive tokenization approach brings a significant improvement in generation performance while using up to 60% fewer tokens.
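As a concrete illustration of the vocabulary-merging idea, here is a hedged sketch using the Hugging Face transformers API; the frequency threshold, the filter keeping only words the base tokenizer splits, and the toy corpus are assumptions, and the paper's actual merging protocol is more involved.
```python
# Hypothetical sketch of merging a task-specific vocabulary into a base
# tokenizer; the paper's merging protocol differs in detail.
from collections import Counter
from transformers import AutoTokenizer, AutoModel

def build_task_vocab(corpus, min_freq=50, max_terms=1000):
    """Collect frequent task-domain words as candidate vocabulary entries."""
    counts = Counter(w for doc in corpus for w in doc.lower().split())
    return [w for w, c in counts.most_common(max_terms) if c >= min_freq]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

task_corpus = ["patient reports anhedonia and insomnia"]  # illustrative
new_terms = [t for t in build_task_vocab(task_corpus, min_freq=1)
             if len(tokenizer.tokenize(t)) > 1]  # words the base vocab splits

tokenizer.add_tokens(new_terms)                # merge into the base vocabulary
model.resize_token_embeddings(len(tokenizer))  # allocate new embedding rows
```
The newly allocated embedding rows are randomly initialised, so they would still need sensible initialisation or fine-tuning before they are useful.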
arXiv Detail & Related papers (2023-10-09T00:20:59Z)
- Towards Unifying Anatomy Segmentation: Automated Generation of a Full-body CT Dataset via Knowledge Aggregation and Anatomical Guidelines [113.08940153125616]
We generate a dataset of whole-body CT scans with 142 voxel-level labels for 533 volumes, providing comprehensive anatomical coverage.
Our proposed procedure does not rely on manual annotation during the label aggregation stage.
We release our trained unified anatomical segmentation model capable of predicting 142 anatomical structures on CT data.
arXiv Detail & Related papers (2023-07-25T09:48:13Z)
- Modelling Temporal Document Sequences for Clinical ICD Coding [9.906895077843663]
We propose a hierarchical transformer architecture that uses text across the entire sequence of clinical notes in each hospital stay for ICD coding.
While using all clinical notes increases the quantity of data substantially, superconvergence can be used to reduce training costs.
Our model exceeds the prior state-of-the-art when using only discharge summaries as input, and achieves further performance improvements when all clinical notes are used as input.
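The summary gives the architecture only in outline; below is a hedged PyTorch sketch of the hierarchical idea, encoding tokens within each note and then running a second transformer over the per-note vectors. The sizes, mean-pooling choices, and multi-label head are assumptions, and a plain embedding layer stands in for the pretrained note encoder.
```python
# Hypothetical sketch of a hierarchical transformer for ICD coding; the
# paper's exact architecture, pooling, and head differ.
import torch
import torch.nn as nn

class HierarchicalICDCoder(nn.Module):
    def __init__(self, vocab_size=30522, hidden=256, n_codes=50):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden)  # stand-in encoder
        layer = nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True)
        self.note_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.stay_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(hidden, n_codes)  # multi-label ICD logits

    def forward(self, ids):                      # ids: (num_notes, seq_len)
        tok = self.note_encoder(self.tok_emb(ids))
        note_vecs = tok.mean(dim=1)              # one vector per note
        stay = self.stay_encoder(note_vecs.unsqueeze(0))  # sequence of notes
        return self.head(stay.mean(dim=1))       # (1, n_codes)

model = HierarchicalICDCoder()
logits = model(torch.randint(0, 30522, (4, 128)))  # 4 notes, 128 tokens each
```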
arXiv Detail & Related papers (2023-02-24T14:41:48Z)
- Cross-Lingual Knowledge Transfer for Clinical Phenotyping [55.92262310716537]
We investigate cross-lingual knowledge transfer strategies to execute this task for clinics that do not use the English language.
We evaluate these strategies for a Greek and a Spanish clinic leveraging clinical notes from different clinical domains.
Our results show that using multilingual data overall improves clinical phenotyping models and can compensate for data sparseness.
arXiv Detail & Related papers (2022-08-03T08:33:21Z)
- Classifying Unstructured Clinical Notes via Automatic Weak Supervision [17.45660355026785]
We introduce a general weakly-supervised text classification framework that learns from class-label descriptions only.
We leverage the linguistic domain knowledge stored within pre-trained language models and the data programming framework to assign code labels to texts.
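One simple way to realise "learning from class-label descriptions only" is to pseudo-label each note by its similarity to the label descriptions under a pretrained sentence encoder, then train any supervised classifier on the pseudo-labels; the sketch below takes that route with sentence-transformers. The label descriptions and model name are assumptions, and the paper's data-programming pipeline is richer than this.
```python
# Hypothetical sketch: pseudo-label notes by semantic similarity between each
# note and a class-label description; the paper's pipeline is more elaborate.
from sentence_transformers import SentenceTransformer, util

class_descriptions = {
    "cardiology": "diseases of the heart and blood vessels",
    "nephrology": "diseases of the kidneys",
}  # illustrative label descriptions

model = SentenceTransformer("all-MiniLM-L6-v2")
labels = list(class_descriptions)
desc_emb = model.encode([class_descriptions[l] for l in labels],
                        convert_to_tensor=True)

def pseudo_label(note: str) -> str:
    note_emb = model.encode(note, convert_to_tensor=True)
    scores = util.cos_sim(note_emb, desc_emb)[0]  # similarity per class
    return labels[int(scores.argmax())]

print(pseudo_label("elevated creatinine with reduced eGFR"))
# likely -> 'nephrology'
```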
arXiv Detail & Related papers (2022-06-24T05:55:49Z)
- Large Language Models are Zero-Shot Clinical Information Extractors [15.907327589436965]
We show that large language models, such as GPT-3, perform well at zero-shot information extraction from clinical text.
We present examples showing how to use these models as tools for the diverse tasks of (i) concept disambiguation, (ii) evidence extraction, (iii) coreference resolution, and (iv) concept extraction.
The key to good performance is the use of simple task-specific programs that map from the language model outputs to the label space of the task.
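That last point is illustrated below with a hedged sketch of such a task-specific program: a small deterministic resolver that maps a free-text LLM answer from a medication-extraction prompt onto a clean list. The answer formats it accepts are assumptions about typical model output, not the paper's exact resolvers.
```python
# Hypothetical sketch of a resolver mapping free-text LLM output to the task's
# label space; the accepted formats are assumptions, not the paper's code.
import re

def extract_medications(llm_output: str) -> list[str]:
    """Resolve a free-text answer (bulleted or comma-separated) into a
    clean list of medication strings."""
    if llm_output.strip().lower() in {"none", "n/a", "no medications"}:
        return []
    items = re.split(r"[\n,]", llm_output)                    # split entries
    items = [re.sub(r"^[\s\-\*\d\.\)]+", "", it).strip() for it in items]
    return [it.lower() for it in items if it]

print(extract_medications("- Aspirin 81mg\n- Metoprolol"))
# -> ['aspirin 81mg', 'metoprolol']
```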
arXiv Detail & Related papers (2022-05-25T11:49:58Z)
- HealthPrompt: A Zero-shot Learning Paradigm for Clinical Natural Language Processing [3.762895631262445]
We developed a novel prompt-based clinical NLP framework called HealthPrompt.
We performed an in-depth analysis of HealthPrompt on six different PLMs in a no-data setting.
Our experiments show that prompts effectively capture the context of clinical texts and perform remarkably well without any training data.
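A minimal sketch of this no-training-data setup follows, assuming a cloze template with a masked language model and a verbaliser that maps each class to a single vocabulary word; HealthPrompt's actual templates and verbalisers may differ.
```python
# Hypothetical sketch of prompt-based zero-shot classification with a masked
# LM; the template and verbaliser are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Each label word must be a single token in the model's vocabulary.
verbaliser = {"cardiac": "heart", "renal": "kidney"}

def classify(note: str) -> str:
    prompt = f"{note} This note is mainly about the {tokenizer.mask_token}."
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    scores = {lab: logits[tokenizer.convert_tokens_to_ids(word)].item()
              for lab, word in verbaliser.items()}
    return max(scores, key=scores.get)

print(classify("Echocardiogram shows reduced ejection fraction."))
# likely -> 'cardiac'
```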
arXiv Detail & Related papers (2022-03-09T21:44:28Z)
- Detecting of a Patient's Condition From Clinical Narratives Using Natural Language Representation [0.3149883354098941]
This paper proposes a joint clinical natural language representation learning and supervised classification framework.
The framework jointly learns distributional syntactic and latent semantic representations from contextual clinical narrative inputs.
The proposed framework achieves an overall classification accuracy of 89%, recall of 88%, and precision of 89%.
arXiv Detail & Related papers (2021-04-08T17:16:04Z)
- A Meta-embedding-based Ensemble Approach for ICD Coding Prediction [64.42386426730695]
International Classification of Diseases (ICD) codes are the de facto standard used globally for clinical coding.
These codes enable healthcare providers to claim reimbursement and facilitate efficient storage and retrieval of diagnostic information.
Our proposed approach enhances the performance of neural models by effectively training word vectors using routine medical data as well as external knowledge from scientific articles.
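As a small illustration of the meta-embedding idea, the sketch below concatenates word vectors from two hypothetical sources (routine medical data and scientific articles); the paper's ensemble is more elaborate than plain concatenation.
```python
# Hypothetical sketch of a meta-embedding: combine word vectors from multiple
# sources; concatenation is one common choice, not necessarily the paper's.
import numpy as np

clinical_vecs = {"infarction": np.random.rand(100)}    # e.g. from EHR text
scientific_vecs = {"infarction": np.random.rand(200)}  # e.g. from PubMed

def meta_embed(word: str) -> np.ndarray:
    """Concatenate source embeddings, zero-filling where a source lacks
    the word, so every word gets a fixed-size meta-embedding."""
    a = clinical_vecs.get(word, np.zeros(100))
    b = scientific_vecs.get(word, np.zeros(200))
    return np.concatenate([a, b])  # 300-d meta-embedding

print(meta_embed("infarction").shape)  # -> (300,)
```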
arXiv Detail & Related papers (2021-02-26T17:49:58Z)
- Benchmarking Automated Clinical Language Simplification: Dataset, Algorithm, and Evaluation [48.87254340298189]
We construct a new dataset named MedLane to support the development and evaluation of automated clinical language simplification approaches.
We propose a new model called DECLARE that follows the human annotation procedure and achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-12-04T06:09:02Z)