Related papers: TocBERT: Medical Document Structure Extraction Using Bidirectional Transformers

URL: http://arxiv.org/abs/2406.19526v1
Date: Thu, 27 Jun 2024 20:56:57 GMT
Title: TocBERT: Medical Document Structure Extraction Using Bidirectional Transformers
Authors: Majd Saleh, Sarra Baghdadi, Stéphane Paquelet
Abstract summary: TocBERT represents a supervised solution trained on the detection of titles and sub-titles from semantic representations. The solution has been applied on a medical text segmentation use-case where the Bio-ClinicalBERT model is fine-tuned to segment discharge summaries of the MIMIC-III dataset. It achieved an F1-score of 84.6% when evaluated on a linear text segmentation problem and 72.8% on a hierarchical text segmentation problem.
Score: 1.2343981093497332
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Text segmentation holds paramount importance in the field of Natural Language Processing (NLP). It plays an important role in several NLP downstream tasks like information retrieval and document summarization. In this work, we propose a new solution, namely TocBERT, for segmenting texts using bidirectional transformers. TocBERT represents a supervised solution trained on the detection of titles and sub-titles from their semantic representations. This task was formulated as a named entity recognition (NER) problem. The solution has been applied on a medical text segmentation use-case where the Bio-ClinicalBERT model is fine-tuned to segment discharge summaries of the MIMIC-III dataset. The performance of TocBERT has been evaluated on a human-labeled ground truth corpus of 250 notes. It achieved an F1-score of 84.6% when evaluated on a linear text segmentation problem and 72.8% on a hierarchical text segmentation problem. It outperformed a carefully designed rule-based solution, particularly in distinguishing titles from subtitles.
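The NER formulation described in the abstract can be illustrated with a small post-processing sketch: given per-token tags from a token classifier, group the tagged tokens into headings and let each heading open a segment that runs to the next heading. This is only an illustration under stated assumptions, not the authors' implementation. The BIO scheme, the label names `TITLE` and `SUBTITLE`, and the helper names `extract_headings` and `linear_segments` are all hypothetical, since the abstract does not specify them.

```python
from typing import List, Tuple

Heading = Tuple[str, str, int]        # (level, heading text, start token index)
Segment = Tuple[str, str, int, int]   # (level, heading text, start, end-exclusive)


def extract_headings(tokens: List[str], tags: List[str]) -> List[Heading]:
    """Group BIO-tagged tokens into heading spans.

    Assumes hypothetical tags such as B-TITLE, I-TITLE, B-SUBTITLE, O.
    """
    spans: List[Heading] = []
    current = None  # [level, words, start index] of the heading being built
    for i, (tok, tag) in enumerate(zip(tokens, tags)):
        if tag.startswith("B-"):
            # A new heading begins; close any heading still open.
            if current is not None:
                spans.append((current[0], " ".join(current[1]), current[2]))
            current = [tag[2:], [tok], i]
        elif tag.startswith("I-") and current is not None and current[0] == tag[2:]:
            # Continuation of the open heading.
            current[1].append(tok)
        else:
            # "O" or an inconsistent I- tag ends the open heading.
            if current is not None:
                spans.append((current[0], " ".join(current[1]), current[2]))
                current = None
    if current is not None:
        spans.append((current[0], " ".join(current[1]), current[2]))
    return spans


def linear_segments(spans: List[Heading], n_tokens: int) -> List[Segment]:
    """Each heading opens a segment that ends where the next heading starts."""
    segments: List[Segment] = []
    for k, (level, text, start) in enumerate(spans):
        end = spans[k + 1][2] if k + 1 < len(spans) else n_tokens
        segments.append((level, text, start, end))
    return segments


# Toy input; in practice the tags would come from a fine-tuned token classifier.
tokens = ["Hospital", "Course", ":", "Stable", ".", "Cardiac", ":", "No", "events", "."]
tags = ["B-TITLE", "I-TITLE", "O", "O", "O", "B-SUBTITLE", "O", "O", "O", "O"]

headings = extract_headings(tokens, tags)
segments = linear_segments(headings, len(tokens))
```

Keeping the level label on each segment means the same output can be read as a flat (linear) segmentation by ignoring the level, or nested into a hierarchy by attaching each `SUBTITLE` segment to the preceding `TITLE`.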

Related papers

Paragraph Segmentation Revisited: Towards a Standard Task for Structuring Speech [61.00008468914252]
We recast paragraph segmentation as the missing structuring step and fill three gaps at the intersection of speech processing and text segmentation. First, the benchmarks focus on the underexplored speech domain, where paragraph segmentation has traditionally not been part of post-processing. Second, we propose a constrained-decoding formulation that lets large language models insert paragraph breaks while preserving the original transcript. Third, we show that a compact model (MiniSeg) attains state-of-the-art accuracy and, when extended hierarchically, jointly predicts chapters and paragraphs with minimal computational cost.
arXiv Detail & Related papers (2025-12-30T23:29:51Z)
LESS: Label-Efficient and Single-Stage Referring 3D Segmentation [55.06002976797879]
Referring 3D segmentation is a visual-language task that segments, from a 3D point cloud, all points of the object specified by a query sentence. We propose a novel Referring 3D pipeline, Label-Efficient and Single-Stage, dubbed LESS, which is supervised only by efficient binary masks. We achieve state-of-the-art performance on the ScanRefer dataset, surpassing previous methods by about 3.7% mIoU using only binary labels.
arXiv Detail & Related papers (2024-10-17T07:47:41Z)
Lexically Grounded Subword Segmentation [0.0]
We present three innovations in tokenization and subword segmentation. First, we propose to use unsupervised morphological analysis with Morfessor as pre-tokenization. Second, we present a method for obtaining subword embeddings grounded in a word embedding space. Third, we introduce an efficient segmentation algorithm based on a subword bigram model.
arXiv Detail & Related papers (2024-06-19T13:48:19Z)
From Text Segmentation to Smart Chaptering: A Novel Benchmark for Structuring Video Transcriptions [63.11097464396147]
We introduce a novel benchmark, YTSeg, focusing on spoken content that is inherently more unstructured and both topically and structurally diverse. We also introduce an efficient hierarchical segmentation model, MiniSeg, that outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2024-02-27T15:59:37Z)
Integrating Bidirectional Long Short-Term Memory with Subword Embedding for Authorship Attribution [2.3429306644730854]
Manifold word-based stylistic markers have been successfully used in deep learning methods to deal with the intrinsic problem of authorship attribution. The proposed method was experimentally evaluated against numerous state-of-the-art methods across the public corpora CCAT50, IMDb62, Blog50, and Twitter50.
arXiv Detail & Related papers (2023-06-26T11:35:47Z)
Where's the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation [65.6736056006381]
We present a multilingual punctuation-agnostic sentence segmentation method covering 85 languages. Our method outperforms all the prior best sentence-segmentation tools by an average of 6.1% F1 points. By using our method to match sentence segmentation to the segmentation used during training of MT models, we achieve an average improvement of 2.3 BLEU points.
arXiv Detail & Related papers (2023-05-30T09:49:42Z)
Influence Functions for Sequence Tagging Models [49.81774968547377]
We extend influence functions to trace predictions back to the training points that informed them. We show the practical utility of segment influence by using the method to identify systematic annotation errors.
arXiv Detail & Related papers (2022-10-25T17:13:11Z)
Many-Class Text Classification with Matching [65.74328417321738]
We formulate Text Classification as a Matching problem between the text and the labels, and propose a simple yet effective framework named TCM. Compared with previous text classification approaches, TCM takes advantage of the fine-grained semantic information of the classification labels.
arXiv Detail & Related papers (2022-05-23T15:51:19Z)
An Interpretable End-to-end Fine-tuning Approach for Long Clinical Text [72.62848911347466]
Unstructured clinical text in EHRs contains crucial information for applications including decision support, trial matching, and retrospective research. Recent work has applied BERT-based models to clinical information extraction and text classification, given these models' state-of-the-art performance in other NLP domains. In this work, we propose a novel fine-tuning approach called SnipBERT. Instead of using entire notes, SnipBERT identifies crucial snippets and feeds them into a truncated BERT-based model in a hierarchical manner.
arXiv Detail & Related papers (2020-11-12T17:14:32Z)
Improving Bi-LSTM Performance for Indonesian Sentiment Analysis Using Paragraph Vector [0.0]
Bidirectional Long Short-Term Memory Network (Bi-LSTM) has shown promising performance in sentiment classification tasks. We propose using an existing document representation method called paragraph vector as additional input features for Bi-LSTM.
arXiv Detail & Related papers (2020-09-12T03:43:30Z)
Disentangling Human Error from the Ground Truth in Segmentation of Medical Images [12.009437407687987]
We present a method for jointly learning, from purely noisy observations alone, the reliability of individual annotators and the true segmentation label distributions. We demonstrate the utility of the method on three public medical imaging segmentation datasets with simulated (when necessary) and real diverse annotations.
arXiv Detail & Related papers (2020-07-31T11:03:12Z)
Incorporating Uncertain Segmentation Information into Chinese NER for Social Media Text [18.455836845989523]
Segmentation error propagation is a challenge for Chinese named entity recognition systems. We propose a model (UIcwsNN) that specializes in identifying entities from Chinese social media text.
arXiv Detail & Related papers (2020-04-14T09:39:35Z)

This list is automatically generated from the titles and abstracts of the papers on this site.