Comparing Variation in Tokenizer Outputs Using a Series of Problematic
and Challenging Biomedical Sentences
- URL: http://arxiv.org/abs/2305.08787v1
- Date: Mon, 15 May 2023 16:46:47 GMT
- Title: Comparing Variation in Tokenizer Outputs Using a Series of Problematic
and Challenging Biomedical Sentences
- Authors: Christopher Meaney, Therese A Stukel, Peter C Austin, Michael Escobar
- Abstract summary: The objective of this study is to explore variation in tokenizer outputs when applied across a series of challenging biomedical sentences.
The tokenizers compared in this study are the NLTK white space tokenizer, the NLTK Penn Tree Bank tokenizer, Spacy and SciSpacy tokenizers, Stanza/Stanza-Craft tokenizers, the UDPipe tokenizer, and R-tokenizers.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Background & Objective: Biomedical text data are increasingly available for
research. Tokenization is an initial step in many biomedical text mining
pipelines. Tokenization is the process of parsing an input biomedical sentence
(represented as a digital character sequence) into a discrete set of word/token
symbols, which convey focused semantic/syntactic meaning. The objective of this
study is to explore variation in tokenizer outputs when applied across a series
of challenging biomedical sentences.
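As a rough illustration of this parsing step (the sentence and the regular
expression below are invented for illustration and are not taken from the
paper), a plain whitespace split can be contrasted with a simple rule-based
tokenizer:

```python
import re

# Invented biomedical-style sentence; not one of the 24 sentences from Diaz [2015].
sentence = "Serum IL-10 levels fell after the x-ray was repeated."

# Simplest possible tokenizer: split the character sequence on whitespace.
whitespace_tokens = sentence.split()

# A small rule-based alternative: keep hyphens, plus signs and slashes inside
# words, and emit other punctuation as separate tokens.
rule_tokens = re.findall(r"[A-Za-z0-9]+(?:[-+/][A-Za-z0-9]+)*|[^\sA-Za-z0-9]", sentence)

print(whitespace_tokens)
print(rule_tokens)
```

The two outputs already differ on sentence-final punctuation, which hints at
why token-level variation matters for downstream biomedical text mining.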
Method: Diaz [2015] introduces 24 challenging example biomedical sentences for
comparing tokenizer performance. In this study, we descriptively explore
variation in outputs of eight tokenizers applied to each example biomedical
sentence. The tokenizers compared in this study are the NLTK white space
tokenizer, the NLTK Penn Tree Bank tokenizer, Spacy and SciSpacy tokenizers,
Stanza/Stanza-Craft tokenizers, the UDPipe tokenizer, and R-tokenizers.
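A minimal sketch of this comparison, assuming NLTK and spaCy are installed
(SciSpacy, Stanza, UDPipe and the R tokenizers package follow the same pattern
but require additional model downloads); the example sentence is invented and
is not one of the 24 sentences from Diaz [2015]:

```python
from nltk.tokenize import WhitespaceTokenizer, TreebankWordTokenizer
import spacy

# Invented stand-in for a challenging biomedical sentence.
sentence = "CD4+ CD8+ T-cells expressing the TCR/CD3 complex secrete IL-10."

# spaCy's default rule-based tokenizer; the study's pretrained pipelines
# (e.g. en_core_web_sm, en_core_sci_sm) would be loaded with spacy.load(...).
nlp = spacy.blank("en")

tokenizers = {
    "nltk_whitespace": WhitespaceTokenizer().tokenize,
    "nltk_penn_treebank": TreebankWordTokenizer().tokenize,
    "spacy": lambda s: [t.text for t in nlp(s)],
    # "scispacy": requires the en_core_sci_sm model
    # "stanza":   requires stanza.download("en") and a tokenize pipeline
    # UDPipe and the R tokenizers package sit outside this Python snippet
}

for name, tokenize in tokenizers.items():
    print(f"{name:>20}: {tokenize(sentence)}")
```

Printing the raw token lists side by side is enough for the kind of
descriptive comparison the study reports.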
Results: For many examples, the tokenizers performed similarly; however, for
certain examples, there was meaningful variation in the returned outputs. The
white space tokenizer often performed differently from the other tokenizers.
We observed similar performance between tokenizers implementing rule-based
systems (e.g. pattern matching and regular expressions) and tokenizers
implementing neural architectures for token classification. Oftentimes, the
challenging tokens producing the greatest variation in outputs are those words
which convey substantive and focused biomedical/clinical meaning (e.g. x-ray,
IL-10, TCR/CD3, CD4+ CD8+, and (Ca2+)-regulated).
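One way to summarize this kind of variation programmatically (an illustrative
sketch, not the authors' analysis code) is to count the distinct token
sequences a set of tokenizers returns for a sentence and list the tokens on
which they disagree:

```python
from collections import Counter
from nltk.tokenize import WhitespaceTokenizer, TreebankWordTokenizer

# Two of the study's tokenizers; the remaining ones would be added the same way.
tokenizers = {
    "nltk_whitespace": WhitespaceTokenizer().tokenize,
    "nltk_penn_treebank": TreebankWordTokenizer().tokenize,
}

def tokenization_variation(sentence):
    outputs = {name: tuple(tok(sentence)) for name, tok in tokenizers.items()}
    # How many distinct token sequences were returned for this sentence?
    n_distinct = len(set(outputs.values()))
    # Tokens emitted by some, but not all, of the tokenizers.
    counts = Counter(t for toks in outputs.values() for t in set(toks))
    disputed = sorted(t for t, c in counts.items() if c < len(outputs))
    return n_distinct, disputed

# Invented sentence containing the kinds of tokens highlighted above.
print(tokenization_variation(
    "The (Ca2+)-regulated pathway was confirmed by x-ray and IL-10 assays."))
```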
Conclusion: When state-of-the-art, open-source tokenizers from Python and R
were applied to a series of challenging biomedical example sentences, we
observed subtle variation in the returned outputs.
Related papers
- Team Ryu's Submission to SIGMORPHON 2024 Shared Task on Subword Tokenization [3.0023392750520883]
My submission explores whether morphological segmentation methods can be used as a part of subword tokenizers.
The prediction results show that morphological segmentation could be as effective as commonly used subword tokenizers.
A tokenizer with a balanced token frequency distribution tends to work better.
arXiv Detail & Related papers (2024-10-19T04:06:09Z)
- SEP: Self-Enhanced Prompt Tuning for Visual-Language Model [68.68025991850115]
We introduce a novel approach named Self-Enhanced Prompt Tuning (SEP).
SEP explicitly incorporates discriminative prior knowledge to enhance both textual-level and visual-level embeddings.
Comprehensive evaluations across various benchmarks and tasks confirm SEP's efficacy in prompt tuning.
arXiv Detail & Related papers (2024-05-24T13:35:56Z)
- Tokenization Is More Than Compression [14.939912120571728]
Existing tokenization approaches like Byte-Pair Encoding (BPE) originate from
the field of data compression.
We introduce PathPiece, a new tokenizer that segments a document's text into the minimum number of tokens for a given vocabulary.
arXiv Detail & Related papers (2024-02-28T14:52:15Z)
- Sub-Sentence Encoder: Contrastive Learning of Propositional Semantic
Representations [102.05351905494277]
Sub-sentence encoder is a contrastively-learned contextual embedding model for fine-grained semantic representation of text.
We show that sub-sentence encoders keep the same level of inference cost and space complexity compared to sentence encoders.
arXiv Detail & Related papers (2023-11-07T20:38:30Z)
- Analyzing Cognitive Plausibility of Subword Tokenization [9.510439539246846]
Subword tokenization has become the de facto standard for tokenization.
We present a new evaluation paradigm that focuses on the cognitive plausibility of subword tokenization.
arXiv Detail & Related papers (2023-10-20T08:25:37Z)
- Better Than Whitespace: Information Retrieval for Languages without
Custom Tokenizers [48.036317742487796]
We propose a new approach to tokenization for lexical matching retrieval algorithms.
We use the WordPiece tokenizer, which can be built automatically from unsupervised data.
Results show that the mBERT tokenizer provides strong relevance signals for retrieval "out of the box", outperforming whitespace tokenization on most languages.
arXiv Detail & Related papers (2022-10-11T14:32:46Z)
- Extracting Grammars from a Neural Network Parser for Anomaly Detection
in Unknown Formats [79.6676793507792]
Reinforcement learning has recently shown promise as a technique for training an artificial neural network to parse sentences in some unknown format.
This paper presents procedures for extracting production rules from the neural network, and for using these rules to determine whether a given sentence is nominal or anomalous.
arXiv Detail & Related papers (2021-07-30T23:10:24Z)
- A Case Study of Spanish Text Transformations for Twitter Sentiment
Analysis [1.9694608733361543]
Sentiment analysis is a text mining task that determines the polarity of a given text, i.e., whether it is positive or negative.
New forms of textual expression present new challenges for analyzing text, given the use of slang and orthographic and grammatical errors.
arXiv Detail & Related papers (2021-06-03T17:24:31Z)
- Automating the Compilation of Potential Core-Outcomes for Clinical
Trials [0.0]
The objective of this paper is to describe an automated method utilizing natural language processing to compile the probable core outcomes of clinical trials.
In addition to BioBERT, an unsupervised feature-based approach making use of only the encoder output embedding representations was utilized.
This method was able to both harness the domain-specific context of each of the tokens from the learned embeddings of the BioBERT model as well as a more stable metric of sentence similarity.
arXiv Detail & Related papers (2021-01-11T18:14:49Z)
- Syntactic representation learning for neural network based TTS with
syntactic parse tree traversal [49.05471750563229]
We propose a syntactic representation learning method based on syntactic parse trees to automatically utilize syntactic structure information.
Experimental results demonstrate the effectiveness of our proposed approach.
For sentences with multiple syntactic parse trees, prosodic differences can be clearly perceived in the synthesized speech.
arXiv Detail & Related papers (2020-12-13T05:52:07Z)
- A Comparative Study on Structural and Semantic Properties of Sentence
Embeddings [77.34726150561087]
We propose a set of experiments using a widely-used large-scale data set for relation extraction.
We show that different embedding spaces have different degrees of strength for the structural and semantic properties.
These results provide useful information for developing embedding-based relation extraction methods.
arXiv Detail & Related papers (2020-09-23T15:45:32Z)