Comparing Variation in Tokenizer Outputs Using a Series of Problematic
and Challenging Biomedical Sentences
- URL: http://arxiv.org/abs/2305.08787v1
- Date: Mon, 15 May 2023 16:46:47 GMT
- Title: Comparing Variation in Tokenizer Outputs Using a Series of Problematic
and Challenging Biomedical Sentences
- Authors: Christopher Meaney, Therese A Stukel, Peter C Austin, Michael Escobar
- Abstract summary: The objective of this study is to explore variation in tokenizer outputs when applied across a series of challenging biomedical sentences.
The tokenizers compared in this study are the NLTK white space tokenizer, the NLTK Penn Tree Bank tokenizer, Spacy and SciSpacy tokenizers, Stanza/Stanza-Craft tokenizers, the UDPipe tokenizer, and R-tokenizers.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Background & Objective: Biomedical text data are increasingly available for
research. Tokenization is an initial step in many biomedical text mining
pipelines. Tokenization is the process of parsing an input biomedical sentence
(represented as a digital character sequence) into a discrete set of word/token
symbols, which convey focused semantic/syntactic meaning. The objective of this
study is to explore variation in tokenizer outputs when applied across a series
of challenging biomedical sentences.
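As a rough illustration of this parsing step (the sentence and the regular
expression below are invented for illustration and are not taken from the
paper), a plain whitespace split can be contrasted with a simple rule-based
tokenizer:

```python
import re

# Invented biomedical-style sentence; not one of the 24 sentences from Diaz [2015].
sentence = "Serum IL-10 levels fell after the x-ray was repeated."

# Simplest possible tokenizer: split the character sequence on whitespace.
whitespace_tokens = sentence.split()

# A small rule-based alternative: keep hyphens, plus signs and slashes inside
# words, and emit other punctuation as separate tokens.
rule_tokens = re.findall(r"[A-Za-z0-9]+(?:[-+/][A-Za-z0-9]+)*|[^\sA-Za-z0-9]", sentence)

print(whitespace_tokens)
print(rule_tokens)
```

The two outputs already differ on sentence-final punctuation, which hints at
why token-level variation matters for downstream biomedical text mining.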
Method: Diaz [2015] introduces 24 challenging example biomedical sentences for
comparing tokenizer performance. In this study, we descriptively explore
variation in outputs of eight tokenizers applied to each example biomedical
sentence. The tokenizers compared in this study are the NLTK white space
tokenizer, the NLTK Penn Tree Bank tokenizer, Spacy and SciSpacy tokenizers,
Stanza/Stanza-Craft tokenizers, the UDPipe tokenizer, and R-tokenizers.
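A minimal sketch of this comparison, assuming NLTK and spaCy are installed
(SciSpacy, Stanza, UDPipe and the R tokenizers package follow the same pattern
but require additional model downloads); the example sentence is invented and
is not one of the 24 sentences from Diaz [2015]:

```python
from nltk.tokenize import WhitespaceTokenizer, TreebankWordTokenizer
import spacy

# Invented stand-in for a challenging biomedical sentence.
sentence = "CD4+ CD8+ T-cells expressing the TCR/CD3 complex secrete IL-10."

# spaCy's default rule-based tokenizer; the study's pretrained pipelines
# (e.g. en_core_web_sm, en_core_sci_sm) would be loaded with spacy.load(...).
nlp = spacy.blank("en")

tokenizers = {
    "nltk_whitespace": WhitespaceTokenizer().tokenize,
    "nltk_penn_treebank": TreebankWordTokenizer().tokenize,
    "spacy": lambda s: [t.text for t in nlp(s)],
    # "scispacy": requires the en_core_sci_sm model
    # "stanza":   requires stanza.download("en") and a tokenize pipeline
    # UDPipe and the R tokenizers package sit outside this Python snippet
}

for name, tokenize in tokenizers.items():
    print(f"{name:>20}: {tokenize(sentence)}")
```

Printing the raw token lists side by side is enough for the kind of
descriptive comparison the study reports.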
Results: For many examples, the tokenizers performed similarly; however, for
certain examples, there was meaningful variation in the returned outputs. The
white space tokenizer often performed differently from the other tokenizers.
We observed similar performance between tokenizers implementing rule-based
systems (e.g. pattern matching and regular expressions) and tokenizers
implementing neural architectures for token classification. Oftentimes, the
challenging tokens producing the greatest variation in outputs are those words
which convey substantive and focused biomedical/clinical meaning (e.g. x-ray,
IL-10, TCR/CD3, CD4+ CD8+, and (Ca2+)-regulated).
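One way to summarize this kind of variation programmatically (an illustrative
sketch, not the authors' analysis code) is to count the distinct token
sequences a set of tokenizers returns for a sentence and list the tokens on
which they disagree:

```python
from collections import Counter
from nltk.tokenize import WhitespaceTokenizer, TreebankWordTokenizer

# Two of the study's tokenizers; the remaining ones would be added the same way.
tokenizers = {
    "nltk_whitespace": WhitespaceTokenizer().tokenize,
    "nltk_penn_treebank": TreebankWordTokenizer().tokenize,
}

def tokenization_variation(sentence):
    outputs = {name: tuple(tok(sentence)) for name, tok in tokenizers.items()}
    # How many distinct token sequences were returned for this sentence?
    n_distinct = len(set(outputs.values()))
    # Tokens emitted by some, but not all, of the tokenizers.
    counts = Counter(t for toks in outputs.values() for t in set(toks))
    disputed = sorted(t for t, c in counts.items() if c < len(outputs))
    return n_distinct, disputed

# Invented sentence containing the kinds of tokens highlighted above.
print(tokenization_variation(
    "The (Ca2+)-regulated pathway was confirmed by x-ray and IL-10 assays."))
```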
Conclusion: When state-of-the-art, open-source tokenizers from Python and R
were applied to a series of challenging biomedical example sentences, we
observed subtle variation in the returned outputs.
Related papers
- Team Ryu's Submission to SIGMORPHON 2024 Shared Task on Subword Tokenization [3.0023392750520883]
My submission explores whether morphological segmentation methods can be used as a part of subword tokenizers.
The prediction results show that morphological segmentation could be as effective as commonly used subword tokenizers.
A tokenizer with a balanced token frequency distribution tends to work better.
arXiv Detail & Related papers (2024-10-19T04:06:09Z)
- SEP: Self-Enhanced Prompt Tuning for Visual-Language Model [68.68025991850115]
We introduce a novel approach named Self-Enhanced Prompt Tuning (SEP).
SEP explicitly incorporates discriminative prior knowledge to enhance both textual-level and visual-level embeddings.
Comprehensive evaluations across various benchmarks and tasks confirm SEP's efficacy in prompt tuning.
arXiv Detail & Related papers (2024-05-24T13:35:56Z)
- Tokenization Is More Than Compression [14.939912120571728]
Existing tokenization approaches like Byte-Pair Encoding (BPE) originate from
the field of data compression.
We introduce PathPiece, a new tokenizer that segments a document's text into the minimum number of tokens for a given vocabulary.
arXiv Detail & Related papers (2024-02-28T14:52:15Z)
- Sub-Sentence Encoder: Contrastive Learning of Propositional Semantic
Representations [102.05351905494277]
Sub-sentence encoder is a contrastively-learned contextual embedding model for fine-grained semantic representation of text.
We show that sub-sentence encoders keep the same level of inference cost and space complexity compared to sentence encoders.
arXiv Detail & Related papers (2023-11-07T20:38:30Z)
- Analyzing Cognitive Plausibility of Subword Tokenization [9.510439539246846]
Subword tokenization has become the de facto standard for tokenization.
We present a new evaluation paradigm that focuses on the cognitive plausibility of subword tokenization.
arXiv Detail & Related papers (2023-10-20T08:25:37Z)
- Better Than Whitespace: Information Retrieval for Languages without
Custom Tokenizers [48.036317742487796]
We propose a new approach to tokenization for lexical matching retrieval algorithms.
We use the WordPiece tokenizer, which can be built automatically from unsupervised data.
Results show that the mBERT tokenizer provides strong relevance signals for retrieval "out of the box", outperforming whitespace tokenization on most languages.
arXiv Detail & Related papers (2022-10-11T14:32:46Z)
- Extracting Grammars from a Neural Network Parser for Anomaly Detection
in Unknown Formats [79.6676793507792]
Reinforcement learning has recently shown promise as a technique for training an artificial neural network to parse sentences in some unknown format.
This paper presents procedures for extracting production rules from the neural network, and for using these rules to determine whether a given sentence is nominal or anomalous.
arXiv Detail & Related papers (2021-07-30T23:10:24Z)
- A Case Study of Spanish Text Transformations for Twitter Sentiment
Analysis [1.9694608733361543]
Sentiment analysis is a text mining task that determines the polarity of a given text, i.e., whether it is positive or negative.
New forms of textual expression present new challenges for analyzing text, given the use of slang and orthographic and grammatical errors.
arXiv Detail & Related papers (2021-06-03T17:24:31Z)
- Automating the Compilation of Potential Core-Outcomes for Clinical
Trials [0.0]
The objective of this paper is to describe an automated method utilizing natural language processing to compile the probable core outcomes of clinical trials.
In addition to BioBERT, an unsupervised feature-based approach making use of only the encoder output embedding representations was utilized.
This method was able to both harness the domain-specific context of each of the tokens from the learned embeddings of the BioBERT model as well as a more stable metric of sentence similarity.
arXiv Detail & Related papers (2021-01-11T18:14:49Z)
- Syntactic representation learning for neural network based TTS with
syntactic parse tree traversal [49.05471750563229]
We propose a syntactic representation learning method based on syntactic parse trees to automatically utilize syntactic structure information.
Experimental results demonstrate the effectiveness of our proposed approach.
For sentences with multiple syntactic parse trees, prosodic differences can be clearly perceived in the synthesized speech.
arXiv Detail & Related papers (2020-12-13T05:52:07Z)
- A Comparative Study on Structural and Semantic Properties of Sentence
Embeddings [77.34726150561087]
We propose a set of experiments using a widely-used large-scale data set for relation extraction.
We show that different embedding spaces have different degrees of strength for the structural and semantic properties.
These results provide useful information for developing embedding-based relation extraction methods.
arXiv Detail & Related papers (2020-09-23T15:45:32Z)