Evaluating Various Tokenizers for Arabic Text Classification
- URL: http://arxiv.org/abs/2106.07540v1
- Date: Mon, 14 Jun 2021 16:05:58 GMT
- Title: Evaluating Various Tokenizers for Arabic Text Classification
- Authors: Zaid Alyafeai, Maged S. Al-shaibani, Mustafa Ghaleb, Irfan Ahmad
- Abstract summary: We introduce three new tokenization algorithms for Arabic and compare them to three other baselines using unsupervised evaluations.
Our experiments show that the performance of such tokenization algorithms depends on the size of the dataset, type of the task, and the amount of morphology that exists in the dataset.
- Score: 4.110108749051656
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The first step in any NLP pipeline is learning word vector representations. However, given a large text corpus, representing every word is not efficient. Many tokenization algorithms have emerged in the literature to tackle this problem by creating subwords, which in turn limits the vocabulary size of any text corpus. However, such algorithms are mostly language-agnostic and lack a principled way of capturing meaningful tokens, and evaluating such techniques in practice is difficult. In this paper, we introduce three new tokenization algorithms for Arabic and compare them to three other baselines using unsupervised evaluations. In addition, we compare all six algorithms by evaluating them on three tasks: sentiment analysis, news classification, and poetry classification. Our experiments show that the performance of these tokenization algorithms depends on the size of the dataset, the type of the task, and the amount of morphology in the dataset.
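As an illustrative baseline only (not one of the paper's three new Arabic tokenizers), the sketch below trains a generic BPE subword tokenizer with the Hugging Face tokenizers library and encodes an Arabic sentence; the corpus file name and vocabulary size are assumptions made for the example.

```python
# Illustrative baseline: a generic BPE subword tokenizer, not the paper's method.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# "arabic_corpus.txt" is a hypothetical plain-text corpus; the vocabulary size
# of 10,000 is an arbitrary choice for the sketch.
trainer = BpeTrainer(vocab_size=10_000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["arabic_corpus.txt"], trainer=trainer)

# Subword units keep the vocabulary bounded even for morphologically rich words.
print(tokenizer.encode("المكتبات العامة مفتوحة اليوم").tokens)
```

Whether word-level, morphology-aware, or character-level units work best will depend, as the abstract notes, on dataset size, task type, and the amount of morphology in the data.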
Related papers
- Batching BPE Tokenization Merges [55.2480439325792]
BatchBPE is an open-source, pure-Python implementation of the Byte Pair Encoding (BPE) algorithm.
It can be used to train a high-quality tokenizer on a basic laptop.
arXiv Detail & Related papers (2024-08-05T09:37:21Z)
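The BatchBPE entry above describes a pure-Python Byte Pair Encoding trainer. The snippet below is a rough, self-contained sketch of the core BPE training loop (it is not BatchBPE's actual code or API): count adjacent symbol pairs over word frequencies and repeatedly merge the most frequent pair.

```python
from collections import Counter

def pair_counts(words):
    # words maps a tuple of symbols (initially characters) to its corpus frequency.
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(words, pair):
    # Replace every occurrence of the chosen pair with a single merged symbol.
    a, b = pair
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: word frequencies with each word split into characters.
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}
merges = []
for _ in range(10):  # the number of merges is a hyperparameter
    counts = pair_counts(words)
    if not counts:
        break
    best = counts.most_common(1)[0][0]
    merges.append(best)
    words = merge_pair(words, best)
print(merges)
```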
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is their ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
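The in-context-learning entry above relies on instruction-style prompts rather than task-specific training. A minimal, provider-agnostic sketch of such a prompt for sentiment classification (the label set, wording, and example text are assumptions, and the actual model call is omitted since it varies by provider):

```python
def zero_shot_prompt(text: str, labels: list[str]) -> str:
    # Build an instruction-style prompt asking the model to pick one label.
    label_list = ", ".join(labels)
    return (
        f"Classify the following review as one of: {label_list}.\n"
        f"Review: {text}\n"
        f"Label:"
    )

# Arabic example meaning roughly "the service was excellent".
print(zero_shot_prompt("الخدمة كانت ممتازة", ["positive", "negative"]))
```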
- Greed is All You Need: An Evaluation of Tokenizer Inference Methods [4.300681074103876]
We provide a controlled analysis of seven tokenizer inference methods across four different algorithms and three vocabulary sizes.
We show that for the most commonly used tokenizers, greedy inference performs surprisingly well; and that SaGe, a recently-introduced contextually-informed tokenizer, outperforms all others on morphological alignment.
arXiv Detail & Related papers (2024-03-02T19:01:40Z)
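The tokenizer-inference entry above finds that greedy inference is a strong default for segmenting words with a fixed subword vocabulary. A minimal sketch of greedy longest-prefix-match inference (the toy vocabulary is an assumption, and real tokenizers additionally handle continuation markers such as "##"):

```python
def greedy_tokenize(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    # At each position, take the longest vocabulary piece that matches.
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest remaining prefix first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(unk)  # no prefix matched; emit an unknown marker
            i += 1
    return tokens

vocab = {"un", "break", "able", "b", "r", "e", "a", "k"}
print(greedy_tokenize("unbreakable", vocab))  # ['un', 'break', 'able']
```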
- Analyzing Cognitive Plausibility of Subword Tokenization [9.510439539246846]
Subword tokenization has become the de facto standard for tokenization.
We present a new evaluation paradigm that focuses on the cognitive plausibility of subword tokenization.
arXiv Detail & Related papers (2023-10-20T08:25:37Z)
- A Gold Standard Dataset for the Reviewer Assignment Problem [117.59690218507565]
"Similarity score" is a numerical estimate of the expertise of a reviewer in reviewing a paper.
Our dataset consists of 477 self-reported expertise scores provided by 58 researchers.
For the task of ordering two papers in terms of their relevance for a reviewer, the error rates range from 12%-30% in easy cases to 36%-43% in hard cases.
arXiv Detail & Related papers (2023-03-23T16:15:03Z)
- A comparison of several AI techniques for authorship attribution on Romanian texts [0.0]
We compare AI techniques for classifying literary texts written by multiple authors.
We also introduce a new dataset of texts written in the Romanian language, on which we ran the algorithms.
arXiv Detail & Related papers (2022-11-09T20:24:48Z)
- Better Than Whitespace: Information Retrieval for Languages without Custom Tokenizers [48.036317742487796]
We propose a new approach to tokenization for lexical matching retrieval algorithms.
We use the WordPiece tokenizer, which can be built automatically from unsupervised data.
Results show that the mBERT tokenizer provides strong relevance signals for retrieval "out of the box", outperforming whitespace tokenization on most languages.
arXiv Detail & Related papers (2022-10-11T14:32:46Z)
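The retrieval entry above swaps whitespace tokenization for an existing multilingual WordPiece vocabulary. A minimal sketch of producing such subword tokens with the Hugging Face transformers library (the checkpoint name is the standard mBERT one; how the tokens feed a lexical matcher such as BM25 is left open):

```python
from transformers import AutoTokenizer

# bert-base-multilingual-cased ships mBERT's WordPiece vocabulary.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# Index documents and queries over the same subword stream instead of
# whitespace tokens; useful for languages without a custom tokenizer.
print(tokenizer.tokenize("استرجاع المعلومات بدون محلل مخصص"))
```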
- Improving Tokenisation by Alternative Treatment of Spaces [7.596737214110957]
We experiment with an alternative tokenisation approach where spaces are always treated as individual tokens.
We find that our modified algorithms lead to improved performance on downstream NLP tasks.
arXiv Detail & Related papers (2022-04-08T13:22:30Z)
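The space-treatment entry above keeps spaces as tokens in their own right rather than attaching them to neighbouring words. A rough sketch of such a pre-tokenisation step (the paper's exact treatment may differ, e.g. in how runs of spaces are handled):

```python
import re

def space_aware_pretokenize(text: str) -> list[str]:
    # Keep each run of spaces as its own token instead of folding it into the
    # following word, so later subword learning never crosses a space boundary.
    return [piece for piece in re.split(r"( +)", text) if piece]

print(space_aware_pretokenize("hello   wide world"))
# ['hello', '   ', 'wide', ' ', 'world']
```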
"ARTH" is a self-learning set of algorithms that is an intelligent way of fulfilling the need for "reading and understanding the text effortlessly"
The technology "ARTH" focuses on the revival of the joy of reading among those people, who have a poor vocabulary or any word processing issues.
arXiv Detail & Related papers (2021-01-23T09:39:45Z)
- Accelerating Text Mining Using Domain-Specific Stop Word Lists [57.76576681191192]
We present a novel approach, the hyperplane-based approach, for the automatic extraction of domain-specific words.
The hyperplane-based approach can significantly reduce text dimensionality by eliminating irrelevant features.
Results indicate that the hyperplane-based approach can reduce the dimensionality of the corpus by 90% and that it outperforms mutual information.
arXiv Detail & Related papers (2020-11-18T17:42:32Z)
- TextScanner: Reading Characters in Order for Robust Scene Text Recognition [60.04267660533966]
TextScanner is an alternative approach for scene text recognition.
It generates pixel-wise, multi-channel segmentation maps for character class, position and order.
It also adopts an RNN for context modeling and performs parallel prediction of character position and class.
arXiv Detail & Related papers (2019-12-28T07:52:00Z)