Better Than Whitespace: Information Retrieval for Languages without
Custom Tokenizers
- URL: http://arxiv.org/abs/2210.05481v1
- Date: Tue, 11 Oct 2022 14:32:46 GMT
- Title: Better Than Whitespace: Information Retrieval for Languages without
Custom Tokenizers
- Authors: Odunayo Ogundepo, Xinyu Zhang, and Jimmy Lin
- Abstract summary: We propose a new approach to tokenization for lexical matching retrieval algorithms.
We use the WordPiece tokenizer, which can be built automatically from unsupervised data.
Results show that the mBERT tokenizer provides strong relevance signals for retrieval "out of the box", outperforming whitespace tokenization on most languages.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tokenization is a crucial step in information retrieval, especially for
lexical matching algorithms, where the quality of indexable tokens directly
impacts the effectiveness of a retrieval system. Since different languages have
unique properties, the design of the tokenization algorithm is usually
language-specific and requires at least some linguistic knowledge. However, only
a handful of the 7000+ languages on the planet benefit from specialized,
custom-built tokenization algorithms, while the other languages are stuck with
a "default" whitespace tokenizer, which cannot capture the intricacies of
different languages. To address this challenge, we propose a different approach
to tokenization for lexical matching retrieval algorithms (e.g., BM25): using
the WordPiece tokenizer, which can be built automatically from unsupervised
data. We test the approach on 11 typologically diverse languages in the Mr. TyDi
collection: results show that the mBERT tokenizer provides strong relevance
signals for retrieval "out of the box", outperforming whitespace tokenization
on most languages. In many cases, our approach also improves retrieval
effectiveness when combined with existing custom-built tokenizers.
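As a minimal sketch of the proposed approach, the snippet below tokenizes documents and queries with mBERT's WordPiece tokenizer and scores them with BM25. It assumes the HuggingFace `transformers` and `rank_bm25` packages; the paper's experiments use an Anserini/Pyserini BM25 setup, so this illustrates the idea rather than reproducing the authors' pipeline.

```python
# Sketch: swap whitespace tokenization for mBERT's WordPiece in a BM25 pipeline.
from transformers import AutoTokenizer
from rank_bm25 import BM25Okapi

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def wordpiece_tokens(text: str) -> list[str]:
    # `tokenize` applies mBERT's WordPiece segmentation; subword pieces
    # (e.g., "##ing") become indexable terms, no language-specific rules needed.
    return tokenizer.tokenize(text)

corpus = [
    "Tokenization is a crucial step in information retrieval.",
    "Whitespace tokenization cannot capture the intricacies of many languages.",
]
bm25 = BM25Okapi([wordpiece_tokens(doc) for doc in corpus])
scores = bm25.get_scores(wordpiece_tokens("tokenization for retrieval"))
print(scores)  # higher score = stronger lexical match under WordPiece terms
```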
Related papers
- Egalitarian Language Representation in Language Models: It All Begins with Tokenizers
We show that not all tokenizers offer fair representation for complex script languages such as Tamil, Sinhala, and Hindi.
We introduce an improvement to the Byte Pair algorithm by incorporating graphemes, which we term Grapheme Pair.
Our experiments show that grapheme-based character extraction outperforms byte-level tokenizers for complex scripts.
arXiv Detail & Related papers (2024-09-17T19:05:37Z)
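A minimal sketch of the grapheme extraction step underlying the Grapheme Pair idea above, assuming the third-party `regex` package (the paper's own algorithm is not shown, only the grapheme-cluster segmentation it builds on):

```python
# Segment text into extended grapheme clusters (user-perceived characters)
# rather than raw bytes or codepoints before pairing.
import regex

def graphemes(text: str) -> list[str]:
    # \X matches one extended grapheme cluster, so a base character plus its
    # combining marks stays together (important for Tamil, Sinhala, Hindi).
    return regex.findall(r"\X", text)

print(graphemes("நன்றி"))  # Tamil: vowel signs stay attached to consonants
print(list("நன்றி"))       # codepoint-level view splits the same word apart
```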
- Constructing a BPE Tokenization DFA
Many natural language processing systems operate over tokenizations of text to address the open-vocabulary problem.
We give and analyze an algorithm for the efficient construction of deterministic finite automata designed to operate directly on tokenizations produced by the popular byte pair encoding technique.
arXiv Detail & Related papers (2024-05-13T11:59:24Z)
- How do different tokenizers perform on downstream tasks in scriptio continua languages?: A case study in Japanese
This paper investigates the effect of tokenizers on the downstream performance of pretrained language models (PLMs) in scriptio continua languages where no explicit spaces exist between words.
The tokenizer for such languages often consists of a morphological analyzer and a subword tokenizer, requiring us to conduct a comprehensive study of all possible pairs.
We train extensive sets of tokenizers, build a PLM using each, and measure the downstream performance on a wide range of tasks.
arXiv Detail & Related papers (2023-06-16T01:22:32Z)
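A minimal sketch of the two-stage pipeline described above (morphological analyzer followed by a subword tokenizer), assuming the `fugashi` MeCab wrapper with a unidic dictionary and a HuggingFace tokenizer; the paper evaluates many analyzer/subword pairs, and this shows just one illustrative combination:

```python
# Stage 1: morphological analysis splits scriptio continua text into words.
# Stage 2: a subword tokenizer splits those words further.
from fugashi import Tagger            # requires a MeCab dictionary, e.g. unidic-lite
from transformers import AutoTokenizer

tagger = Tagger()
subword = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def tokenize_ja(text: str) -> list[str]:
    words = [token.surface for token in tagger(text)]  # stage 1: words
    pieces = []
    for word in words:
        pieces.extend(subword.tokenize(word))          # stage 2: subwords
    return pieces

print(tokenize_ja("日本語の文には単語間に空白がない"))
```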
- A Vocabulary-Free Multilingual Neural Tokenizer for End-to-End Task Learning
Subword tokenization is a commonly used input pre-processing step in most recent NLP models.
We propose a vocabulary-free neural tokenizer by distilling segmentation information from subword tokenization.
Our tokenizer consistently improves performance on multilingual (NLI) and code-switching (sentiment analysis) tasks.
arXiv Detail & Related papers (2022-04-22T16:50:49Z)
- Improving Tokenisation by Alternative Treatment of Spaces
We experiment with an alternative tokenisation approach where spaces are always treated as individual tokens.
We find that our modified algorithms lead to improved performance on downstream NLP tasks.
arXiv Detail & Related papers (2022-04-08T13:22:30Z)
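A minimal sketch of the space-as-token idea above; `split_keep_spaces` is a hypothetical helper, not the modified tokenisation algorithms from the paper:

```python
# Pre-tokenizer that emits each space as its own token instead of
# using it only as a delimiter to be discarded.
import re

def split_keep_spaces(text: str) -> list[str]:
    # The capture group keeps the single-space separators in the output list.
    return [tok for tok in re.split(r"( )", text) if tok]

print(split_keep_spaces("always treat spaces as tokens"))
# ['always', ' ', 'treat', ' ', 'spaces', ' ', 'as', ' ', 'tokens']
```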
- Zero-Shot Cross-lingual Semantic Parsing
We study cross-lingual semantic parsing as a zero-shot problem without parallel data for 7 test languages.
We propose a multi-task encoder-decoder model to transfer parsing knowledge to additional languages using only English-Logical form paired data.
Our system frames zero-shot parsing as a latent-space alignment problem and finds that pre-trained models can be improved to generate logical forms with minimal cross-lingual transfer penalty.
arXiv Detail & Related papers (2021-04-15T16:08:43Z)
- Multi-view Subword Regularization
Multi-view Subword Regularization (MVR) enforces consistency between predictions on inputs tokenized by the standard deterministic segmentation and predictions on inputs tokenized by sampled probabilistic segmentations.
Results on the XTREME multilingual benchmark show that MVR brings consistent improvements of up to 2.5 points over using standard segmentation algorithms.
arXiv Detail & Related papers (2021-03-15T16:07:42Z)
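A minimal sketch of the consistency term behind MVR above, assuming PyTorch; the shapes and the symmetric-KL choice are illustrative, not the paper's exact formulation:

```python
# Penalize divergence between model predictions on a deterministically
# segmented input and on a probabilistically sampled segmentation.
import torch
import torch.nn.functional as F

def mvr_consistency_loss(logits_standard: torch.Tensor,
                         logits_sampled: torch.Tensor) -> torch.Tensor:
    # Symmetric KL between the two predictive distributions.
    p = F.log_softmax(logits_standard, dim=-1)
    q = F.log_softmax(logits_sampled, dim=-1)
    kl_pq = F.kl_div(q, p.exp(), reduction="batchmean")  # KL(p || q)
    kl_qp = F.kl_div(p, q.exp(), reduction="batchmean")  # KL(q || p)
    return 0.5 * (kl_pq + kl_qp)

loss = mvr_consistency_loss(torch.randn(8, 5), torch.randn(8, 5))
```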
- Intrinsic Probing through Dimension Selection
Most modern NLP systems make use of pre-trained contextual representations that attain astonishingly high performance on a variety of tasks.
Such high performance should not be possible unless some form of linguistic structure inheres in these representations, and a wealth of research has sprung up on probing for it.
In this paper, we draw a distinction between intrinsic probing, which examines how linguistic information is structured within a representation, and the extrinsic probing popular in prior work, which only argues for the presence of such information by showing that it can be successfully extracted.
arXiv Detail & Related papers (2020-10-06T15:21:08Z)
- Inducing Language-Agnostic Multilingual Representations
Cross-lingual representations have the potential to make NLP techniques available to the vast majority of languages in the world.
We examine three approaches for this: (i) re-aligning the vector spaces of target languages to a pivot source language; (ii) removing language-specific means and variances, which yields better discriminativeness of embeddings as a by-product; and (iii) increasing input similarity across languages by removing morphological contractions and sentence reordering.
arXiv Detail & Related papers (2020-08-20T17:58:56Z)
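A minimal sketch of approach (ii) above, removing per-language means and variances from sentence embeddings; `embeddings_by_lang` is a hypothetical input layout, not the paper's data:

```python
# Standardize embeddings per language so the remaining variation
# is more language-agnostic.
import numpy as np

def standardize_per_language(embeddings_by_lang: dict) -> dict:
    out = {}
    for lang, emb in embeddings_by_lang.items():
        mu = emb.mean(axis=0, keepdims=True)           # language-specific mean
        sigma = emb.std(axis=0, keepdims=True) + 1e-8  # language-specific std
        out[lang] = (emb - mu) / sigma
    return out

embs = {"en": np.random.randn(100, 768), "de": np.random.randn(100, 768)}
aligned = standardize_per_language(embs)
```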
- On the Importance of Word Order Information in Cross-lingual Sequence Labeling
Cross-lingual models fit to the word order of the source language may fail to generalize to target languages with different word orders.
We investigate whether making models insensitive to the word order of the source language can improve the adaptation performance in target languages.
arXiv Detail & Related papers (2020-01-30T03:35:44Z)