Better Than Whitespace: Information Retrieval for Languages without
Custom Tokenizers
- URL: http://arxiv.org/abs/2210.05481v1
- Date: Tue, 11 Oct 2022 14:32:46 GMT
- Title: Better Than Whitespace: Information Retrieval for Languages without
Custom Tokenizers
- Authors: Odunayo Ogundepo, Xinyu Zhang, and Jimmy Lin
- Abstract summary: We propose a new approach to tokenization for lexical matching retrieval algorithms.
We use the WordPiece tokenizer, which can be built automatically from unsupervised data.
Results show that the mBERT tokenizer provides strong relevance signals for retrieval "out of the box", outperforming whitespace tokenization on most languages.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tokenization is a crucial step in information retrieval, especially for
lexical matching algorithms, where the quality of indexable tokens directly
impacts the effectiveness of a retrieval system. Since different languages have
unique properties, the design of the tokenization algorithm is usually
language-specific and requires at least some linguistic knowledge. However, only
a handful of the 7000+ languages on the planet benefit from specialized,
custom-built tokenization algorithms, while the other languages are stuck with
a "default" whitespace tokenizer, which cannot capture the intricacies of
different languages. To address this challenge, we propose a different approach
to tokenization for lexical matching retrieval algorithms (e.g., BM25): using
the WordPiece tokenizer, which can be built automatically from unsupervised
data. We test the approach on 11 typologically diverse languages in the Mr. TyDi
collection: results show that the mBERT tokenizer provides strong relevance
signals for retrieval "out of the box", outperforming whitespace tokenization
on most languages. In many cases, our approach also improves retrieval
effectiveness when combined with existing custom-built tokenizers.
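As a minimal sketch of the proposed approach, the snippet below tokenizes documents and queries with mBERT's WordPiece tokenizer and scores them with BM25. It assumes the HuggingFace `transformers` and `rank_bm25` packages; the paper's experiments use an Anserini/Pyserini BM25 setup, so this illustrates the idea rather than reproducing the authors' pipeline.

```python
# Sketch: swap whitespace tokenization for mBERT's WordPiece in a BM25 pipeline.
from transformers import AutoTokenizer
from rank_bm25 import BM25Okapi

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def wordpiece_tokens(text: str) -> list[str]:
    # `tokenize` applies mBERT's WordPiece segmentation; subword pieces
    # (e.g., "##ing") become indexable terms, no language-specific rules needed.
    return tokenizer.tokenize(text)

corpus = [
    "Tokenization is a crucial step in information retrieval.",
    "Whitespace tokenization cannot capture the intricacies of many languages.",
]
bm25 = BM25Okapi([wordpiece_tokens(doc) for doc in corpus])
scores = bm25.get_scores(wordpiece_tokens("tokenization for retrieval"))
print(scores)  # higher score = stronger lexical match under WordPiece terms
```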
Related papers
- Egalitarian Language Representation in Language Models: It All Begins with Tokenizers
We show that not all tokenizers offer fair representation for complex script languages such as Tamil, Sinhala, and Hindi.
We introduce an improvement to the Byte Pair algorithm by incorporating graphemes, which we term Grapheme Pair.
Our experiments show that grapheme-based character extraction outperforms byte-level tokenizers for complex scripts.
arXiv Detail & Related papers (2024-09-17T19:05:37Z)
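A minimal sketch of the grapheme extraction step underlying the Grapheme Pair idea above, assuming the third-party `regex` package (the paper's own algorithm is not shown, only the grapheme-cluster segmentation it builds on):

```python
# Segment text into extended grapheme clusters (user-perceived characters)
# rather than raw bytes or codepoints before pairing.
import regex

def graphemes(text: str) -> list[str]:
    # \X matches one extended grapheme cluster, so a base character plus its
    # combining marks stays together (important for Tamil, Sinhala, Hindi).
    return regex.findall(r"\X", text)

print(graphemes("நன்றி"))  # Tamil: vowel signs stay attached to consonants
print(list("நன்றி"))       # codepoint-level view splits the same word apart
```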
- Constructing a BPE Tokenization DFA
Many natural language processing systems operate over tokenizations of text to address the open-vocabulary problem.
We give and analyze an algorithm for the efficient construction of deterministic finite automata designed to operate directly on tokenizations produced by the popular byte pair encoding technique.
arXiv Detail & Related papers (2024-05-13T11:59:24Z)
- How do different tokenizers perform on downstream tasks in scriptio continua languages?: A case study in Japanese
This paper investigates the effect of tokenizers on the downstream performance of pretrained language models (PLMs) in scriptio continua languages where no explicit spaces exist between words.
The tokenizer for such languages often consists of a morphological analyzer and a subword tokenizer, requiring us to conduct a comprehensive study of all possible pairs.
We train extensive sets of tokenizers, build a PLM using each, and measure the downstream performance on a wide range of tasks.
arXiv Detail & Related papers (2023-06-16T01:22:32Z)
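A minimal sketch of the two-stage pipeline described above (morphological analyzer followed by a subword tokenizer), assuming the `fugashi` MeCab wrapper with a unidic dictionary and a HuggingFace tokenizer; the paper evaluates many analyzer/subword pairs, and this shows just one illustrative combination:

```python
# Stage 1: morphological analysis splits scriptio continua text into words.
# Stage 2: a subword tokenizer splits those words further.
from fugashi import Tagger            # requires a MeCab dictionary, e.g. unidic-lite
from transformers import AutoTokenizer

tagger = Tagger()
subword = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def tokenize_ja(text: str) -> list[str]:
    words = [token.surface for token in tagger(text)]  # stage 1: words
    pieces = []
    for word in words:
        pieces.extend(subword.tokenize(word))          # stage 2: subwords
    return pieces

print(tokenize_ja("日本語の文には単語間に空白がない"))
```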
- A Vocabulary-Free Multilingual Neural Tokenizer for End-to-End Task Learning
Subword tokenization is a commonly used input pre-processing step in most recent NLP models.
We propose a vocabulary-free neural tokenizer by distilling segmentation information from subword tokenization.
Our tokenizer consistently improves performance on multilingual (NLI) and code-switching (sentiment analysis) tasks.
arXiv Detail & Related papers (2022-04-22T16:50:49Z)
- Improving Tokenisation by Alternative Treatment of Spaces
We experiment with an alternative tokenisation approach where spaces are always treated as individual tokens.
We find that our modified algorithms lead to improved performance on downstream NLP tasks.
arXiv Detail & Related papers (2022-04-08T13:22:30Z)
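A minimal sketch of the space-as-token idea above; `split_keep_spaces` is a hypothetical helper, not the modified tokenisation algorithms from the paper:

```python
# Pre-tokenizer that emits each space as its own token instead of
# using it only as a delimiter to be discarded.
import re

def split_keep_spaces(text: str) -> list[str]:
    # The capture group keeps the single-space separators in the output list.
    return [tok for tok in re.split(r"( )", text) if tok]

print(split_keep_spaces("always treat spaces as tokens"))
# ['always', ' ', 'treat', ' ', 'spaces', ' ', 'as', ' ', 'tokens']
```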
- Zero-Shot Cross-lingual Semantic Parsing
We study cross-lingual semantic parsing as a zero-shot problem without parallel data for 7 test languages.
We propose a multi-task encoder-decoder model to transfer parsing knowledge to additional languages using only English-Logical form paired data.
Our system frames zero-shot parsing as a latent-space alignment problem and finds that pre-trained models can be improved to generate logical forms with minimal cross-lingual transfer penalty.
arXiv Detail & Related papers (2021-04-15T16:08:43Z)
- Multi-view Subword Regularization
Multi-view Subword Regularization (MVR) enforces consistency between predictions on inputs tokenized by the standard deterministic segmentation and predictions on inputs tokenized by sampled probabilistic segmentations.
Results on the XTREME multilingual benchmark show that MVR brings consistent improvements of up to 2.5 points over using standard segmentation algorithms.
arXiv Detail & Related papers (2021-03-15T16:07:42Z)
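A minimal sketch of the consistency term behind MVR above, assuming PyTorch; the shapes and the symmetric-KL choice are illustrative, not the paper's exact formulation:

```python
# Penalize divergence between model predictions on a deterministically
# segmented input and on a probabilistically sampled segmentation.
import torch
import torch.nn.functional as F

def mvr_consistency_loss(logits_standard: torch.Tensor,
                         logits_sampled: torch.Tensor) -> torch.Tensor:
    # Symmetric KL between the two predictive distributions.
    p = F.log_softmax(logits_standard, dim=-1)
    q = F.log_softmax(logits_sampled, dim=-1)
    kl_pq = F.kl_div(q, p.exp(), reduction="batchmean")  # KL(p || q)
    kl_qp = F.kl_div(p, q.exp(), reduction="batchmean")  # KL(q || p)
    return 0.5 * (kl_pq + kl_qp)

loss = mvr_consistency_loss(torch.randn(8, 5), torch.randn(8, 5))
```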
- Intrinsic Probing through Dimension Selection
Most modern NLP systems make use of pre-trained contextual representations that attain astonishingly high performance on a variety of tasks.
Such high performance should not be possible unless some form of linguistic structure inheres in these representations, and a wealth of research has sprung up on probing for it.
In this paper, we draw a distinction between intrinsic probing, which examines how linguistic information is structured within a representation, and the extrinsic probing popular in prior work, which only argues for the presence of such information by showing that it can be successfully extracted.
arXiv Detail & Related papers (2020-10-06T15:21:08Z)
- Inducing Language-Agnostic Multilingual Representations
Cross-lingual representations have the potential to make NLP techniques available to the vast majority of languages in the world.
We examine three approaches for this: (i) re-aligning the vector spaces of target languages to a pivot source language; (ii) removing language-specific means and variances, which yields better discriminativeness of embeddings as a by-product; and (iii) increasing input similarity across languages by removing morphological contractions and sentence reordering.
arXiv Detail & Related papers (2020-08-20T17:58:56Z)
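A minimal sketch of approach (ii) above, removing per-language means and variances from sentence embeddings; `embeddings_by_lang` is a hypothetical input layout, not the paper's data:

```python
# Standardize embeddings per language so the remaining variation
# is more language-agnostic.
import numpy as np

def standardize_per_language(embeddings_by_lang: dict) -> dict:
    out = {}
    for lang, emb in embeddings_by_lang.items():
        mu = emb.mean(axis=0, keepdims=True)           # language-specific mean
        sigma = emb.std(axis=0, keepdims=True) + 1e-8  # language-specific std
        out[lang] = (emb - mu) / sigma
    return out

embs = {"en": np.random.randn(100, 768), "de": np.random.randn(100, 768)}
aligned = standardize_per_language(embs)
```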
- On the Importance of Word Order Information in Cross-lingual Sequence Labeling
Cross-lingual models fit to the word order of the source language may fail to generalize to target languages with different word orders.
We investigate whether making models insensitive to the word order of the source language can improve the adaptation performance in target languages.
arXiv Detail & Related papers (2020-01-30T03:35:44Z)