Problematic Tokens: Tokenizer Bias in Large Language Models
- URL: http://arxiv.org/abs/2406.11214v3
- Date: Thu, 14 Nov 2024 03:53:56 GMT
- Title: Problematic Tokens: Tokenizer Bias in Large Language Models
- Authors: Jin Yang, Zhiqiang Wang, Yanbin Lin, Zunduo Zhao
- Abstract summary: This paper traces the roots of disparities to the tokenization process inherent to large language models.
Specifically, it explores how the tokenizer's vocabulary, often used to speed up the tokenization process, inadequately represents non-English languages.
We aim to dissect the tokenization mechanics of GPT-4o, illustrating how its simplified token-handling methods amplify associated security and ethical issues.
- Score: 4.7245503050933335
- Abstract: Recent advancements in large language models (LLMs), such as GPT-4 and GPT-4o, have shown exceptional performance, especially in languages with abundant resources like English, thanks to extensive datasets that ensure robust training. Conversely, these models exhibit limitations when processing under-resourced languages such as Chinese and Korean, where issues including hallucinatory responses remain prevalent. This paper traces the roots of these disparities to the tokenization process inherent to these models. Specifically, it explores how the tokenizer's vocabulary, often used to speed up the tokenization process and reduce tokens but constructed independently of the actual model training data, inadequately represents non-English languages. This misrepresentation results in the propagation of under-trained or untrained tokens, which perpetuate biases and pose serious concerns related to data security and ethical standards. We aim to dissect the tokenization mechanics of GPT-4o, illustrating how its simplified token-handling methods amplify these risks, and we offer strategic solutions to mitigate the associated security and ethical issues. Through this study, we emphasize the critical need to rethink tokenization frameworks to foster more equitable and secure AI technologies. The code and data are available at https://github.com/yeyimilk/LLMGPT4o
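As a quick way to see the coverage gap the abstract describes, the sketch below counts how many tokens the same sentence costs in English versus Chinese. It assumes GPT-4o's publicly documented "o200k_base" encoding, accessed through the open-source tiktoken library; the sample sentences and the tokens-per-character heuristic are illustrative and not taken from the paper.

```python
# Minimal sketch (not the paper's code): compare how the encoding assumed to back
# GPT-4o ("o200k_base" in tiktoken) fragments English vs. Chinese text.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # assumption: GPT-4o's public encoding name

samples = {
    "English": "Large language models learn from extensive training data.",
    "Chinese": "大型语言模型从大量的训练数据中学习。",
}

for lang, text in samples.items():
    ids = enc.encode(text)
    # Tokens per character is a crude proxy for how well the vocabulary covers a language.
    print(f"{lang}: {len(ids)} tokens for {len(text)} characters "
          f"({len(ids) / len(text):.2f} tokens/char)")
    # Inspect the raw byte sequences that the vocabulary stores as single tokens.
    print([enc.decode_single_token_bytes(i) for i in ids])
```

Languages that need markedly more tokens per character are both costlier to process and more likely to fall back on rare tokens that saw little data during training, which is the under-training problem the paper links to security and ethical risks.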
Related papers
- Towards Linguistically-Aware and Language-Independent Tokenization for Large Language Models (LLMs) [0.09374652839580183]
This paper presents a study on the tokenization techniques employed by state-of-the-art large language models (LLMs).
The study evaluates the tokenization variability observed across these models and investigates the challenges of linguistic representation in subword tokenization.
This research aims to promote generalizable Internationalization (I18N) practices in the development of AI services in this domain and beyond.
arXiv Detail & Related papers (2024-10-04T16:18:29Z)
- SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [54.78329741186446]
We propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation.
Experiments across both in-domain and out-of-domain benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
arXiv Detail & Related papers (2024-08-28T06:33:03Z)
- Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models [4.165536532090932]
The disconnect between tokenizer creation and model training in language models allows specific inputs, such as the infamous SolidGoldMagikarp token, to induce unwanted model behaviour.
We present a comprehensive analysis of Large Language Model tokenizers, specifically targeting this issue of detecting under-trained tokens.
Through a combination of tokenizer analysis, model weight-based indicators, and prompting techniques, we develop novel and effective methods for automatically detecting these problematic tokens.
arXiv Detail & Related papers (2024-05-08T20:37:56Z)
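The Magikarp paper above combines tokenizer analysis, weight-based indicators, and prompting; the sketch below shows one simplified indicator in that spirit, flagging vocabulary entries whose input-embedding rows have unusually small norms. It is an illustrative proxy rather than the paper's exact method, and "gpt2" is only a stand-in model.

```python
# Simplified weight-based indicator for under-trained tokens (illustrative proxy,
# not the paper's exact method): flag tokens whose embedding rows have tiny norms.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; any causal LM with accessible embeddings works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

emb = model.get_input_embeddings().weight.detach()  # shape: (vocab_size, hidden_dim)
norms = emb.norm(dim=-1)

# Tokens in the bottom 0.5% of embedding norms are candidates for being under-trained.
threshold = torch.quantile(norms, 0.005)
candidate_ids = (norms < threshold).nonzero(as_tuple=True)[0]

for idx in candidate_ids[:20].tolist():
    print(idx, repr(tok.convert_ids_to_tokens(idx)), f"norm={norms[idx].item():.3f}")
```

Candidates surfaced this way would still need to be cross-checked, for example with the prompting techniques the entry mentions, before being declared problematic.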
- The first step is the hardest: Pitfalls of Representing and Tokenizing Temporal Data for Large Language Models [10.414206635385632]
Large Language Models (LLMs) have demonstrated remarkable generalization across diverse tasks.
A notable obstacle emerges when feeding numerical/temporal data into these models, such as data sourced from wearables or electronic health records.
We discuss recent works that employ LLMs for human-centric tasks such as in mobile health sensing and present a case study showing that popular LLMs tokenize temporal data incorrectly.
arXiv Detail & Related papers (2023-09-12T13:51:29Z)
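To make the temporal-data pitfall above concrete, the sketch below (ours, not the paper's case study) prints how a common BPE encoding splits a timestamp and a few sensor-style readings; the encoding name and the values are illustrative assumptions.

```python
# Sketch (not the paper's case study): show how a BPE encoding splits numeric and
# temporal strings into fragments that do not line up with their numeric structure.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # illustrative encoding choice

record = "2023-09-12 13:51:29, heart_rate=72.5, 72.55, 1005.3"
token_ids = enc.encode(record)
pieces = [enc.decode([tid]) for tid in token_ids]
print(pieces)
# Adjacent readings such as 72.5 and 72.55 can come out with different token
# boundaries, so the model never sees a stable, digit-aligned view of the values.
```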
- Meta-Learning Online Adaptation of Language Models [88.8947656843812]
Large language models encode impressively broad world knowledge in their parameters.
However, the knowledge in static language models falls out of date, limiting the model's effective "shelf life".
arXiv Detail & Related papers (2023-05-24T11:56:20Z)
- Mitigating Data Imbalance and Representation Degeneration in Multilingual Machine Translation [103.90963418039473]
Bi-ACL is a framework that uses only target-side monolingual data and a bilingual dictionary to improve the performance of multilingual neural machine translation (MNMT) models.
We show that Bi-ACL is more effective both in long-tail languages and in high-resource languages.
arXiv Detail & Related papers (2023-05-22T07:31:08Z)
- Language Contamination Explains the Cross-lingual Capabilities of English Pretrained Models [79.38278330678965]
We find that common English pretraining corpora contain significant amounts of non-English text.
This leads to hundreds of millions of foreign language tokens in large-scale datasets.
We then demonstrate that even these small percentages of non-English data facilitate cross-lingual transfer for models trained on them.
arXiv Detail & Related papers (2022-04-17T23:56:54Z)
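As a back-of-the-envelope check of the contamination finding above on one's own corpus sample, the sketch below flags documents with a high share of non-ASCII characters as likely non-English. The heuristic is ours, not the paper's methodology, and it misses non-English text written in Latin script, which is why corpus-scale studies rely on proper language identification.

```python
# Crude heuristic (not the paper's methodology): flag likely non-English documents
# in an "English" corpus sample by their share of non-ASCII characters.
def non_ascii_ratio(text: str) -> float:
    if not text:
        return 0.0
    return sum(ord(ch) > 127 for ch in text) / len(text)

corpus_sample = [
    "The quick brown fox jumps over the lazy dog.",
    "El modelo fue entrenado con datos multilingües.",  # Spanish: Latin script, mostly ASCII
    "言語モデルは多言語データから学習する。",  # Japanese: almost entirely non-ASCII
]

for doc in corpus_sample:
    flagged = non_ascii_ratio(doc) > 0.3
    print(f"{'NON-ENGLISH?' if flagged else 'english-ish '} {doc}")
# Note how the Spanish line slips through, which is why real corpus audits use
# language-identification models rather than character-level heuristics.
```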
- TaCL: Improving BERT Pre-training with Token-aware Contrastive Learning [19.682704309037653]
Masked language models (MLMs) have revolutionized the field of Natural Language Understanding.
We propose TaCL (Token-aware Contrastive Learning), a novel continual pre-training approach that encourages BERT to learn an isotropic and discriminative distribution of token representations.
arXiv Detail & Related papers (2021-11-07T22:54:23Z)
- Revisiting Self-Training for Few-Shot Learning of Language Model [61.173976954360334]
Unlabeled data carry rich task-relevant information and have proven useful for few-shot learning of language models.
In this work, we revisit the self-training technique for language model fine-tuning and present a state-of-the-art prompt-based few-shot learner, SFLM.
arXiv Detail & Related papers (2021-10-04T08:51:36Z)
- When Does Translation Require Context? A Data-driven, Multilingual Exploration [71.43817945875433]
Proper handling of discourse significantly contributes to the quality of machine translation (MT).
Recent works in context-aware MT attempt to target a small set of discourse phenomena during evaluation.
We develop the Multilingual Discourse-Aware benchmark, a series of taggers that identify and evaluate model performance on discourse phenomena.
arXiv Detail & Related papers (2021-09-15T17:29:30Z)
- Crowdsourced Phrase-Based Tokenization for Low-Resourced Neural Machine Translation: The Case of Fon Language [0.015863809575305417]
We introduce Word-Expressions-Based (WEB) tokenization, a human-involved super-words tokenization strategy to create a better representative vocabulary for training.
We compare our tokenization strategy to others on the Fon-French and French-Fon translation tasks.
arXiv Detail & Related papers (2021-03-14T22:12:14Z)