Towards Linguistically-Aware and Language-Independent Tokenization for Large Language Models (LLMs)
- URL: http://arxiv.org/abs/2410.03568v1
- Date: Fri, 4 Oct 2024 16:18:29 GMT
- Title: Towards Linguistically-Aware and Language-Independent Tokenization for Large Language Models (LLMs)
- Authors: Abrar Rahman, Garry Bowlin, Binit Mohanty, Sean McGunigal
- Abstract summary: This paper presents a study on the tokenization techniques employed by state-of-the-art large language models (LLMs).
The study evaluates the tokenization variability observed across these models and investigates the challenges of linguistic representation in subword tokenization.
This research aims to promote generalizable Internationalization (I18N) practices in the development of AI services in this domain and beyond.
- Score: 0.09374652839580183
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: This paper presents a comprehensive study on the tokenization techniques employed by state-of-the-art large language models (LLMs) and their implications on the cost and availability of services across different languages, especially low resource languages. The analysis considers multiple LLMs, including GPT-4 (using cl100k_base embeddings), GPT-3 (with p50k_base embeddings), and DaVinci (employing r50k_base embeddings), as well as the widely used BERT base tokenizer. The study evaluates the tokenization variability observed across these models and investigates the challenges of linguistic representation in subword tokenization. The research underscores the importance of fostering linguistically-aware development practices, especially for languages that are traditionally under-resourced. Moreover, this paper introduces case studies that highlight the real-world implications of tokenization choices, particularly in the context of electronic health record (EHR) systems. This research aims to promote generalizable Internationalization (I18N) practices in the development of AI services in this domain and beyond, with a strong emphasis on inclusivity, particularly for languages traditionally underrepresented in AI applications.
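As a concrete illustration of the variability discussed above, the short sketch below (not code from the paper) counts the tokens that the cl100k_base, p50k_base, and r50k_base encodings and a BERT WordPiece tokenizer produce for the same sentence in a few languages. It assumes the tiktoken and transformers packages; the sample sentences and the bert-base-uncased checkpoint are illustrative stand-ins, not the paper's actual evaluation data.
```python
# Minimal sketch of a cross-tokenizer, cross-language token-count comparison.
# Assumes `tiktoken` and `transformers` are installed; sentences are rough
# translations chosen for illustration only.
import tiktoken
from transformers import AutoTokenizer

samples = {
    "English": "The patient was discharged with a prescription for antibiotics.",
    "Bengali": "রোগীকে অ্যান্টিবায়োটিকের প্রেসক্রিপশন দিয়ে ছেড়ে দেওয়া হয়েছিল।",
    "Swahili": "Mgonjwa aliruhusiwa kuondoka na maagizo ya dawa za antibiotiki.",
}

# Byte-level BPE encodings used by GPT-4, GPT-3, and DaVinci respectively.
encodings = {name: tiktoken.get_encoding(name)
             for name in ("cl100k_base", "p50k_base", "r50k_base")}

# Assumed checkpoint for "the BERT base tokenizer"; note that for scripts
# outside its vocabulary it collapses whole words to [UNK], so a low count
# here signals lost information rather than efficient tokenization.
bert = AutoTokenizer.from_pretrained("bert-base-uncased")

for lang, text in samples.items():
    counts = {name: len(enc.encode(text)) for name, enc in encodings.items()}
    counts["bert-base"] = len(bert.tokenize(text))
    print(lang, counts)
```
Because most API pricing is per token, languages whose scripts fragment into many byte-level tokens under these vocabularies typically cost several times more per sentence than English, which is the cost disparity the paper examines.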
Related papers
- Evaluating Tokenizer Performance of Large Language Models Across Official Indian Languages [0.0]
This paper presents a comprehensive evaluation of tokenizers used by 12 Large Language Models (LLMs) across all 22 official languages of India.
The SUTRA tokenizer outperforms all other models, including several Indic-specific models, excelling in 14 languages.
This study underscores the critical importance of developing targeted tokenization strategies for multilingual and Indic-centric models; a minimal sketch of the kind of per-language fertility comparison such evaluations rely on appears after this list.
arXiv Detail & Related papers (2024-11-19T05:37:17Z)
- Responsible Multilingual Large Language Models: A Survey of Development, Applications, and Societal Impact [5.803667039914564]
This work bridges the gap by providing an end-to-end framework for developing and deploying MLLMs in production environments.
Our findings reveal critical challenges in supporting linguistic diversity, with 88.38% of world languages categorized as low-resource.
This survey provides essential guidance for practitioners and researchers working to develop more inclusive and effective multilingual AI systems.
arXiv Detail & Related papers (2024-10-23T03:19:15Z)
- Qtok: A Comprehensive Framework for Evaluating Multilingual Tokenizer Quality in Large Language Models [0.0]
The quality of tokenization can significantly impact a model's ability to handle diverse languages effectively.
We introduce Qtok, a tool designed to assess tokenizer quality with a specific emphasis on their performance in multilingual contexts.
Qtok applies a suite of quality metrics to evaluate 13 distinct tokenizers from 58 publicly available models, analyzing their output across different linguistic contexts.
arXiv Detail & Related papers (2024-10-16T19:34:34Z)
- EMMA-500: Enhancing Massively Multilingual Adaptation of Large Language Models [50.459861376459656]
EMMA-500 is a large-scale multilingual language model continue-trained on texts across 546 languages.
Our results highlight the effectiveness of continual pre-training in expanding large language models' language capacity.
arXiv Detail & Related papers (2024-09-26T14:40:45Z)
- Prompting Encoder Models for Zero-Shot Classification: A Cross-Domain Study in Italian [75.94354349994576]
This paper explores the feasibility of employing smaller, domain-specific encoder LMs alongside prompting techniques to enhance performance in specialized contexts.
Our study concentrates on the Italian bureaucratic and legal language, experimenting with both general-purpose and further pre-trained encoder-only models.
The results indicate that while further pre-trained models may show diminished robustness in general knowledge, they exhibit superior adaptability for domain-specific tasks, even in a zero-shot setting.
arXiv Detail & Related papers (2024-07-30T08:50:16Z)
- Problematic Tokens: Tokenizer Bias in Large Language Models [4.7245503050933335]
This paper traces the roots of disparities to the tokenization process inherent to large language models.
Specifically, it explores how the tokenizer's vocabulary, often used to speed up the tokenization process, inadequately represents non-English languages.
We aim to dissect the tokenization mechanics of GPT-4o, illustrating how its simplified token-handling methods amplify associated security and ethical issues.
arXiv Detail & Related papers (2024-06-17T05:13:25Z)
- MultiPragEval: Multilingual Pragmatic Evaluation of Large Language Models [0.5822010906632046]
This study introduces MultiPragEval, the first multilingual pragmatic evaluation of Large Language Models (LLMs).
Comprising 1200 question units categorized according to Grice's Cooperative Principle, MultiPragEval enables an in-depth assessment of LLMs' contextual awareness and their ability to infer implied meanings.
Our findings demonstrate that Claude3-Opus significantly outperforms other models in all tested languages, establishing a state-of-the-art in the field.
arXiv Detail & Related papers (2024-06-11T21:46:03Z)
- No Language is an Island: Unifying Chinese and English in Financial Large Language Models, Instruction Data, and Benchmarks [75.29561463156635]
ICE-PIXIU uniquely integrates a spectrum of Chinese tasks, alongside translated and original English datasets.
It provides unrestricted access to diverse model variants, a compilation of diverse cross-lingual and multi-modal instruction data, and an evaluation benchmark with expert annotations.
arXiv Detail & Related papers (2024-03-10T16:22:20Z)
- Domain Specialization as the Key to Make Large Language Models Disruptive: A Comprehensive Survey [100.24095818099522]
Large language models (LLMs) have significantly advanced the field of natural language processing (NLP).
They provide a highly useful, task-agnostic foundation for a wide range of applications.
However, directly applying LLMs to solve sophisticated problems in specific domains meets many hurdles.
arXiv Detail & Related papers (2023-05-30T03:00:30Z)
- Cross-lingual Lifelong Learning [53.06904052325966]
We present a principled Cross-lingual Continual Learning (CCL) evaluation paradigm.
We provide insights into what makes multilingual sequential learning particularly challenging.
The implications of this analysis include a recipe for how to measure and balance different cross-lingual continual learning desiderata.
arXiv Detail & Related papers (2022-05-23T09:25:43Z)
- AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
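Several of the tokenizer-focused papers above (the Indian-language evaluation and Qtok in particular) compare tokenizers across languages using per-language statistics. The sketch below is a generic illustration of one such statistic, fertility (average subword tokens per whitespace-delimited word); it is not taken from any of the listed papers, and the checkpoints and sample sentences are illustrative assumptions.
```python
# Minimal fertility-metric sketch: average subword tokens per whitespace word.
# Assumes `transformers` is installed; checkpoints and sentences are
# illustrative, not drawn from the papers above.
from transformers import AutoTokenizer

samples = {
    "English": "The committee approved the new healthcare policy yesterday.",
    "Hindi": "समिति ने कल नई स्वास्थ्य नीति को मंज़ूरी दी।",
    "Tamil": "குழு நேற்று புதிய சுகாதாரக் கொள்கையை அங்கீகரித்தது.",
}

def fertility(tokenizer, text: str) -> float:
    """Average number of subword tokens per whitespace-delimited word."""
    words = text.split()
    return len(tokenizer.tokenize(text)) / max(len(words), 1)

for checkpoint in ("bert-base-multilingual-cased", "xlm-roberta-base"):
    tok = AutoTokenizer.from_pretrained(checkpoint)
    scores = {lang: round(fertility(tok, text), 2) for lang, text in samples.items()}
    print(checkpoint, scores)
```
A fertility close to 1 means the tokenizer mostly keeps words intact for that language; values of 3 or more usually indicate heavy fragmentation and, for generative APIs, proportionally higher per-request cost.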