Do All Languages Cost the Same? Tokenization in the Era of Commercial
Language Models
- URL: http://arxiv.org/abs/2305.13707v1
- Date: Tue, 23 May 2023 05:46:45 GMT
- Title: Do All Languages Cost the Same? Tokenization in the Era of Commercial
Language Models
- Authors: Orevaoghene Ahia, Sachin Kumar, Hila Gonen, Jungo Kasai, David R.
Mortensen, Noah A. Smith, Yulia Tsvetkov
- Abstract summary: API vendors charge their users based on usage, more specifically on the number of tokens'' processed or generated by the underlying language models.
What constitutes a token, however, is training data and model dependent with a large variance in the number of tokens required to convey the same information in different languages.
We conduct a systematic analysis of the cost and utility of OpenAI's language model API on multilingual benchmarks in 22 typologically diverse languages.
- Score: 68.29126169579132
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models have graduated from being research prototypes to
commercialized products offered as web APIs, and recent works have highlighted
the multilingual capabilities of these products. The API vendors charge their
users based on usage, more specifically on the number of ``tokens'' processed
or generated by the underlying language models. What constitutes a token,
however, is training data and model dependent with a large variance in the
number of tokens required to convey the same information in different
languages. In this work, we analyze the effect of this non-uniformity on the
fairness of an API's pricing policy across languages. We conduct a systematic
analysis of the cost and utility of OpenAI's language model API on multilingual
benchmarks in 22 typologically diverse languages. We show evidence that
speakers of a large number of the supported languages are overcharged while
obtaining poorer results. These speakers tend to also come from regions where
the APIs are less affordable to begin with. Through these analyses, we aim to
increase transparency around language model APIs' pricing policies and
encourage the vendors to make them more equitable.
Related papers
- Towards Building an End-to-End Multilingual Automatic Lyrics Transcription Model [14.39119862985503]
We aim to create a multilingual ALT system with available datasets.
Inspired by architectures that have been proven effective for English ALT, we adapt these techniques to the multilingual scenario.
We evaluate the performance of the multilingual model in comparison to its monolingual counterparts.
arXiv Detail & Related papers (2024-06-25T15:02:32Z) - The Less the Merrier? Investigating Language Representation in
Multilingual Models [8.632506864465501]
We investigate the linguistic representation of different languages in multilingual models.
We observe from our experiments that community-centered models perform better at distinguishing between languages in the same family for low-resource languages.
arXiv Detail & Related papers (2023-10-20T02:26:34Z) - Language Model Tokenizers Introduce Unfairness Between Languages [98.92630681729518]
We show how disparity in the treatment of different languages arises at the tokenization stage, well before a model is even invoked.
Character-level and byte-level models also exhibit over 4 times the difference in the encoding length for some language pairs.
We make the case that we should train future language models using multilingually fair subword tokenizers.
arXiv Detail & Related papers (2023-05-17T14:17:57Z) - Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently-proposed Raw Audio-Visual Speechs (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that: (1) multi-lingual models with more data outperform monolingual ones, but, when keeping the amount of data fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z) - cViL: Cross-Lingual Training of Vision-Language Models using Knowledge
Distillation [6.381149074212897]
We propose a pipeline that utilizes English-only vision-language models to train a monolingual model for a target language.
We release a large-scale visual question answering dataset in Japanese and Hindi language.
Our pipeline outperforms the current state-of-the-art models by a relative increase of 4.4% and 13.4% respectively in accuracy.
arXiv Detail & Related papers (2022-06-07T14:46:30Z) - Exploring Teacher-Student Learning Approach for Multi-lingual
Speech-to-Intent Classification [73.5497360800395]
We develop an end-to-end system that supports multiple languages.
We exploit knowledge from a pre-trained multi-lingual natural language processing model.
arXiv Detail & Related papers (2021-09-28T04:43:11Z) - Cross-lingual Emotion Detection [6.767035411834297]
We consider English as the source language with Arabic and Spanish as target languages.
Our BERT-based monolingual models that are trained on target language data surpass state-of-the-art (SOTA) by 4% and 5% absolute Jaccard score for Arabic and Spanish respectively.
Next, we show that using cross-lingual approaches with English data alone, we can achieve more than 90% and 80% relative effectiveness of the Arabic and Spanish BERT models respectively.
arXiv Detail & Related papers (2021-06-10T19:52:06Z) - Cross-lingual Machine Reading Comprehension with Language Branch
Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-source languages.
We propose a novel augmentation approach named Language Branch Machine Reading (LBMRC)
LBMRC trains multiple machine reading comprehension (MRC) models proficient in individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z) - Vokenization: Improving Language Understanding with Contextualized,
Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
"vokenization" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
arXiv Detail & Related papers (2020-10-14T02:11:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.