Lost in Translation: Large Language Models in Non-English Content Analysis
- URL: http://arxiv.org/abs/2306.07377v1
- Date: Mon, 12 Jun 2023 19:10:47 GMT
- Title: Lost in Translation: Large Language Models in Non-English Content Analysis
- Authors: Gabriel Nicholas and Aliya Bhatia
- Abstract summary: Large language models have become the dominant approach for building AI systems to analyze and generate language online.
Recently, researchers and technology companies have attempted to extend the capabilities of large language models into languages other than English.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, large language models (e.g., OpenAI's GPT-4, Meta's LLaMA,
Google's PaLM) have become the dominant approach for building AI systems to
analyze and generate language online. However, the automated systems that
increasingly mediate our interactions online -- such as chatbots, content
moderation systems, and search engines -- are primarily designed for and work
far more effectively in English than in the world's other 7,000 languages.
Recently, researchers and technology companies have attempted to extend the
capabilities of large language models into languages other than English by
building what are called multilingual language models.
In this paper, we explain how these multilingual language models work and
explore their capabilities and limits. Part I provides a simple technical
explanation of how large language models work, why there is a gap in available
data between English and other languages, and how multilingual language models
attempt to bridge that gap. Part II accounts for the challenges of doing
content analysis with large language models in general and multilingual
language models in particular. Part III offers recommendations for companies,
researchers, and policymakers to keep in mind when considering researching,
developing, and deploying large and multilingual language models.
Related papers
- Counterfactually Probing Language Identity in Multilingual Models [15.260518230218414]
We use AlterRep, a method of counterfactual probing, to explore the internal structure of multilingual models.
We find that, given a template in Language X, pushing the model's representation towards Language Y systematically increases the probability of Language Y words (a minimal sketch of this style of intervention appears after this list).
arXiv Detail & Related papers (2023-10-29T01:21:36Z)
- Evaluating Large Language Models on Controlled Generation Tasks [92.64781370921486]
We present an extensive analysis of various benchmarks including a sentence planning benchmark with different granularities.
After comparing large language models against state-of-the-art finetuned smaller models, we present a spectrum showing where large language models fall behind, are comparable to, or exceed the ability of smaller models.
arXiv Detail & Related papers (2023-10-23T03:48:24Z)
- The Less the Merrier? Investigating Language Representation in Multilingual Models [8.632506864465501]
We investigate the linguistic representation of different languages in multilingual models.
We observe from our experiments that community-centered models perform better at distinguishing between languages in the same family for low-resource languages.
arXiv Detail & Related papers (2023-10-20T02:26:34Z)
- PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLM) trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B.
To enhance its multilingual capabilities, we 1) integrate bilingual data into the training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training (see the data-mixing sketch after this list).
Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
arXiv Detail & Related papers (2023-07-12T09:00:37Z)
- A Technical Report for Polyglot-Ko: Open-Source Large-Scale Korean Language Models [6.907247943327277]
Polyglot is a pioneering project aimed at enhancing the non-English language performance of multilingual language models.
We introduce the Polyglot Korean models, which focus specifically on Korean rather than being multilingual in nature.
arXiv Detail & Related papers (2023-06-04T04:04:04Z)
- Lessons learned from the evaluation of Spanish Language Models [27.653133576469276]
We present a head-to-head comparison of language models for Spanish.
We argue for the need for more research to understand the factors underlying the results.
The recent activity in the development of language technology for Spanish is to be welcomed, but our results show that building language models remains an open, resource-heavy problem.
arXiv Detail & Related papers (2022-12-16T10:33:38Z)
- Overcoming Language Disparity in Online Content Classification with Multimodal Learning [22.73281502531998]
Large language models are now the standard for developing state-of-the-art solutions for text detection and classification tasks.
The development of advanced computational techniques and resources is disproportionately focused on the English language.
We explore the promise of incorporating the information contained in images via multimodal machine learning.
arXiv Detail & Related papers (2022-05-19T17:56:02Z)
- Crossing the Conversational Chasm: A Primer on Multilingual Task-Oriented Dialogue Systems [51.328224222640614]
Current state-of-the-art task-oriented dialogue (ToD) models based on large pretrained neural language models are data hungry.
Data acquisition for ToD use cases is expensive and tedious.
arXiv Detail & Related papers (2021-04-17T15:19:56Z)
- Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from the multiple language branch models into a single model for all target languages (a distillation-loss sketch follows this list).
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis (SVCCA) to study what kind of information is induced from each source (an SVCCA sketch follows this list).
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
- XPersona: Evaluating Multilingual Personalized Chatbot [76.00426517401894]
We propose a multilingual extension of Persona-Chat, namely XPersona.
Our dataset includes persona conversations in six different languages other than English for building and evaluating multilingual personalized agents.
arXiv Detail & Related papers (2020-03-17T07:52:08Z)
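The AlterRep entry above turns on a simple mechanic: nudge a model's internal representation toward another language and observe how word probabilities shift. Below is a minimal sketch of that style of linear counterfactual intervention, assuming a single probe direction w trained to separate Language X from Language Y; the actual AlterRep method uses iterative nullspace projection over many probe directions, and the function name and step size alpha here are illustrative assumptions.

```python
import numpy as np

def push_toward_language_y(h: np.ndarray, w: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Counterfactually push a hidden state toward Language Y (illustrative).

    h:     a hidden-state vector taken from some layer of the model.
    w:     a linear-probe direction whose positive side predicts Language Y.
    alpha: how far onto the Language-Y side to place the state (assumed).
    """
    w_unit = w / np.linalg.norm(w)
    # Strip the component of h that the probe reads as a language signal...
    h_neutral = h - (h @ w_unit) * w_unit
    # ...then re-insert that signal a fixed distance on the Language-Y side.
    return h_neutral + alpha * w_unit
```

Feeding the altered state through the remaining layers and comparing the probabilities of Language X versus Language Y words would reproduce the kind of measurement the summary reports.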
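The PolyLM entry gives two concrete numbers for its curriculum: non-English data is 30% of the pre-training mix in the first stage and 60% in the final stage. Here is a minimal sketch of a two-stage data-mixing sampler under those numbers; the stage boundary, batch size, and uniform sampling within each pool are illustrative assumptions, not details from the paper.

```python
import random

def non_english_share(step: int, stage_boundary: int) -> float:
    """Two-stage curriculum from the summary: 30% non-English early, 60% late."""
    return 0.30 if step < stage_boundary else 0.60

def sample_batch(step, stage_boundary, english_pool, non_english_pool, batch_size=32):
    """Draw a pre-training batch with the stage-appropriate language mix."""
    p = non_english_share(step, stage_boundary)
    return [
        random.choice(non_english_pool) if random.random() < p
        else random.choice(english_pool)
        for _ in range(batch_size)
    ]
```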
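The LBMRC entry describes amalgamating several language-branch teacher models into one student via multilingual distillation. A minimal PyTorch sketch of a multi-teacher distillation loss follows; the uniform averaging of teacher distributions, the temperature, and the mixing weight alpha are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def multi_teacher_distillation_loss(student_logits, teacher_logits_list,
                                    labels, temperature=2.0, alpha=0.5):
    """Blend a hard-label loss with KL divergence to the averaged teachers."""
    # Soft targets: average the teachers' temperature-softened distributions.
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        teacher_probs, reduction="batchmean",
    ) * temperature ** 2  # standard rescaling for tempered distillation
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```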
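The multi-view representation entry relies on singular vector canonical correlation analysis. Below is a minimal sketch of the standard SVCCA recipe (reduce each view by SVD, run CCA, average the canonical correlations); the 99% variance threshold and the use of scikit-learn's CCA solver are illustrative assumptions.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def svcca_similarity(X: np.ndarray, Y: np.ndarray, keep: float = 0.99) -> float:
    """Mean canonical correlation between SVD-reduced views X and Y.

    X, Y: (n_samples, n_features) matrices of language representations.
    keep: fraction of variance each view retains after SVD (assumed 0.99).
    """
    def reduce(M):
        M = M - M.mean(axis=0)
        U, s, _ = np.linalg.svd(M, full_matrices=False)
        ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
        k = int(np.searchsorted(ratio, keep)) + 1
        return U[:, :k] * s[:k]  # projection onto the top-k singular vectors

    Xr, Yr = reduce(X), reduce(Y)
    n = min(Xr.shape[0], Xr.shape[1], Yr.shape[1])
    cca = CCA(n_components=n, max_iter=1000).fit(Xr, Yr)
    A, B = cca.transform(Xr, Yr)
    corrs = [np.corrcoef(A[:, i], B[:, i])[0, 1] for i in range(n)]
    return float(np.mean(corrs))
```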