In What Languages are Generative Language Models the Most Formal? Analyzing Formality Distribution across Languages
- URL: http://arxiv.org/abs/2302.12299v1
- Date: Thu, 23 Feb 2023 19:39:52 GMT
- Title: In What Languages are Generative Language Models the Most Formal? Analyzing Formality Distribution across Languages
- Authors: Asım Ersoy, Gerson Vizcarra, Tasmiah Tahsin Mayeesha, Benjamin Muller
- Abstract summary: In this work, we focus on one language property highly influenced by culture: formality.
We analyze the formality distributions of XGLM and BLOOM's predictions, two popular generative multilingual language models, in 5 languages.
We classify 1,200 generations per language as formal, informal, or incohesive and measure the impact of the prompt formality on the predictions.
- Score: 2.457872341625575
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multilingual generative language models (LMs) are increasingly fluent in a
large variety of languages. Trained on the concatenation of corpora in multiple
languages, they enable powerful transfer from high-resource languages to
low-resource ones. However, it is still unknown what cultural biases are
induced in the predictions of these models. In this work, we focus on one
language property highly influenced by culture: formality. We analyze the
formality distributions of XGLM and BLOOM's predictions, two popular generative
multilingual language models, in 5 languages. We classify 1,200 generations per
language as formal, informal, or incohesive and measure the impact of the
prompt formality on the predictions. Overall, we observe a diversity of
behaviors across the models and languages. For instance, XGLM generates
informal text in Arabic and Bengali when conditioned with informal prompts,
much more than BLOOM. In addition, even though both models are highly biased
toward the formal style when prompted neutrally, we find that the models
generate a significant amount of informal predictions even when prompted with
formal text. We release with this work 6,000 annotated samples, paving the way
for future work on the formality of generative multilingual LMs.
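As a rough illustration of the protocol described above, the following Python sketch samples continuations from a publicly available XGLM checkpoint and tallies formality labels over the generations. It is a minimal sketch, not the authors' released code: the `facebook/xglm-564M` checkpoint and the Hugging Face `transformers` pipeline API are real, but `formality_distribution` and its `classify` argument are hypothetical stand-ins for the paper's (largely manual) annotation of generations as formal, informal, or incohesive.
```python
# Minimal sketch (not the authors' released code): sample continuations from a
# multilingual LM and tally formality labels. Assumes the Hugging Face
# `transformers` library; `classify` stands in for the paper's annotation step.
from collections import Counter
from transformers import pipeline

# XGLM (and BLOOM) checkpoints are available on the Hugging Face Hub;
# the small XGLM variant is used here only to keep the sketch lightweight.
generator = pipeline("text-generation", model="facebook/xglm-564M")

def formality_distribution(prompts, classify, n_samples=5, max_new_tokens=40):
    """Generate n_samples continuations per prompt and return the share of
    each label (e.g. formal / informal / incohesive) among all continuations."""
    counts = Counter()
    for prompt in prompts:
        outputs = generator(
            prompt,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            top_p=0.9,
            num_return_sequences=n_samples,
        )
        for out in outputs:
            # Strip the prompt prefix (approximate; tokenization may add whitespace).
            continuation = out["generated_text"][len(prompt):]
            counts[classify(continuation)] += 1
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Toy usage: a real study would replace the lambda with human annotation or a
# trained formality classifier, and would use prompts in each studied language.
print(formality_distribution(
    ["Dear Sir or Madam,"],  # example formal prompt
    classify=lambda text: "formal" if text.strip() else "incohesive",
))
```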
Related papers
- Multilingual Text-to-Image Generation Magnifies Gender Stereotypes and Prompt Engineering May Not Help You [64.74707085021858]
We show that multilingual models suffer from significant gender biases just as monolingual models do.
We propose a novel benchmark, MAGBIG, intended to foster research on gender bias in multilingual models.
Our results show that not only do models exhibit strong gender biases but they also behave differently across languages.
arXiv Detail & Related papers (2024-01-29T12:02:28Z)
- Machine Translation to Control Formality Features in the Target Language [0.9208007322096532]
This research explores how machine learning methods can be used to translate from English into languages that mark formality.
It was done by training a bilingual model in a formality-controlled setting and comparing its performance with a pre-trained multilingual model.
We evaluate the official formality accuracy (ACC) by comparing the predicted masked tokens with the ground truth.
arXiv Detail & Related papers (2023-11-22T15:42:51Z)
- The Less the Merrier? Investigating Language Representation in Multilingual Models [8.632506864465501]
We investigate the linguistic representation of different languages in multilingual models.
We observe from our experiments that community-centered models perform better at distinguishing between languages in the same family for low-resource languages.
arXiv Detail & Related papers (2023-10-20T02:26:34Z)
- Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
- Sabiá: Portuguese Large Language Models [14.801853435122908]
We show that monolingual pretraining on the target language significantly improves models already extensively trained on diverse corpora.
Few-shot evaluations on Poeta, a suite of 14 Portuguese datasets, reveal that our models outperform English-centric and multilingual counterparts by a significant margin.
arXiv Detail & Related papers (2023-04-16T20:11:19Z)
- Do Multilingual Language Models Capture Differing Moral Norms? [71.52261949766101]
Massively multilingual sentence representations are trained on large corpora of uncurated data.
This may cause the models to grasp cultural values including moral judgments from the high-resource languages.
The lack of data in certain languages can also lead the models to develop random and thus potentially harmful beliefs.
arXiv Detail & Related papers (2022-03-18T12:26:37Z)
- Language Models are Few-shot Multilingual Learners [66.11011385895195]
We evaluate the multilingual skills of the GPT and T5 models in conducting multi-class classification on non-English languages.
We show that, given a few English examples as context, pre-trained language models can predict not only English test samples but also non-English ones.
arXiv Detail & Related papers (2021-09-16T03:08:22Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
- Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages.
We infer this distribution from a sample of typologically diverse training languages.
We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z)
- Probing Multilingual Language Models for Discourse [0.0]
We find that the XLM-RoBERTa family of models consistently show the best performance.
Our results also indicate that model distillation may hurt the ability of cross-lingual transfer of sentence representations.
arXiv Detail & Related papers (2021-06-09T06:34:21Z)
- Bilingual Language Modeling, A transfer learning technique for Roman Urdu [0.0]
We show how the code-switching property of languages may be used to perform cross-lingual transfer learning from a corresponding high-resource language.
We also show how this transfer learning technique termed Bilingual Language Modeling can be used to produce better performing models for Roman Urdu.
arXiv Detail & Related papers (2021-02-22T12:56:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.