Geographical Erasure in Language Generation
- URL: http://arxiv.org/abs/2310.14777v1
- Date: Mon, 23 Oct 2023 10:26:14 GMT
- Title: Geographical Erasure in Language Generation
- Authors: Pola Schwöbel, Jacek Golebiowski, Michele Donini, Cédric Archambeau, Danish Pruthi
- Abstract summary: We study and operationalise a form of geographical erasure, wherein language models underpredict certain countries.
We discover that erasure strongly correlates with low frequencies of country mentions in the training corpus.
We mitigate erasure by finetuning using a custom objective.
- Score: 13.219867587151986
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) encode vast amounts of world knowledge. However,
since these models are trained on large swaths of internet data, they are at
risk of inordinately capturing information about dominant groups. This
imbalance can propagate into generated language. In this work, we study and
operationalise a form of geographical erasure, wherein language models
underpredict certain countries. We demonstrate consistent instances of erasure
across a range of LLMs. We discover that erasure strongly correlates with low
frequencies of country mentions in the training corpus. Lastly, we mitigate
erasure by finetuning using a custom objective.
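The abstract outlines the methodology at a high level: score how strongly a model predicts each country, compare against real-world prevalence, and finetune to close the gap. The sketch below illustrates one plausible way to quantify such underprediction with a HuggingFace causal LM; the model name, prompt, toy population figures, and the KL-style summary are illustrative assumptions, not the paper's exact operationalisation.

```python
# Minimal sketch (assumptions, not the paper's exact objective): probe how much a
# causal LM underpredicts countries relative to their population share.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # assumption: any HuggingFace causal LM could be substituted
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

prompt = "I live in"  # illustrative prompt eliciting country mentions

# Toy reference distribution over a small country subset
# (rounded population in millions; for illustration only).
population_millions = {"India": 1400, "United States": 330, "Nigeria": 220,
                       "Germany": 84, "Madagascar": 29}
total = sum(population_millions.values())
reference = {c: v / total for c, v in population_millions.items()}

@torch.no_grad()
def continuation_logprob(prefix: str, continuation: str) -> float:
    """Sum of token log-probabilities of `continuation` given `prefix`.
    Assumes the prefix tokenisation is a prefix of the full tokenisation,
    which holds for GPT-2 when the continuation starts with a space."""
    prefix_len = tokenizer(prefix, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prefix + continuation, return_tensors="pt").input_ids
    logprobs = model(full_ids).logits.log_softmax(dim=-1)
    total_lp = 0.0
    for pos in range(prefix_len, full_ids.shape[1]):
        # the token at position `pos` is predicted from position `pos - 1`
        total_lp += logprobs[0, pos - 1, full_ids[0, pos]].item()
    return total_lp

# Model-implied distribution over the same country subset, renormalised.
raw = {c: math.exp(continuation_logprob(prompt, " " + c)) for c in reference}
z = sum(raw.values())
model_dist = {c: s / z for c, s in raw.items()}

# One plausible erasure summary: KL-style mass contributed by countries to which
# the model assigns less probability than the reference distribution does.
erasure = sum(reference[c] * math.log(reference[c] / model_dist[c])
              for c in reference if model_dist[c] < reference[c])
print({c: round(p, 3) for c, p in model_dist.items()})
print("erasure score:", round(erasure, 3))
```

Under the same assumptions, the reported mitigation could be read as finetuning with a loss that adds such a divergence term to the standard language-modelling objective, although the abstract does not spell out the custom objective.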
Related papers
- Enhancing Code Generation for Low-Resource Languages: No Silver Bullet [55.39571645315926]
Large Language Models (LLMs) rely on large and diverse datasets to learn syntax, semantics, and usage patterns of programming languages.
For low-resource languages, the limited availability of such data hampers the models' ability to generalize effectively.
We present an empirical study investigating the effectiveness of several approaches for boosting LLMs' performance on low-resource languages.
arXiv Detail & Related papers (2025-01-31T12:23:28Z)
- Randomly Sampled Language Reasoning Problems Reveal Limits of LLMs [8.146860674148044]
We attempt to measure models' language understanding capacity while circumventing the risk of dataset recall.
We parameterize large families of language tasks recognized by deterministic finite automata (DFAs).
We find that, even in the strikingly simple setting of 3-state DFAs, LLMs underperform unparameterized n-gram models on both language recognition and synthesis tasks.
arXiv Detail & Related papers (2025-01-06T07:57:51Z)
- Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora.
But can these models relate corresponding concepts across languages, effectively being crosslingual?
This study evaluates six state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z)
- On the Scaling Laws of Geographical Representation in Language Models [0.11510009152620666]
We show that geographical knowledge is observable even for tiny models, and that it scales consistently as we increase the model size.
Notably, we observe that larger language models cannot mitigate the geographical bias that is inherent to the training data.
arXiv Detail & Related papers (2024-02-29T18:04:11Z)
- Paloma: A Benchmark for Evaluating Language Model Fit [112.481957296585]
Evaluations of language models (LMs) commonly report perplexity on monolithic data held out from training.
We introduce Perplexity Analysis for Language Model Assessment (Paloma), a benchmark to measure LM fit to 546 English and code domains.
arXiv Detail & Related papers (2023-12-16T19:12:45Z)
- The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics [74.99898531299148]
This research examines vocabulary trimming (VT), inspired by restricting embedding entries to the language of interest, to bolster time and memory efficiency.
We apply two language heuristics to trim the full vocabulary (Unicode-based script filtering and corpus-based selection) across different language families and model sizes; a minimal sketch of script-based filtering appears after this list.
It is found that VT reduces the memory usage of small models by nearly 50% and yields up to a 25% improvement in generation speed.
arXiv Detail & Related papers (2023-11-16T09:35:50Z)
- Geographic and Geopolitical Biases of Language Models [43.62238334380897]
We propose an approach to study the geographic bias (and knowledge) present in pretrained language models (PLMs).
Our findings suggest PLMs' representations map surprisingly well to the physical world in terms of country-to-country associations.
Last, we explain how large PLMs, despite exhibiting notions of geographical proximity, over-amplify geopolitical favouritism at inference time.
arXiv Detail & Related papers (2022-12-20T16:32:54Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Do Language Models Know the Way to Rome? [4.344337854565144]
We exploit the fact that in geography, ground truths are available beyond local relations.
We find that language models generally encode limited geographic information, but with larger models performing the best.
arXiv Detail & Related papers (2021-09-16T13:28:16Z)
- Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages.
We infer this distribution from a sample of typologically diverse training languages.
We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z)
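As referenced in the vocabulary-trimming entry above, the following is a minimal sketch of Unicode-based script filtering over a tokenizer vocabulary. The tokenizer name, the unicodedata-based script lookup, and the keep_token helper are illustrative assumptions rather than the cited paper's implementation; corpus-based selection and the subsequent embedding-matrix surgery are omitted.

```python
# Minimal sketch of Unicode-script-based vocabulary trimming (illustrative only).
import unicodedata
from transformers import AutoTokenizer

def char_script(ch: str) -> str:
    """First word of the Unicode character name, e.g. 'LATIN', 'CYRILLIC'."""
    name = unicodedata.name(ch, "")
    return name.split()[0] if name else ""

def keep_token(token: str, allowed_scripts: set) -> bool:
    """Keep a token if every alphabetic character belongs to an allowed script."""
    return all(char_script(ch) in allowed_scripts for ch in token if ch.isalpha())

# Assumption: a multilingual tokenizer whose vocabulary mixes scripts.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
vocab = tokenizer.get_vocab()  # maps token string -> id

allowed = {"LATIN"}  # e.g. trim down to Latin-script languages
kept = {tok: idx for tok, idx in vocab.items() if keep_token(tok, allowed)}
print(f"kept {len(kept)} of {len(vocab)} vocabulary entries")
# A real trimming step would also rebuild the input/output embedding matrices so
# they only contain rows for the kept ids; that model surgery is omitted here.
```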
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.