Does Liking Yellow Imply Driving a School Bus? Semantic Leakage in Language Models
- URL: http://arxiv.org/abs/2408.06518v2
- Date: Thu, 12 Sep 2024 18:33:33 GMT
- Title: Does Liking Yellow Imply Driving a School Bus? Semantic Leakage in Language Models
- Authors: Hila Gonen, Terra Blevins, Alisa Liu, Luke Zettlemoyer, Noah A. Smith
- Abstract summary: We identify and characterize a phenomenon never discussed before, where models leak irrelevant information from the prompt into the generation in unexpected ways.
We propose an evaluation setting to detect semantic leakage both by humans and automatically, curate a diverse test suite for diagnosing this behavior, and measure significant semantic leakage in 13 flagship models.
- Score: 113.58052868898173
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite their wide adoption, the biases and unintended behaviors of language models remain poorly understood. In this paper, we identify and characterize a phenomenon never discussed before, which we call semantic leakage, where models leak irrelevant information from the prompt into the generation in unexpected ways. We propose an evaluation setting to detect semantic leakage both by humans and automatically, curate a diverse test suite for diagnosing this behavior, and measure significant semantic leakage in 13 flagship models. We also show that models exhibit semantic leakage in languages besides English and across different settings and generation scenarios. This discovery highlights yet another type of bias in language models that affects their generation patterns and behavior.
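The abstract mentions detecting semantic leakage automatically but does not say how. The following is a minimal sketch of one way such a check could be set up, not the authors' method: it compares a control generation (prompt without the concept) against a test generation (prompt with the concept) and asks which is semantically closer to the concept. The names `token_overlap` and `leak_rate`, the tie-handling rule, and the toy word-overlap similarity (a stand-in for an embedding-based similarity) are all assumptions made for illustration.

```python
from typing import Callable, Iterable, Tuple

# (concept, control_generation, test_generation); this triple layout is an assumption.
Example = Tuple[str, str, str]


def token_overlap(a: str, b: str) -> float:
    """Toy similarity: Jaccard overlap of lowercase word sets.

    Stand-in for a real semantic similarity (e.g. embedding cosine).
    """
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def leak_rate(
    examples: Iterable[Example],
    similarity: Callable[[str, str], float] = token_overlap,
) -> float:
    """Percentage of examples where the concept is closer to the test
    generation than to the control generation. Ties count as half.

    Hypothetical scoring rule: 50 would mean no measurable leakage,
    values above 50 suggest the concept leaks into the generation.
    """
    examples = list(examples)
    if not examples:
        raise ValueError("need at least one example")
    score = 0.0
    for concept, control_gen, test_gen in examples:
        sim_control = similarity(concept, control_gen)
        sim_test = similarity(concept, test_gen)
        if sim_test > sim_control:
            score += 1.0
        elif sim_test == sim_control:
            score += 0.5
    return 100.0 * score / len(examples)


# Made-up generations, for illustration only.
examples = [
    ("yellow", "a teacher", "a yellow school bus driver"),  # concept appears in test generation
    ("green", "pasta", "pasta"),                            # identical generations: a tie
]
rate = leak_rate(examples)
```

In practice the similarity function would be swapped for a stronger semantic measure, since word overlap misses associations such as "yellow" and "school bus" when the concept word itself is not repeated.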
Related papers
- Is It Good Data for Multilingual Instruction Tuning or Just Bad Multilingual Evaluation for Large Language Models? [17.011882550422452]
It is unknown whether the nature of the instruction data has an impact on the model output.
It is questionable whether translated test sets can capture such nuances.
We show that native or generation benchmarks reveal a notable difference between native and translated instruction data.
arXiv Detail & Related papers (2024-06-18T17:43:47Z)
- Multilingual large language models leak human stereotypes across language boundaries [25.903732543380528]
We study how training a model multilingually may lead to stereotypes expressed in one language showing up in the models' behaviour in another.
We propose a measurement framework for stereotype leakage and investigate its effect across English, Russian, Chinese, and Hindi.
We find that GPT-3.5 exhibits the most stereotype leakage, and Hindi is the most susceptible to leakage effects.
arXiv Detail & Related papers (2023-12-12T10:24:17Z)
- Discovering Latent Knowledge in Language Models Without Supervision [72.95136739040676]
Existing techniques for training language models can be misaligned with the truth.
We propose directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way.
We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models.
arXiv Detail & Related papers (2022-12-07T18:17:56Z)
- Transparency Helps Reveal When Language Models Learn Meaning [71.96920839263457]
Our systematic experiments with synthetic data reveal that, with languages where all expressions have context-independent denotations, both autoregressive and masked language models learn to emulate semantic relations between expressions.
Turning to natural language, our experiments with a specific phenomenon -- referential opacity -- add to the growing body of evidence that current language models do not well-represent natural language semantics.
arXiv Detail & Related papers (2022-10-14T02:35:19Z)
- Shapley Head Pruning: Identifying and Removing Interference in Multilingual Transformers [54.4919139401528]
We show that it is possible to reduce interference by identifying and pruning language-specific parameters.
We show that removing identified attention heads from a fixed model improves performance for a target language on both sentence classification and structural prediction.
arXiv Detail & Related papers (2022-10-11T18:11:37Z)
- Uncovering Constraint-Based Behavior in Neural Models via Targeted Fine-Tuning [9.391375268580806]
We show that competing linguistic processes within a language obscure underlying linguistic knowledge.
While human behavior has been found to be similar across languages, we find cross-linguistic variation in model behavior.
Our results suggest that models need to learn both the linguistic constraints in a language and their relative ranking, with mismatches in either producing non-human-like behavior.
arXiv Detail & Related papers (2021-06-02T14:52:11Z)
- Provable Limitations of Acquiring Meaning from Ungrounded Form: What will Future Language Models Understand? [87.20342701232869]
We investigate the abilities of ungrounded systems to acquire meaning.
We study whether assertions enable a system to emulate representations preserving semantic relations like equivalence.
We find that assertions enable semantic emulation if all expressions in the language are referentially transparent.
However, if the language uses non-transparent patterns like variable binding, we show that emulation can become an uncomputable problem.
arXiv Detail & Related papers (2021-04-22T01:00:17Z)
- Limits of Detecting Text Generated by Large-Scale Language Models [65.46403462928319]
Some consider large-scale language models that can generate long and coherent pieces of text as dangerous, since they may be used in misinformation campaigns.
Here we formulate large-scale language model output detection as a hypothesis testing problem to classify text as genuine or generated.
arXiv Detail & Related papers (2020-02-09T19:53:23Z)