Frontier Language Models are not Robust to Adversarial Arithmetic, or
"What do I need to say so you agree 2+2=5?"
- URL: http://arxiv.org/abs/2311.07587v2
- Date: Wed, 15 Nov 2023 19:49:02 GMT
- Title: Frontier Language Models are not Robust to Adversarial Arithmetic, or
"What do I need to say so you agree 2+2=5?"
- Authors: C. Daniel Freeman, Laura Culp, Aaron Parisi, Maxwell L Bileschi,
Gamaleldin F Elsayed, Alex Rizkowsky, Isabelle Simpson, Alex Alemi, Azade
Nova, Ben Adlam, Bernd Bohnet, Gaurav Mishra, Hanie Sedghi, Igor Mordatch,
Izzeddin Gur, Jaehoon Lee, JD Co-Reyes, Jeffrey Pennington, Kelvin Xu, Kevin
Swersky, Kshiteej Mahajan, Lechao Xiao, Rosanne Liu, Simon Kornblith, Noah
Constant, Peter J. Liu, Roman Novak, Yundi Qian, Noah Fiedel, Jascha
Sohl-Dickstein
- Abstract summary: We study the problem of adversarial arithmetic, which provides a simple yet challenging testbed for language model alignment.
This problem consists of arithmetic questions posed in natural language, with an arbitrary adversarial string inserted before the question is complete.
We show that models can be partially hardened against these attacks via reinforcement learning and via agentic constitutional loops.
- Score: 88.59136033348378
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce and study the problem of adversarial arithmetic, which provides
a simple yet challenging testbed for language model alignment. This problem
consists of arithmetic questions posed in natural language, with an arbitrary
adversarial string inserted before the question is complete. Even in the simple
setting of 1-digit addition problems, it is easy to find adversarial prompts
that make all tested models (including PaLM2, GPT4, Claude2) misbehave, and
even to steer models to a particular wrong answer. We additionally provide a
simple algorithm for finding successful attacks by querying those same models,
which we name "prompt inversion rejection sampling" (PIRS). We finally show
that models can be partially hardened against these attacks via reinforcement
learning and via agentic constitutional loops. However, we were not able to
make a language model fully robust against adversarial arithmetic attacks.
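The abstract describes two concrete pieces that a short sketch can make tangible: the attack format (an adversarial string inserted before the arithmetic question is complete) and the PIRS search loop. The following is a minimal reconstruction from the abstract's description only, not the authors' implementation: `query` is a hypothetical stand-in for a model API, and the exact inversion prompt and acceptance test are assumptions.

```python
def query(model, prompt):
    """Hypothetical stand-in for an LLM API call; returns a text completion."""
    raise NotImplementedError

def adversarial_prompt(a, b, attack_string):
    # The adversarial string is inserted before the question is complete.
    return f"What is {a} + {b}? {attack_string} The answer is:"

def prompt_inversion_rejection_sampling(attacker, target, a, b, wrong_answer,
                                        n_samples=50):
    """Minimal PIRS loop: ask one model to propose interjections that would
    steer a reader toward wrong_answer, then keep only proposals that actually
    make the target model emit that wrong answer (the rejection step)."""
    inversion_request = (
        f"Write a short interjection that, inserted into the question "
        f"'What is {a} + {b}?', would convince a reader the answer is {wrong_answer}."
    )
    successes = []
    for _ in range(n_samples):
        candidate = query(attacker, inversion_request)             # propose
        reply = query(target, adversarial_prompt(a, b, candidate))
        if str(wrong_answer) in reply:                             # accept
            successes.append(candidate)
        # otherwise reject and sample again
    return successes
```

The rejection step is what the abstract means by "finding successful attacks by querying those same models": the attacker model proposes candidates, and the target model's own answers decide which candidates count as successes.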
Related papers
- Stream of Search (SoS): Learning to Search in Language [29.841835308845948]
We show how language models can be taught to search by representing the process of search in language as a flattened string (a toy illustration follows this entry).
We propose a unified language for search that captures an array of different symbolic search strategies.
Our results indicate that language models can learn to solve problems via search, self-improve to flexibly use different search strategies, and potentially discover new ones.
arXiv Detail & Related papers (2024-04-01T06:50:52Z)
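As a toy illustration of the flattened-string idea above, the sketch below serializes a small depth-first search over arithmetic combinations into a single text trace. The trace vocabulary (state/try/found/backtrack) and the number-combination task are assumptions chosen for illustration, not the paper's unified search language.

```python
def flatten_dfs(numbers, target, trace=None, depth=0):
    """Serialize a toy depth-first search into one flat string, so a language
    model can be trained on the search process itself, not just the answer."""
    if trace is None:
        trace = []
    trace.append(f"state: {sorted(numbers)} target: {target}")
    if target in numbers:
        trace.append("found: goal reached")
        return True, trace
    if len(numbers) < 2 or depth > 3:
        trace.append("backtrack: dead end")
        return False, trace
    a, b, rest = numbers[0], numbers[1], numbers[2:]
    for op, val in (("+", a + b), ("*", a * b), ("-", abs(a - b))):
        trace.append(f"try: {a}{op}{b}={val}")
        ok, trace = flatten_dfs([val] + rest, target, trace, depth + 1)
        if ok:
            return True, trace
    trace.append("backtrack")
    return False, trace

ok, trace = flatten_dfs([3, 5, 2], 16)
print(" -> ".join(trace))  # one flattened string describing the whole search
```

Training on traces like this, rather than on final answers alone, is what the summary credits with letting models learn to search, backtrack, and self-improve.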
- WorldSense: A Synthetic Benchmark for Grounded Reasoning in Large Language Models [35.088946378980914]
We run our benchmark on three state-of-the-art chat-LLMs (GPT3.5, GPT4 and Llama2-chat).
We show that these models make errors even with as few as three objects.
Errors persist even with chain-of-thought prompting and in-context learning.
arXiv Detail & Related papers (2023-11-27T15:38:17Z)
- Are aligned neural networks adversarially aligned? [93.91072860401856]
Adversarial users can construct inputs that circumvent attempts at alignment.
We show that existing NLP-based optimization attacks are insufficiently powerful to reliably attack aligned text models.
We conjecture that improved NLP attacks may demonstrate a similar level of adversarial control over text-only models.
arXiv Detail & Related papers (2023-06-26T17:18:44Z)
- Interpretability at Scale: Identifying Causal Mechanisms in Alpaca [62.65877150123775]
We use Boundless DAS to efficiently search for interpretable causal structure in large language models while they follow instructions.
Our findings mark a first step toward faithfully understanding the inner workings of our ever-growing and most widely deployed language models.
arXiv Detail & Related papers (2023-05-15T17:15:40Z)
- TASA: Deceiving Question Answering Models by Twin Answer Sentences Attack [93.50174324435321]
We present Twin Answer Sentences Attack (TASA), an adversarial attack method for question answering (QA) models.
TASA produces fluent and grammatical adversarial contexts while maintaining gold answers.
arXiv Detail & Related papers (2022-10-27T07:16:30Z)
- Making Large Language Models Better Reasoners with Step-Aware Verifier [49.16750018427259]
DIVERSE (Diverse Verifier on Reasoning Step) enhances the reasoning capability of language models by sampling diverse reasoning paths and weighting them with a step-aware verifier (a minimal sketch follows this entry).
We evaluate DIVERSE on the latest language model code-davinci and show that it achieves new state-of-the-art results on six of eight reasoning benchmarks.
arXiv Detail & Related papers (2022-06-06T03:38:36Z)
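A minimal sketch of the verifier-weighted voting idea referenced above, assuming DIVERSE samples diverse reasoning paths and scores them with a trained verifier. Both helpers are hypothetical placeholders, and the real verifier is step-aware (it scores individual reasoning steps), which this sketch collapses into a single path-level score.

```python
from collections import defaultdict

def sample_reasoning(model, question, n=20):
    """Hypothetical: return n (reasoning_steps, final_answer) pairs sampled
    from diverse prompts."""
    raise NotImplementedError

def verifier_score(steps):
    """Hypothetical: a trained verifier's correctness probability for a path."""
    raise NotImplementedError

def diverse_answer(model, question):
    # Verifier-weighted voting: each sampled reasoning path votes for its
    # final answer, weighted by how trustworthy the verifier finds the path.
    votes = defaultdict(float)
    for steps, answer in sample_reasoning(model, question):
        votes[answer] += verifier_score(steps)
    return max(votes, key=votes.get)
```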
- Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z)
- A Differentiable Language Model Adversarial Attack on Text Classifiers [10.658675415759697]
We propose a new black-box sentence-level attack for natural language processing.
Our method fine-tunes a pre-trained language model to generate adversarial examples.
We show that the proposed attack outperforms competitors on a diverse set of NLP problems for both computed metrics and human evaluation.
arXiv Detail & Related papers (2021-07-23T14:43:13Z)
- Explain2Attack: Text Adversarial Attacks via Cross-Domain Interpretability [18.92690624514601]
Research has shown that downstream models can be easily fooled by adversarial inputs that resemble the training data but are slightly perturbed in ways imperceptible to humans.
In this paper, we propose Explain2Attack, a black-box adversarial attack on the text classification task.
We show that our framework matches or outperforms the attack rates of state-of-the-art models, with lower query cost and higher efficiency.
arXiv Detail & Related papers (2020-10-14T04:56:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.