Frontier Language Models are not Robust to Adversarial Arithmetic, or
"What do I need to say so you agree 2+2=5?"
- URL: http://arxiv.org/abs/2311.07587v2
- Date: Wed, 15 Nov 2023 19:49:02 GMT
- Title: Frontier Language Models are not Robust to Adversarial Arithmetic, or
"What do I need to say so you agree 2+2=5?"
- Authors: C. Daniel Freeman, Laura Culp, Aaron Parisi, Maxwell L Bileschi,
Gamaleldin F Elsayed, Alex Rizkowsky, Isabelle Simpson, Alex Alemi, Azade
Nova, Ben Adlam, Bernd Bohnet, Gaurav Mishra, Hanie Sedghi, Igor Mordatch,
Izzeddin Gur, Jaehoon Lee, JD Co-Reyes, Jeffrey Pennington, Kelvin Xu, Kevin
Swersky, Kshiteej Mahajan, Lechao Xiao, Rosanne Liu, Simon Kornblith, Noah
Constant, Peter J. Liu, Roman Novak, Yundi Qian, Noah Fiedel, Jascha
Sohl-Dickstein
- Abstract summary: We study the problem of adversarial arithmetic, which provides a simple yet challenging testbed for language model alignment.
This problem consists of arithmetic questions posed in natural language, with an arbitrary adversarial string inserted before the question is complete.
We show that models can be partially hardened against these attacks via reinforcement learning and via agentic constitutional loops.
- Score: 88.59136033348378
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce and study the problem of adversarial arithmetic, which provides
a simple yet challenging testbed for language model alignment. This problem
consists of arithmetic questions posed in natural language, with an arbitrary
adversarial string inserted before the question is complete. Even in the simple
setting of 1-digit addition problems, it is easy to find adversarial prompts
that make all tested models (including PaLM2, GPT4, Claude2) misbehave, and
even to steer models to a particular wrong answer. We additionally provide a
simple algorithm for finding successful attacks by querying those same models,
which we name "prompt inversion rejection sampling" (PIRS). We finally show
that models can be partially hardened against these attacks via reinforcement
learning and via agentic constitutional loops. However, we were not able to
make a language model fully robust against adversarial arithmetic attacks.
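The abstract describes two concrete pieces that a short sketch can make tangible: the attack format (an adversarial string inserted before the arithmetic question is complete) and the PIRS search loop. The following is a minimal reconstruction from the abstract's description only, not the authors' implementation: `query` is a hypothetical stand-in for a model API, and the exact inversion prompt and acceptance test are assumptions.

```python
def query(model, prompt):
    """Hypothetical stand-in for an LLM API call; returns a text completion."""
    raise NotImplementedError

def adversarial_prompt(a, b, attack_string):
    # The adversarial string is inserted before the question is complete.
    return f"What is {a} + {b}? {attack_string} The answer is:"

def prompt_inversion_rejection_sampling(attacker, target, a, b, wrong_answer,
                                        n_samples=50):
    """Minimal PIRS loop: ask one model to propose interjections that would
    steer a reader toward wrong_answer, then keep only proposals that actually
    make the target model emit that wrong answer (the rejection step)."""
    inversion_request = (
        f"Write a short interjection that, inserted into the question "
        f"'What is {a} + {b}?', would convince a reader the answer is {wrong_answer}."
    )
    successes = []
    for _ in range(n_samples):
        candidate = query(attacker, inversion_request)             # propose
        reply = query(target, adversarial_prompt(a, b, candidate))
        if str(wrong_answer) in reply:                             # accept
            successes.append(candidate)
        # otherwise reject and sample again
    return successes
```

The rejection step is what the abstract means by "finding successful attacks by querying those same models": the attacker model proposes candidates, and the target model's own answers decide which candidates count as successes.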
Related papers
- Stream of Search (SoS): Learning to Search in Language [29.841835308845948]
We show how language models can be taught to search by representing the process of search in language as a flattened string (a toy illustration follows this entry).
We propose a unified language for search that captures an array of different symbolic search strategies.
Our results indicate that language models can learn to solve problems via search, self-improve to flexibly use different search strategies, and potentially discover new ones.
arXiv Detail & Related papers (2024-04-01T06:50:52Z)
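As a toy illustration of the flattened-string idea above, the sketch below serializes a small depth-first search over arithmetic combinations into a single text trace. The trace vocabulary (state/try/found/backtrack) and the number-combination task are assumptions chosen for illustration, not the paper's unified search language.

```python
def flatten_dfs(numbers, target, trace=None, depth=0):
    """Serialize a toy depth-first search into one flat string, so a language
    model can be trained on the search process itself, not just the answer."""
    if trace is None:
        trace = []
    trace.append(f"state: {sorted(numbers)} target: {target}")
    if target in numbers:
        trace.append("found: goal reached")
        return True, trace
    if len(numbers) < 2 or depth > 3:
        trace.append("backtrack: dead end")
        return False, trace
    a, b, rest = numbers[0], numbers[1], numbers[2:]
    for op, val in (("+", a + b), ("*", a * b), ("-", abs(a - b))):
        trace.append(f"try: {a}{op}{b}={val}")
        ok, trace = flatten_dfs([val] + rest, target, trace, depth + 1)
        if ok:
            return True, trace
    trace.append("backtrack")
    return False, trace

ok, trace = flatten_dfs([3, 5, 2], 16)
print(" -> ".join(trace))  # one flattened string describing the whole search
```

Training on traces like this, rather than on final answers alone, is what the summary credits with letting models learn to search, backtrack, and self-improve.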
- WorldSense: A Synthetic Benchmark for Grounded Reasoning in Large Language Models [35.088946378980914]
We run our benchmark on three state-of-the-art chat-LLMs (GPT3.5, GPT4 and Llama2-chat).
We show that these models make errors even with as few as three objects.
Errors persist even with chain-of-thought prompting and in-context learning.
arXiv Detail & Related papers (2023-11-27T15:38:17Z)
- Are aligned neural networks adversarially aligned? [93.91072860401856]
Adversarial users can construct inputs that circumvent attempts at alignment.
We show that existing NLP-based optimization attacks are insufficiently powerful to reliably attack aligned text models.
We conjecture that improved NLP attacks may demonstrate a similar level of adversarial control over text-only models.
arXiv Detail & Related papers (2023-06-26T17:18:44Z)
- Interpretability at Scale: Identifying Causal Mechanisms in Alpaca [62.65877150123775]
We use Boundless DAS to efficiently search for interpretable causal structure in large language models while they follow instructions.
Our findings mark a first step toward faithfully understanding the inner workings of our ever-growing and most widely deployed language models.
arXiv Detail & Related papers (2023-05-15T17:15:40Z)
- TASA: Deceiving Question Answering Models by Twin Answer Sentences Attack [93.50174324435321]
We present Twin Answer Sentences Attack (TASA), an adversarial attack method for question answering (QA) models.
TASA produces fluent and grammatical adversarial contexts while maintaining gold answers.
arXiv Detail & Related papers (2022-10-27T07:16:30Z)
- Making Large Language Models Better Reasoners with Step-Aware Verifier [49.16750018427259]
DIVERSE (Diverse Verifier on Reasoning Step) enhances the reasoning capability of language models by sampling diverse reasoning paths and weighting them with a step-aware verifier (a minimal sketch follows this entry).
We evaluate DIVERSE on the latest language model code-davinci and show that it achieves new state-of-the-art results on six of eight reasoning benchmarks.
arXiv Detail & Related papers (2022-06-06T03:38:36Z)
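A minimal sketch of the verifier-weighted voting idea referenced above, assuming DIVERSE samples diverse reasoning paths and scores them with a trained verifier. Both helpers are hypothetical placeholders, and the real verifier is step-aware (it scores individual reasoning steps), which this sketch collapses into a single path-level score.

```python
from collections import defaultdict

def sample_reasoning(model, question, n=20):
    """Hypothetical: return n (reasoning_steps, final_answer) pairs sampled
    from diverse prompts."""
    raise NotImplementedError

def verifier_score(steps):
    """Hypothetical: a trained verifier's correctness probability for a path."""
    raise NotImplementedError

def diverse_answer(model, question):
    # Verifier-weighted voting: each sampled reasoning path votes for its
    # final answer, weighted by how trustworthy the verifier finds the path.
    votes = defaultdict(float)
    for steps, answer in sample_reasoning(model, question):
        votes[answer] += verifier_score(steps)
    return max(votes, key=votes.get)
```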
- Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z)
- A Differentiable Language Model Adversarial Attack on Text Classifiers [10.658675415759697]
We propose a new black-box sentence-level attack for natural language processing.
Our method fine-tunes a pre-trained language model to generate adversarial examples.
We show that the proposed attack outperforms competitors on a diverse set of NLP problems for both computed metrics and human evaluation.
arXiv Detail & Related papers (2021-07-23T14:43:13Z)
- Explain2Attack: Text Adversarial Attacks via Cross-Domain Interpretability [18.92690624514601]
Research has shown that downstream models can be easily fooled by adversarial inputs that resemble the training data but are slightly perturbed in ways imperceptible to humans.
In this paper, we propose Explain2Attack, a black-box adversarial attack on the text classification task.
We show that our framework matches or outperforms the attack rates of state-of-the-art models, with lower query cost and higher efficiency.
arXiv Detail & Related papers (2020-10-14T04:56:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.