Preventing Language Models From Hiding Their Reasoning
- URL: http://arxiv.org/abs/2310.18512v2
- Date: Tue, 31 Oct 2023 19:13:43 GMT
- Title: Preventing Language Models From Hiding Their Reasoning
- Authors: Fabien Roger, Ryan Greenblatt
- Abstract summary: Large language models (LLMs) often benefit from intermediate steps of reasoning to generate answers to complex problems.
In this work, we focus on one potential way intermediate steps of reasoning could be unfaithful: encoded reasoning.
We show that language models can be trained to make use of encoded reasoning to get higher performance without the user understanding the intermediate steps of reasoning.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) often benefit from intermediate steps of reasoning to generate answers to complex problems. When these intermediate steps of reasoning are used to monitor the activity of the model, it is essential that this explicit reasoning is faithful, i.e. that it reflects what the model is actually reasoning about. In this work, we focus on one potential way intermediate steps of reasoning could be unfaithful: encoded reasoning, where an LLM could encode intermediate steps of reasoning in the generated text in a way that is not understandable to human readers. We show that language models can be trained to make use of encoded reasoning to get higher performance without the user understanding the intermediate steps of reasoning. We argue that, as language models get stronger, this behavior becomes more likely to appear naturally. Finally, we describe a methodology that enables the evaluation of defenses against encoded reasoning, and show that, under the right conditions, paraphrasing successfully prevents even the best encoding schemes we built from encoding more than 3 bits of information per KB of text.
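To make the bits-per-kilobyte figure above concrete, here is a minimal, purely illustrative sketch (not the paper's actual encoding schemes or paraphrasing setup; the synonym table and all function names are made up for illustration). It hides one bit per synonym choice and shows how a paraphrasing step that canonicalizes synonyms collapses the hidden channel:

```python
# Toy sketch only: this is NOT the paper's encoding scheme or paraphraser.
# It hides bits in synonym choices and shows how canonicalizing synonyms
# (a stand-in for paraphrasing) erases the hidden channel. The synonym
# table, filler text, and function names are all hypothetical.
import random

# Each slot carries one bit via the choice between two interchangeable words.
SYNONYM_SLOTS = [("big", "large"), ("quick", "fast"), ("start", "begin"),
                 ("show", "demonstrate"), ("use", "utilize")]

FILLER = "the model answers the question and explains its final conclusion"


def encode(bits):
    """Embed one bit per slot by picking a synonym, padded with filler text."""
    words = []
    for bit, (w0, w1) in zip(bits, SYNONYM_SLOTS):
        words.append(w1 if bit else w0)
        words.extend(FILLER.split())
    return " ".join(words)


def decode(text):
    """Read back one bit per slot by checking which synonym appears."""
    tokens = set(text.split())
    return [1 if w1 in tokens else 0 for w0, w1 in SYNONYM_SLOTS]


def paraphrase(text):
    """Stand-in defense: rewrite every second synonym to the first one,
    preserving the meaning but erasing the word-choice channel."""
    canon = {w1: w0 for w0, w1 in SYNONYM_SLOTS}
    return " ".join(canon.get(tok, tok) for tok in text.split())


def matching_bits_per_kb(hidden, text):
    """How many hidden bits still decode correctly, per kilobyte of text."""
    matches = sum(a == b for a, b in zip(hidden, decode(text)))
    return matches / (len(text.encode("utf-8")) / 1024)


if __name__ == "__main__":
    hidden = [random.randint(0, 1) for _ in SYNONYM_SLOTS]
    stego = encode(hidden)
    print("no defense:      ", round(matching_bits_per_kb(hidden, stego), 1), "bits/KB")
    print("after paraphrase:", round(matching_bits_per_kb(hidden, paraphrase(stego)), 1),
          "bits/KB (matches drop to chance level)")
```

The paper's evaluation works on real model outputs with a paraphrasing model; this toy merely illustrates why rewriting surface form while preserving meaning limits how many hidden bits can survive.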
Related papers
- P-FOLIO: Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains [97.25943550933829]
We present P-FOLIO, a human-annotated dataset consisting of diverse and complex reasoning chains.
We use P-FOLIO to evaluate and improve large-language-model (LLM) reasoning capabilities.
arXiv Detail & Related papers (2024-10-11T19:22:57Z)
- Implicit Chain of Thought Reasoning via Knowledge Distillation [58.80851216530288]
Instead of explicitly producing the chain of thought reasoning steps, we use the language model's internal hidden states to perform implicit reasoning.
We find that this approach enables solving tasks previously not solvable without explicit chain-of-thought, at a speed comparable to no chain-of-thought.
arXiv Detail & Related papers (2023-11-02T17:59:49Z)
- Empower Nested Boolean Logic via Self-Supervised Curriculum Learning [67.46052028752327]
We find that pre-trained language models, including large language models, behave like random selectors when faced with multi-nested Boolean logic.
To empower language models with this fundamental capability, this paper proposes a new self-supervised learning method, Curriculum Logical Reasoning (CLR).
arXiv Detail & Related papers (2023-10-09T06:54:02Z)
- Enhancing Zero-Shot Chain-of-Thought Reasoning in Large Language Models through Logic [19.476840373850653]
Large language models hallucinate because their reasoning procedures are unconstrained by logical principles.
We propose LoT (Logical Thoughts), a self-improvement prompting framework that leverages principles rooted in symbolic logic.
Experimental evaluations conducted on language tasks in diverse domains, including arithmetic, commonsense, symbolic, causal inference, and social problems, demonstrate the efficacy of enhanced reasoning by logic.
arXiv Detail & Related papers (2023-09-23T11:21:12Z)
- Meta-Reasoning: Semantics-Symbol Deconstruction for Large Language Models [34.22393697176282]
We propose Meta-Reasoning to broaden symbolic methods' applicability and adaptability in the real world.
This method empowers LLMs to deconstruct reasoning-independent semantic information into generic symbolic representations.
We conduct extensive experiments on more than ten datasets, spanning conventional reasoning tasks such as arithmetic, symbolic, and logical reasoning, as well as more complex interactive reasoning tasks such as theory-of-mind reasoning.
arXiv Detail & Related papers (2023-06-30T17:38:10Z)
- Deductive Verification of Chain-of-Thought Reasoning [22.79166959432764]
Large Language Models (LLMs) benefit from Chain-of-Thought (CoT) prompting in performing various reasoning tasks.
While CoT allows models to produce more comprehensive reasoning processes, its emphasis on intermediate reasoning steps can inadvertently introduce hallucinations and accumulated errors.
We propose Natural Program, a natural language-based deductive reasoning format.
arXiv Detail & Related papers (2023-06-06T17:18:56Z)
- The Magic of IF: Investigating Causal Reasoning Abilities in Large Language Models of Code [74.3873029963285]
Causal reasoning, the ability to identify cause-and-effect relationships, is crucial in human thinking.
We show that Code-LLMs with code prompts are significantly better at causal reasoning.
arXiv Detail & Related papers (2023-05-30T17:02:58Z)
- Learning to Reason and Memorize with Self-Notes [51.17609489687686]
Large language models have been shown to struggle with multi-step reasoning and with retaining earlier reasoning steps for later use.
We propose a simple method for addressing both problems by allowing the model to take Self-Notes.
arXiv Detail & Related papers (2023-05-01T14:02:48Z)
- Chain of Thought Prompting Elicits Reasoning in Large Language Models [56.811278668446825]
This paper explores the ability of language models to generate a coherent chain of thought.
Experiments show that inducing a chain of thought via prompting can enable sufficiently large language models to better perform reasoning tasks.
arXiv Detail & Related papers (2022-01-28T02:33:07Z)
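For context on the last entry above, here is a minimal sketch of what few-shot chain-of-thought prompting looks like; the exemplar is the well-known tennis-ball arithmetic example from that line of work, and no particular model API is assumed (the prompt string is simply printed):

```python
# Minimal sketch of few-shot chain-of-thought (CoT) prompting, in the style of
# Wei et al. (2022). The exemplar and question are the widely cited arithmetic
# examples; no specific model API is assumed here.

DIRECT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: The answer is 11.\n\n"
)

COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 "
    "tennis balls. 5 + 6 = 11. The answer is 11.\n\n"
)

NEW_QUESTION = (
    "Q: The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?\nA:"
)

# Standard prompting asks the model to jump straight to the answer; the CoT
# prompt differs only in that the exemplar answer spells out its intermediate
# steps, which sufficiently large models then imitate on the new question.
standard_prompt = DIRECT_EXEMPLAR + NEW_QUESTION
cot_prompt = COT_EXEMPLAR + NEW_QUESTION

print(cot_prompt)
```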