Preventing Language Models From Hiding Their Reasoning
- URL: http://arxiv.org/abs/2310.18512v2
- Date: Tue, 31 Oct 2023 19:13:43 GMT
- Title: Preventing Language Models From Hiding Their Reasoning
- Authors: Fabien Roger, Ryan Greenblatt
- Abstract summary: Large language models (LLMs) often benefit from intermediate steps of reasoning to generate answers to complex problems.
In this work, we focus on one potential way intermediate steps of reasoning could be unfaithful: encoded reasoning.
We show that language models can be trained to make use of encoded reasoning to get higher performance without the user understanding the intermediate steps of reasoning.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) often benefit from intermediate steps of reasoning to generate answers to complex problems. When these intermediate steps of reasoning are used to monitor the activity of the model, it is essential that this explicit reasoning is faithful, i.e. that it reflects what the model is actually reasoning about. In this work, we focus on one potential way intermediate steps of reasoning could be unfaithful: encoded reasoning, where an LLM could encode intermediate steps of reasoning in the generated text in a way that is not understandable to human readers. We show that language models can be trained to make use of encoded reasoning to get higher performance without the user understanding the intermediate steps of reasoning. We argue that, as language models get stronger, this behavior becomes more likely to appear naturally. Finally, we describe a methodology that enables the evaluation of defenses against encoded reasoning, and show that, under the right conditions, paraphrasing successfully prevents even the best encoding schemes we built from encoding more than 3 bits of information per KB of text.
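To make the bits-per-kilobyte figure above concrete, here is a minimal, purely illustrative sketch (not the paper's actual encoding schemes or paraphrasing setup; the synonym table and all function names are made up for illustration). It hides one bit per synonym choice and shows how a paraphrasing step that canonicalizes synonyms collapses the hidden channel:

```python
# Toy sketch only: this is NOT the paper's encoding scheme or paraphraser.
# It hides bits in synonym choices and shows how canonicalizing synonyms
# (a stand-in for paraphrasing) erases the hidden channel. The synonym
# table, filler text, and function names are all hypothetical.
import random

# Each slot carries one bit via the choice between two interchangeable words.
SYNONYM_SLOTS = [("big", "large"), ("quick", "fast"), ("start", "begin"),
                 ("show", "demonstrate"), ("use", "utilize")]

FILLER = "the model answers the question and explains its final conclusion"


def encode(bits):
    """Embed one bit per slot by picking a synonym, padded with filler text."""
    words = []
    for bit, (w0, w1) in zip(bits, SYNONYM_SLOTS):
        words.append(w1 if bit else w0)
        words.extend(FILLER.split())
    return " ".join(words)


def decode(text):
    """Read back one bit per slot by checking which synonym appears."""
    tokens = set(text.split())
    return [1 if w1 in tokens else 0 for w0, w1 in SYNONYM_SLOTS]


def paraphrase(text):
    """Stand-in defense: rewrite every second synonym to the first one,
    preserving the meaning but erasing the word-choice channel."""
    canon = {w1: w0 for w0, w1 in SYNONYM_SLOTS}
    return " ".join(canon.get(tok, tok) for tok in text.split())


def matching_bits_per_kb(hidden, text):
    """How many hidden bits still decode correctly, per kilobyte of text."""
    matches = sum(a == b for a, b in zip(hidden, decode(text)))
    return matches / (len(text.encode("utf-8")) / 1024)


if __name__ == "__main__":
    hidden = [random.randint(0, 1) for _ in SYNONYM_SLOTS]
    stego = encode(hidden)
    print("no defense:      ", round(matching_bits_per_kb(hidden, stego), 1), "bits/KB")
    print("after paraphrase:", round(matching_bits_per_kb(hidden, paraphrase(stego)), 1),
          "bits/KB (matches drop to chance level)")
```

The paper's evaluation works on real model outputs with a paraphrasing model; this toy merely illustrates why rewriting surface form while preserving meaning limits how many hidden bits can survive.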
Related papers
- P-FOLIO: Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains [97.25943550933829]
We present P-FOLIO, a human-annotated dataset consisting of diverse and complex reasoning chains.
We use P-FOLIO to evaluate and improve large-language-model (LLM) reasoning capabilities.
arXiv Detail & Related papers (2024-10-11T19:22:57Z)
- Implicit Chain of Thought Reasoning via Knowledge Distillation [58.80851216530288]
Instead of explicitly producing the chain of thought reasoning steps, we use the language model's internal hidden states to perform implicit reasoning.
We find that this approach enables solving tasks previously not solvable without explicit chain-of-thought, at a speed comparable to no chain-of-thought.
arXiv Detail & Related papers (2023-11-02T17:59:49Z)
- Empower Nested Boolean Logic via Self-Supervised Curriculum Learning [67.46052028752327]
We find that pre-trained language models, including large language models, behave like random selectors when faced with multi-nested Boolean logic.
To empower language models with this fundamental capability, this paper proposes a new self-supervised learning method, Curriculum Logical Reasoning (CLR).
arXiv Detail & Related papers (2023-10-09T06:54:02Z)
- Enhancing Zero-Shot Chain-of-Thought Reasoning in Large Language Models through Logic [19.476840373850653]
Large language models hallucinate because their reasoning procedures are unconstrained by logical principles.
We propose LoT (Logical Thoughts), a self-improvement prompting framework that leverages principles rooted in symbolic logic.
Experimental evaluations conducted on language tasks in diverse domains, including arithmetic, commonsense, symbolic, causal inference, and social problems, demonstrate the efficacy of enhanced reasoning by logic.
arXiv Detail & Related papers (2023-09-23T11:21:12Z)
- Meta-Reasoning: Semantics-Symbol Deconstruction for Large Language Models [34.22393697176282]
We propose Meta-Reasoning to broaden symbolic methods' applicability and adaptability in the real world.
This method empowers LLMs to deconstruct reasoning-independent semantic information into generic symbolic representations.
We conduct extensive experiments on more than ten datasets, spanning conventional reasoning tasks such as arithmetic, symbolic, and logical reasoning, as well as more complex interactive reasoning tasks such as theory-of-mind reasoning.
arXiv Detail & Related papers (2023-06-30T17:38:10Z)
- Deductive Verification of Chain-of-Thought Reasoning [22.79166959432764]
Large Language Models (LLMs) benefit from Chain-of-Thought (CoT) prompting in performing various reasoning tasks.
While CoT allows models to produce more comprehensive reasoning processes, its emphasis on intermediate reasoning steps can inadvertently introduce hallucinations and accumulated errors.
We propose Natural Program, a natural language-based deductive reasoning format.
arXiv Detail & Related papers (2023-06-06T17:18:56Z)
- The Magic of IF: Investigating Causal Reasoning Abilities in Large Language Models of Code [74.3873029963285]
Causal reasoning, the ability to identify cause-and-effect relationships, is crucial in human thinking.
We show that Code-LLMs with code prompts are significantly better at causal reasoning.
arXiv Detail & Related papers (2023-05-30T17:02:58Z)
- Learning to Reason and Memorize with Self-Notes [51.17609489687686]
Large language models have been shown to struggle with multi-step reasoning and with retaining earlier reasoning steps for later use.
We propose a simple method for addressing both problems by allowing the model to take Self-Notes.
arXiv Detail & Related papers (2023-05-01T14:02:48Z)
- Chain of Thought Prompting Elicits Reasoning in Large Language Models [56.811278668446825]
This paper explores the ability of language models to generate a coherent chain of thought.
Experiments show that inducing a chain of thought via prompting can enable sufficiently large language models to better perform reasoning tasks.
arXiv Detail & Related papers (2022-01-28T02:33:07Z)
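For context on the last entry above, here is a minimal sketch of what few-shot chain-of-thought prompting looks like; the exemplar is the well-known tennis-ball arithmetic example from that line of work, and no particular model API is assumed (the prompt string is simply printed):

```python
# Minimal sketch of few-shot chain-of-thought (CoT) prompting, in the style of
# Wei et al. (2022). The exemplar and question are the widely cited arithmetic
# examples; no specific model API is assumed here.

DIRECT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: The answer is 11.\n\n"
)

COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 "
    "tennis balls. 5 + 6 = 11. The answer is 11.\n\n"
)

NEW_QUESTION = (
    "Q: The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?\nA:"
)

# Standard prompting asks the model to jump straight to the answer; the CoT
# prompt differs only in that the exemplar answer spells out its intermediate
# steps, which sufficiently large models then imitate on the new question.
standard_prompt = DIRECT_EXEMPLAR + NEW_QUESTION
cot_prompt = COT_EXEMPLAR + NEW_QUESTION

print(cot_prompt)
```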