Code Prompting Elicits Conditional Reasoning Abilities in Text+Code LLMs
- URL: http://arxiv.org/abs/2401.10065v2
- Date: Sun, 25 Feb 2024 22:59:07 GMT
- Title: Code Prompting Elicits Conditional Reasoning Abilities in Text+Code LLMs
- Authors: Haritz Puerto, Martin Tutek, Somak Aditya, Xiaodan Zhu, Iryna Gurevych
- Abstract summary: We introduce code prompting, a chain of prompts that transforms a natural language problem into code.
We find that code prompting yields a substantial performance boost for multiple LLMs.
Our analysis of GPT 3.5 reveals that the code formatting of the input problem is essential for performance improvement.
- Score: 69.99031792995348
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Reasoning is a fundamental component of language understanding. Recent
prompting techniques, such as chain of thought, have consistently improved
LLMs' performance on various reasoning tasks. Nevertheless, there is still
little understanding of what triggers reasoning abilities in LLMs in the
inference stage. In this paper, we introduce code prompting, a chain of prompts
that transforms a natural language problem into code and directly prompts the
LLM using the generated code without resorting to external code execution. We
hypothesize that code prompts can elicit certain reasoning capabilities of LLMs
trained on text and code and utilize the proposed method to improve conditional
reasoning, the ability to infer different conclusions depending on the
fulfillment of certain conditions. We find that code prompting yields a
substantial performance boost for multiple LLMs (up to 22.52 percentage points on GPT
3.5, 7.75 on Mixtral, and 16.78 on Mistral) across multiple conditional
reasoning datasets. We then conduct comprehensive experiments to understand how
code prompts trigger reasoning abilities and which capabilities are elicited in
the underlying models. Our analysis of GPT 3.5 reveals that the code formatting
of the input problem is essential for performance improvement. Furthermore,
code prompts improve sample efficiency of in-context learning and facilitate
state tracking of variables or entities.
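The abstract does not spell out what a generated code prompt looks like. As a rough illustration, a conditional rule can be rendered as boolean variables and an if statement, with the original sentences preserved as comments; the sketch below assumes a hypothetical benefits-eligibility question and is not the authors' exact template. Per the abstract, the model is prompted with such code directly, with no external code execution involved.

```python
# Original problem (hypothetical):
#   Rule: "You can claim the benefit if you are over 65 and live in the UK."
#   Scenario: "I am 70 years old. I moved to Spain last year."
#   Question: "Can I claim the benefit?"

# Generated code prompt: each condition becomes an explicit boolean variable
# and the rule becomes an if statement; the source text survives as comments.
over_65 = True        # "I am 70 years old."
lives_in_uk = False   # "I moved to Spain last year."

if over_65 and lives_in_uk:
    answer = "yes"
else:
    answer = "no"
print(answer)  # -> no
```

Making each condition an explicit variable is plausibly what supports the state tracking of variables and entities that the abstract reports.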
Related papers
- Case2Code: Learning Inductive Reasoning with Synthetic Data [105.89741089673575]
We propose the Case2Code task by exploiting the expressiveness and correctness of programs.
We first evaluate representative LLMs on the synthesized Case2Code task and demonstrate that case-to-code induction is challenging for LLMs.
Experimental results show that such induction training not only benefits in-distribution Case2Code performance but also enhances various coding abilities of trained LLMs; a toy instance of the task is sketched below.
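As a concrete illustration of the task format, the sketch below pairs observed input-output cases with a program that explains them; the cases and target function are invented for illustration, not drawn from the paper's synthetic data.

```python
# Hypothetical Case2Code instance: induce a program from observed cases.
cases = [((2,), 4), ((3,), 9), ((-5,), 25)]  # (inputs, expected output)

def f(x: int) -> int:
    """A candidate program consistent with all observed cases."""
    return x * x

# Induced programs can be verified mechanically by execution:
assert all(f(*args) == out for args, out in cases)
print("all cases satisfied")
```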
arXiv Detail & Related papers (2024-07-17T11:35:00Z)
- Source Code Summarization in the Era of Large Language Models [23.715005053430957]
Large language models (LLMs) have brought a great performance boost to code-related tasks.
In this paper, we undertake a systematic and comprehensive study on code summarization in the era of LLMs.
arXiv Detail & Related papers (2024-07-09T05:48:42Z)
- SemCoder: Training Code Language Models with Comprehensive Semantics [24.93484793667691]
We introduce a novel strategy to train Code LLMs with comprehensive semantics.
We propose training Code LLMs to write code and represent and reason about execution behaviors using natural language.
We show that our approach integrates semantics from multiple dimensions more smoothly; a toy illustration of such a training sample follows.
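For illustration only, a training sample in this spirit might pair a snippet with a natural language account of its execution behavior; the schema below is an assumption, not SemCoder's actual data format.

```python
# Hypothetical training sample: code paired with a natural language
# description of its execution behavior (schema is an assumption).
sample = {
    "code": "def head(xs):\n    return xs[0]",
    "execution_reasoning": (
        "head([3, 1, 2]) reads index 0 and returns 3; "
        "head([]) raises IndexError because the list is empty."
    ),
}
print(sample["execution_reasoning"])
```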
arXiv Detail & Related papers (2024-06-03T05:36:57Z)
- Reasoning Runtime Behavior of a Program with LLM: How Far Are We? [25.451857140926943]
Large language models for code (i.e., code LLMs) have shown strong code understanding and generation capabilities.
Code reasoning is one of the most essential abilities of code LLMs.
We propose a framework, namely REval, for evaluating the code reasoning abilities of code LLMs and their consistency with actual program execution, as sketched below.
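A minimal sketch of such a consistency check, assuming an invented program and question rather than REval's actual harness:

```python
def program(n: int) -> int:
    """Toy program: sum the even numbers below n."""
    total = 0
    for i in range(n):
        if i % 2 == 0:
            total += i
    return total

# Question posed to the model: "What does program(5) return?"
model_prediction = 6       # a hypothetical model answer
ground_truth = program(5)  # actual execution: 0 + 2 + 4 = 6
print(model_prediction == ground_truth)  # True -> consistent with execution
```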
arXiv Detail & Related papers (2024-03-25T05:37:16Z)
- InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models [56.723509505549536]
To our knowledge, InfiBench is the first large-scale free-form question-answering (QA) benchmark for code.
It comprises 234 carefully selected high-quality Stack Overflow questions spanning 15 programming languages.
We conduct a systematic evaluation of more than 100 recent code LLMs on InfiBench, leading to a series of novel and insightful findings.
arXiv Detail & Related papers (2024-03-11T02:06:30Z)
- If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents [81.60906807941188]
Large language models (LLMs) are trained on a combination of natural language and formal language (code).
Code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity.
arXiv Detail & Related papers (2024-01-01T16:51:20Z)
- A & B == B & A: Triggering Logical Reasoning Failures in Large Language Models [65.86149763739141]
We introduce LogicAsker, an automatic approach that comprehensively evaluates and improves the logical reasoning abilities of LLMs.
We evaluate LogicAsker on six widely deployed LLMs, including GPT-3, ChatGPT, GPT-4, Bard, Vicuna, and Guanaco.
The results show that test cases from LogicAsker can find logical reasoning failures in different LLMs at rates of 25% to 94%; a toy example of such an equivalence check is sketched below.
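As a toy illustration of the equivalence the title alludes to, the sketch below brute-forces truth tables to verify commutativity of conjunction; the harness is an assumption, not LogicAsker's implementation.

```python
from itertools import product

def equivalent(f, g) -> bool:
    """True iff two two-variable propositional formulas agree on all inputs."""
    return all(f(a, b) == g(a, b) for a, b in product([False, True], repeat=2))

# Commutativity: "A & B" is logically equivalent to "B & A".
print(equivalent(lambda a, b: a and b, lambda a, b: b and a))  # True

# A model should also reject false equivalences, e.g. "A & B" vs "A | B".
print(equivalent(lambda a, b: a and b, lambda a, b: a or b))   # False
```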
arXiv Detail & Related papers (2024-01-01T13:53:53Z)
- At Which Training Stage Does Code Data Help LLMs Reasoning? [21.74241875923737]
This paper explores the impact of code data on Large Language Models (LLMs) at different training stages.
Pre-training LLMs with a mixture of code and text can significantly enhance their general reasoning capability.
At the instruction-tuning stage, code data endows LLMs with task-specific reasoning capability.
arXiv Detail & Related papers (2023-09-28T09:50:27Z)
- Test-Case-Driven Programming Understanding in Large Language Models for Better Code Generation [15.166827643436346]
muFiX is a novel prompting technique to improve the code generation performance of large language models (LLMs).
It first exploits test case analysis to obtain specification understanding and enables a self-improvement process.
muFiX then refines the specification understanding to reduce the gap between the stated understanding and the actual specification.
arXiv Detail & Related papers (2023-09-28T02:58:07Z)
- Code Prompting: a Neural Symbolic Method for Complex Reasoning in Large Language Models [74.95486528482327]
We explore code prompting, a neural symbolic prompting method, with both zero-shot and few-shot versions, that triggers code as intermediate reasoning steps.
We conduct experiments on 7 widely used benchmarks involving symbolic reasoning and arithmetic reasoning.
arXiv Detail & Related papers (2023-05-29T15:14:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.