Related papers: CodeMind: A Framework to Challenge Large Language Models for Code Reasoning

URL: http://arxiv.org/abs/2402.09664v4
Date: Wed, 3 Apr 2024 06:23:48 GMT
Title: CodeMind: A Framework to Challenge Large Language Models for Code Reasoning
Authors: Changshu Liu, Shizhuo Dylan Zhang, Ali Reza Ibrahimzada, Reyhaneh Jabbarvand
Abstract summary: We introduce CodeMind, a framework designed to gauge the code reasoning abilities of Large Language Models (LLMs). CodeMind supports three code reasoning tasks: Independent Execution Reasoning (IER), Dependent Execution Reasoning (DER), and Specification Reasoning (SR).
Score: 1.4027589547318842
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Solely relying on test passing to evaluate Large Language Models (LLMs) for code synthesis may result in unfair assessment or promoting models with data leakage. As an alternative, we introduce CodeMind, a framework designed to gauge the code reasoning abilities of LLMs. CodeMind currently supports three code reasoning tasks: Independent Execution Reasoning (IER), Dependent Execution Reasoning (DER), and Specification Reasoning (SR). The first two evaluate models to predict the execution output of an arbitrary code or code the model could correctly synthesize. The third one evaluates the extent to which LLMs implement the specified expected behavior. Our extensive evaluation of nine LLMs across five benchmarks in two different programming languages using CodeMind shows that LLMs fairly follow control flow constructs and, in general, explain how inputs evolve to output, specifically for simple programs and the ones they can correctly synthesize. However, their performance drops for code with higher complexity, non-trivial logical and arithmetic operators, non-primitive types, and API calls. Furthermore, we observe that, while correlated, specification reasoning (essential for code synthesis) does not imply execution reasoning (essential for broader programming tasks such as testing and debugging): ranking LLMs based on test passing can be different compared to code reasoning.
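The execution reasoning tasks described in the abstract ask a model to predict what a program outputs for a given input. The following is a minimal sketch of how such a prediction could be scored, assuming exact-match comparison against real execution. The names `running_max`, `prediction_is_correct`, and `reasoning_rate` are hypothetical and are not taken from CodeMind; the framework's actual prompts and metrics are not described in this listing.

```python
from typing import Any, Callable, List, Tuple

# A hypothetical test case: (program, input arguments, model's predicted output).
Case = Tuple[Callable[..., Any], Tuple[Any, ...], Any]


def running_max(xs: List[int]) -> List[int]:
    """Example program under test: running maximum of a non-empty list."""
    best = xs[0]
    out = []
    for x in xs:
        if x > best:
            best = x
        out.append(best)
    return out


def prediction_is_correct(func: Callable[..., Any], args: Tuple[Any, ...], predicted: Any) -> bool:
    """Execute the program and compare its real output with the prediction."""
    return func(*args) == predicted


def reasoning_rate(cases: List[Case]) -> float:
    """Fraction of cases where the predicted output matches real execution."""
    correct = sum(prediction_is_correct(f, a, p) for f, a, p in cases)
    return correct / len(cases)


# Two stand-in "model predictions" for the same input: one right, one wrong.
cases: List[Case] = [
    (running_max, ([3, 1, 4, 1, 5],), [3, 3, 4, 4, 5]),
    (running_max, ([3, 1, 4, 1, 5],), [3, 1, 4, 1, 5]),
]
rate = reasoning_rate(cases)
```

In this sketch the predictions are hard-coded; in practice they would come from a model's response, and the comparison rule (exact match here) is an assumption.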

Related papers

IFEvalCode: Controlled Code Generation [69.28317223249358]
The paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs. The authors present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages.
arXiv Detail & Related papers (2025-07-30T08:08:48Z)
Code to Think, Think to Code: A Survey on Code-Enhanced Reasoning and Reasoning-Driven Code Intelligence in LLMs [53.00384299879513]
In large language models (LLMs), code and reasoning reinforce each other. Code provides verifiable execution paths, enforces logical decomposition, and enables runtime validation. We identify key challenges and propose future research directions to strengthen this synergy.
arXiv Detail & Related papers (2025-02-26T18:55:42Z)
A Tool for In-depth Analysis of Code Execution Reasoning of Large Language Models [1.644043499620662]
This paper introduces ExeRScope, a series of tools to analyze the result of code execution reasoning frameworks. Analysis can be generalized to code with similar properties without the urgent need to design more benchmarks.
arXiv Detail & Related papers (2025-01-30T16:56:08Z)
Evaluating and Aligning CodeLLMs on Human Preference [42.26173776584043]
We present a rigorous human-curated benchmark CodeArena to emulate the complexity and diversity of real-world coding tasks. We also propose a diverse synthetic instruction corpus SynCode-Instruct to verify the effectiveness of the large-scale synthetic instruction fine-tuning. The results find performance differences between execution-based benchmarks and CodeArena.
arXiv Detail & Related papers (2024-12-06T17:40:38Z)
Case2Code: Learning Inductive Reasoning with Synthetic Data [105.89741089673575]
We propose a Case2Code task by exploiting the expressiveness and correctness of programs. We first evaluate representative LLMs on the synthesized Case2Code task and demonstrate that case-to-code induction is challenging for LLMs. Experimental results show that such induction training not only improves in-distribution Case2Code performance but also enhances various coding abilities of trained LLMs.
arXiv Detail & Related papers (2024-07-17T11:35:00Z)
Source Code Summarization in the Era of Large Language Models [23.715005053430957]
Large language models (LLMs) have led to a great boost in the performance of code-related tasks. In this paper, we undertake a systematic and comprehensive study on code summarization in the era of LLMs.
arXiv Detail & Related papers (2024-07-09T05:48:42Z)
What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions. We develop a taxonomy of bugs for incorrect code that includes three categories and 12 sub-categories, and analyze the root cause for common bug types. We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
Bug In the Code Stack: Can LLMs Find Bugs in Large Python Code Stacks [1.3586572110652484]
This study explores the capabilities of Large Language Models (LLMs) in retrieving contextual information from large text documents. Our benchmark, Bug In The Code Stack (BICS), is designed to assess the ability of LLMs to identify simple syntax bugs within large source code. Our findings reveal three key insights: (1) code-based environments are significantly more challenging than text-based environments for retrieval tasks, (2) there is a substantial performance disparity among different models, and (3) there is a notable correlation between longer context lengths and performance degradation.
arXiv Detail & Related papers (2024-06-21T17:37:10Z)
Reasoning Runtime Behavior of a Program with LLM: How Far Are We? [25.451857140926943]
Large language models for code (i.e., code LLMs) have shown strong code understanding and generation capabilities. Code reasoning is one of the most essential abilities of code LLMs. We propose a framework, namely REval, for evaluating code reasoning abilities and consistency of code LLMs with program execution.
arXiv Detail & Related papers (2024-03-25T05:37:16Z)
InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models [56.723509505549536]
InfiBench is, to our knowledge, the first large-scale freeform question-answering (QA) benchmark for code. It comprises 234 carefully selected high-quality Stack Overflow questions that span 15 programming languages. We conduct a systematic evaluation of over 100 of the latest code LLMs on InfiBench, leading to a series of novel and insightful findings.
arXiv Detail & Related papers (2024-03-11T02:06:30Z)
Code Prompting Elicits Conditional Reasoning Abilities in Text+Code LLMs [65.2379940117181]
We introduce code prompting, a chain of prompts that transforms a natural language problem into code. We find that code prompting exhibits a high-performance boost for multiple LLMs. Our analysis of GPT 3.5 reveals that the code formatting of the input problem is essential for performance improvement.
arXiv Detail & Related papers (2024-01-18T15:32:24Z)
If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents [81.60906807941188]
Large language models (LLMs) are trained on a combination of natural language and formal language (code). Code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity.
arXiv Detail & Related papers (2024-01-01T16:51:20Z)
LINC: A Neurosymbolic Approach for Logical Reasoning by Combining Language Models with First-Order Logic Provers [60.009969929857704]
Logical reasoning is an important task for artificial intelligence with potential impacts on science, mathematics, and society. In this work, we reformulate such tasks as modular neurosymbolic programming, which we call LINC. We observe significant performance gains on FOLIO and a balanced subset of ProofWriter for three different models in nearly all experimental conditions we evaluate.
arXiv Detail & Related papers (2023-10-23T17:58:40Z)
CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models [43.655927559990616]
We propose CodeApex, a benchmark dataset focusing on the programming comprehension, code generation, and code correction abilities of LLMs. We evaluate 12 widely used LLMs, including both general-purpose and specialized models. GPT-4 exhibits the best programming capabilities, achieving approximate accuracy of 69%, 54%, and 66% on the three tasks, respectively.
arXiv Detail & Related papers (2023-09-05T04:12:01Z)
Coarse-Tuning Models of Code with Reinforcement Learning Feedback [0.0]
Large Language Models (LLMs) pre-trained on code have emerged as the dominant approach to program synthesis. We propose RLCF, which further trains a pre-trained LLM via reinforcement learning, using feedback from a grounding function that scores the quality of the code.
arXiv Detail & Related papers (2023-05-25T22:09:08Z)
LMs: Understanding Code Syntax and Semantics for Code Analysis [25.508254718438636]
We evaluate the capabilities of large language models (LLMs) and their limitations for code analysis in software engineering. We employ four state-of-the-art foundational models, GPT4, GPT3.5, StarCoder and CodeLlama-13b-instruct.
arXiv Detail & Related papers (2023-05-20T08:43:49Z)
CodeT5+: Open Code Large Language Models for Code Understanding and Generation [72.1638273937025]
Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence. CodeT5+ is a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks. We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning.
arXiv Detail & Related papers (2023-05-13T14:23:07Z)

This list is automatically generated from the titles and abstracts of the papers on this site.