CoCoNUT: Structural Code Understanding does not fall out of a tree
- URL: http://arxiv.org/abs/2501.16456v2
- Date: Wed, 29 Jan 2025 05:15:45 GMT
- Title: CoCoNUT: Structural Code Understanding does not fall out of a tree
- Authors: Claas Beger, Saikat Dutta
- Abstract summary: Large Language Models (LLMs) have shown impressive performance across a wide array of tasks involving both structured and unstructured textual data.
Recent results on various benchmarks for code generation, repair, and completion suggest that certain models have programming abilities comparable to, or even surpassing, those of humans.
We demonstrate that high performance on such benchmarks does not correlate with an innate, human-like ability to understand structural control flow in code.
- Score: 4.081759833067852
- License:
- Abstract: Large Language Models (LLMs) have shown impressive performance across a wide array of tasks involving both structured and unstructured textual data. Recent results on various benchmarks for code generation, repair, and completion suggest that certain models have programming abilities comparable to, or even surpassing, those of humans. In this work, we demonstrate that high performance on such benchmarks does not correlate with an innate, human-like ability to understand structural control flow in code. To this end, we extract solutions from the HumanEval benchmark, on which the relevant models perform strongly, and trace their execution paths using function calls sampled from the respective test set. Using this dataset, we investigate the ability of seven state-of-the-art LLMs to match the execution trace and find that, despite their ability to generate semantically identical code, they possess limited ability to trace execution paths, especially for longer traces and specific control structures. We find that even the top-performing model, Gemini, can fully and correctly generate only 47% of HumanEval task traces. Additionally, we introduce a subset for three key structures not contained in HumanEval: Recursion, Parallel Processing, and Object-Oriented Programming, including concepts like Inheritance and Polymorphism. With the exception of OOP, we show that none of the investigated models achieves an accuracy over 5% on the relevant traces. Aggregating these specialized parts with the HumanEval tasks, we present CoCoNUT: Code Control Flow for Navigation Understanding and Testing, which measures a model's ability to trace the execution of code upon relevant calls, including advanced structural components. We conclude that current LLMs need significant improvement to enhance their code reasoning abilities. We hope our dataset helps researchers bridge this gap.
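To make the tracing task concrete, here is a minimal sketch of how a ground-truth execution trace can be recorded for a Python function using the standard-library `sys.settrace` hook. The `example` function and the `record_trace` harness are illustrative stand-ins, not items from the CoCoNUT dataset.

```python
import sys

def record_trace(func, *args, **kwargs):
    """Run func and collect the (function name, line number) pairs it executes."""
    trace = []

    def tracer(frame, event, arg):
        if event == "line":
            trace.append((frame.f_code.co_name, frame.f_lineno))
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.settrace(None)  # always detach the tracer
    return result, trace

def example(n):
    total = 0
    for i in range(n):
        if i % 2 == 0:
            total += i
    return total

result, trace = record_trace(example, 4)
print(result)  # 2
print(trace)   # the line-by-line path a model would have to reproduce
```

A benchmark entry along these lines pairs a sampled call such as `example(4)` with the recorded path, and a model is scored on reproducing that path.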
Related papers
- EpiCoder: Encompassing Diversity and Complexity in Code Generation [49.170195362149386]
We introduce a novel feature tree-based synthesis framework inspired by Abstract Syntax Trees (ASTs).
Unlike ASTs, which capture the syntactic structure of code, our framework models semantic relationships between code elements.
We fine-tuned widely-used base models to create the EpiCoder series, achieving state-of-the-art performance at both the function and file levels.
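For reference, the purely syntactic view that a standard AST provides, and that the feature tree is said to go beyond, can be inspected with Python's built-in `ast` module; the snippet below is our illustration, not EpiCoder code.

```python
import ast

source = """
def add(a, b):
    return a + b
"""

# An AST captures syntax only: node kinds and nesting, not semantic features.
for node in ast.walk(ast.parse(source)):
    print(type(node).__name__)  # Module, FunctionDef, arguments, arg, Return, BinOp, ...
```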
arXiv Detail & Related papers (2025-01-08T18:58:15Z)
- CodeXEmbed: A Generalist Embedding Model Family for Multilingual and Multi-task Code Retrieval [103.116634967815]
We introduce CodeXEmbed, a family of large-scale code embedding models ranging from 400M to 7B parameters.
Our novel training pipeline unifies multiple programming languages and transforms various code-related tasks into a common retrieval framework.
Our 7B model sets a new state-of-the-art (SOTA) in code retrieval, outperforming the previous leading model, Voyage-Code, by over 20% on the CoIR benchmark.
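The retrieval framing itself is straightforward to sketch: embed the query and each candidate snippet, then rank by cosine similarity. The `embed` stub below is a toy stand-in for a real code embedding model such as a CodeXEmbed checkpoint, whose actual API is not shown here.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy stand-in: character-histogram vector, L2-normalized.
    vec = np.zeros(64)
    for ch in text:
        vec[ord(ch) % 64] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

corpus = [
    "def add(a, b): return a + b",
    "def read_file(path): return open(path).read()",
]
query = "function that adds two numbers"

scores = [float(embed(query) @ embed(snippet)) for snippet in corpus]
best = max(zip(scores, corpus))
print(best)  # highest-cosine snippet under this toy embedding
```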
arXiv Detail & Related papers (2024-11-19T16:54:45Z)
- CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models [106.11371409170818]
Large language models (LLMs) can act as agents with capabilities to self-refine and improve generated code autonomously.
We propose CodeTree, a framework for LLM agents to efficiently explore the search space in different stages of the code generation process.
Specifically, we adopt a unified tree structure to explicitly explore different coding strategies, generate the corresponding coding solutions, and subsequently refine them.
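The tree-structured exploration can be sketched as a best-first search over candidate solutions, where the expansion and scoring below are stubs standing in for the LLM agents CodeTree actually uses; everything here is a simplified illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    solution: str
    score: float = 0.0
    children: list["Node"] = field(default_factory=list)

def expand(node: Node) -> list[Node]:
    # Stub: a real agent would propose refined solutions and critique them.
    return [Node(node.solution + "  # refined", node.score + 0.1)]

def best_first(root: Node, steps: int = 3) -> Node:
    frontier, best = [root], root
    for _ in range(steps):
        node = max(frontier, key=lambda n: n.score)  # pick the most promising node
        frontier.remove(node)
        node.children = expand(node)
        frontier += node.children
        best = max([best, *node.children], key=lambda n: n.score)
    return best

print(best_first(Node("def solve(): pass")).solution)
```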
arXiv Detail & Related papers (2024-11-07T00:09:54Z)
- Instructive Code Retriever: Learn from Large Language Model's Feedback for Code Intelligence Tasks [10.867880635762395]
We introduce a novel approach named Instructive Code Retriever (ICR).
ICR is designed to retrieve examples that enhance model inference across various code intelligence tasks and datasets.
We evaluate our model's effectiveness on various tasks, namely code summarization, program synthesis, and bug fixing.
arXiv Detail & Related papers (2024-10-15T05:44:00Z)
- Benchmarking Agentic Workflow Generation [80.74757493266057]
We introduce WorFBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures.
We also present WorFEval, a systemic evaluation protocol utilizing subsequence and subgraph matching algorithms.
We observe that the generated workflows can enhance downstream tasks, enabling them to achieve superior performance with less inference time.
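The subsequence-matching side of such an evaluation can be sketched as a longest-common-subsequence score between the gold and predicted step sequences; the step names below are invented, and WorFEval's actual protocol also includes subgraph matching, which this omits.

```python
def lcs_len(a, b):
    # Classic dynamic-programming longest common subsequence.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

gold = ["fetch_data", "clean", "train", "evaluate"]
pred = ["fetch_data", "train", "clean", "evaluate"]
print(lcs_len(gold, pred) / len(gold))  # 0.75: one step is out of order
```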
arXiv Detail & Related papers (2024-10-10T12:41:19Z)
- COAST: Enhancing the Code Debugging Ability of LLMs through Communicative Agent Based Data Synthesis [29.667170755786508]
We introduce EVAL, a benchmark for evaluating the code-debugging abilities of Large Language Models.
We propose the COmmunicative Agent-based data SynThesis framework, which employs a multi-agent system to generate high-quality training data.
Results demonstrate that COAST-generated data outperform human-curated and GPT-4-generated data.
arXiv Detail & Related papers (2024-08-09T11:35:44Z)
- StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback [58.20547418182074]
We introduce StepCoder, a novel framework for code generation, consisting of two main components.
CCCS addresses the exploration challenge by breaking the long-sequence code generation task into a Curriculum of Code Completion Subtasks.
FGO optimizes the model only on executed code by masking unexecuted code segments, providing Fine-Grained Optimization.
Our method improves the model's ability to explore the output space and outperforms state-of-the-art approaches on the corresponding benchmarks.
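The fine-grained masking idea summarized above can be illustrated with made-up per-line losses: only lines that actually executed under the tests contribute to the optimization signal. This is a schematic of the idea, not StepCoder's implementation.

```python
import numpy as np

per_line_loss = np.array([0.9, 0.4, 1.2, 0.7])  # pretend per-line training losses
executed = np.array([1.0, 1.0, 0.0, 1.0])       # third line never ran under the tests
fgo_loss = (per_line_loss * executed).sum() / executed.sum()
print(fgo_loss)  # ~0.667: the unexecuted line is masked out of the signal
```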
arXiv Detail & Related papers (2024-02-02T13:14:31Z)
- CompCodeVet: A Compiler-guided Validation and Enhancement Approach for Code Dataset [12.58750209611099]
Even models with billions of parameters face challenges in tasks demanding multi-step reasoning.
CompCodeVet is a compiler-guided chain-of-thought (CoT) approach for turning non-compilable code into compilable code.
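The compiler-in-the-loop step is easy to sketch: attempt to compile a snippet and surface the diagnostic that a repair prompt would be built from. The repair itself (an LLM call) is stubbed out; this is our illustration, not CompCodeVet's pipeline.

```python
def compile_check(source: str):
    """Return (ok, error message) for a Python snippet."""
    try:
        compile(source, "<snippet>", "exec")
        return True, None
    except SyntaxError as e:
        return False, f"line {e.lineno}: {e.msg}"

broken = "def f(x)\n    return x + 1"  # missing colon
ok, err = compile_check(broken)
print(ok, err)  # False, e.g. "line 1: expected ':'" (message varies by Python version)
# A repair loop would feed `err` back to a model and re-check until compilation succeeds.
```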
arXiv Detail & Related papers (2023-11-11T08:21:52Z)
- GraphCodeBERT: Pre-training Code Representations with Data Flow [97.00641522327699]
We present GraphCodeBERT, a pre-trained model for programming languages that considers the inherent structure of code.
We use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables.
We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement.
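The "where-the-value-comes-from" relation can be sketched for straight-line Python by linking each variable use to its most recent assignment; real data-flow extraction (and GraphCodeBERT's own pipeline) handles far more than this toy does.

```python
import ast

source = "a = 1\nb = a + 2\nc = b * a\n"

last_def = {}   # variable name -> line of its most recent assignment
edges = []      # (variable, defined-at line, used-at line)
for stmt in ast.parse(source).body:          # assumes simple `x = expr` statements
    for node in ast.walk(stmt.value):
        if isinstance(node, ast.Name) and node.id in last_def:
            edges.append((node.id, last_def[node.id], stmt.lineno))
    last_def[stmt.targets[0].id] = stmt.lineno

print(edges)  # [('a', 1, 2), ('b', 2, 3), ('a', 1, 3)]
```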
arXiv Detail & Related papers (2020-09-17T15:25:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.