Verbal Process Supervision Elicits Better Coding Agents
- URL: http://arxiv.org/abs/2503.18494v1
- Date: Mon, 24 Mar 2025 09:48:59 GMT
- Title: Verbal Process Supervision Elicits Better Coding Agents
- Authors: Hao-Yuan Chen, Cheng-Pong Huang, Jui-Ming Yao
- Abstract summary: This work introduces CURA, a code understanding and reasoning agent system enhanced with verbal process supervision (VPS). CURA achieves a 3.65% improvement over baseline models on challenging benchmarks like BigCodeBench.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The emergence of large language models and their applications as AI agents have significantly advanced state-of-the-art code generation benchmarks, transforming modern software engineering tasks. However, even with test-time computed reasoning models, these systems still struggle with complex software engineering challenges. This work introduces CURA, a code understanding and reasoning agent system enhanced with verbal process supervision (VPS), achieving a 3.65% improvement over baseline models on challenging benchmarks like BigCodeBench. Furthermore, CURA, when paired with the o3-mini model and VPS techniques, attains state-of-the-art performance. This work represents a step forward in integrating reasoning-driven architectures with LLM-based code generation, enabling agentic reasoning for language models to solve complex software engineering tasks.
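The listing carries no source code for CURA; as a rough orientation only, a verbal-process-supervision loop might look like the sketch below. The `llm_complete` function and the prompt wording are hypothetical placeholders, not CURA's actual multi-stage pipeline.

```python
# Minimal sketch of a verbal process supervision (VPS) loop: a supervisor
# model critiques intermediate reasoning in natural language, and that
# feedback is fed back to the generator. `llm_complete` is a hypothetical
# stand-in for any chat-completion API; CURA's actual pipeline is not
# reproduced here.

def llm_complete(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g., an o3-mini endpoint)."""
    raise NotImplementedError

def solve_with_vps(task: str, max_rounds: int = 3) -> str:
    solution = llm_complete(f"Write Python code for this task:\n{task}")
    for _ in range(max_rounds):
        # The supervisor critiques the *process* in free-form text rather
        # than scoring only the final answer.
        critique = llm_complete(
            "Review the reasoning and code below step by step. "
            "Point out any flawed step; reply APPROVED if there is none.\n"
            f"Task: {task}\nSolution:\n{solution}"
        )
        if "APPROVED" in critique:
            break
        # Feed the verbal feedback back into the generator.
        solution = llm_complete(
            f"Task: {task}\nPrevious solution:\n{solution}\n"
            f"Supervisor feedback:\n{critique}\nRevise the solution."
        )
    return solution
```

The key difference from outcome-only supervision is that the feedback is free-form text about intermediate steps, which the generator can act on directly.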
Related papers
- Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute scaling framework that leverages increased inference-time compute instead of larger models.
Our framework incorporates two complementary strategies: internal TTC and external TTC.
We demonstrate that our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
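For intuition, the simplest form of external TTC is best-of-n sampling with a verifier, sketched below under assumed generic `sample` and `score` callables; this is not the paper's framework, which also covers internal (in-model) scaling.

```python
# Best-of-n sketch of *external* test-time compute scaling: spend extra
# inference on more samples, then keep the candidate a verifier scores
# highest. `sample` and `score` are hypothetical placeholders.
from typing import Callable

def best_of_n(sample: Callable[[], str],
              score: Callable[[str], float],
              n: int = 8) -> str:
    candidates = [sample() for _ in range(n)]
    return max(candidates, key=score)
```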
arXiv Detail & Related papers (2025-03-31T07:31:32Z)
- LLM Benchmarking with LLaMA2: Evaluating Code Development Performance Across Multiple Programming Languages [0.1906498126334485]
This paper evaluates the capabilities of the Llama 2-70B model in automating the development of scientific applications written in multiple programming languages.
We assess the model's capacity to generate code, documentation, and unit tests, as well as its ability to translate existing code between programming languages.
Our results indicate that while Llama 2-70B frequently generates syntactically correct and functional code for simpler numerical tasks, it encounters substantial difficulties with more complex, parallelized, or distributed computations.
arXiv Detail & Related papers (2025-03-24T23:46:14Z)
- Code to Think, Think to Code: A Survey on Code-Enhanced Reasoning and Reasoning-Driven Code Intelligence in LLMs [53.00384299879513]
In large language models (LLMs), code and reasoning reinforce each other. Code provides verifiable execution paths, enforces logical decomposition, and enables runtime validation. We identify key challenges and propose future research directions to strengthen this synergy.
arXiv Detail & Related papers (2025-02-26T18:55:42Z)
- CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation [24.090719826360342]
We introduce CodeIF, the first benchmark designed to assess the abilities of Large Language Models (LLMs) to adhere to task-oriented instructions within code generation scenarios. We conduct extensive experiments with LLMs, analyzing their strengths and limitations in meeting the demands of these tasks.
arXiv Detail & Related papers (2025-02-26T14:19:49Z)
- Guided Code Generation with LLMs: A Multi-Agent Framework for Complex Code Tasks [1.9198713957364215]
Large Language Models (LLMs) have shown remarkable capabilities in code generation tasks. They face significant limitations in handling complex, long-context programming challenges. This paper introduces a novel agentic framework for guided code generation.
arXiv Detail & Related papers (2025-01-11T19:21:53Z)
- Enhancing Multi-Step Reasoning Abilities of Language Models through Direct Q-Function Optimization [49.362750475706235]
Reinforcement Learning (RL) plays a crucial role in aligning large language models with human preferences and improving their ability to perform complex tasks. We introduce Direct Q-function Optimization (DQO), which formulates the response generation process as a Markov Decision Process (MDP) and utilizes the soft actor-critic (SAC) framework to optimize a Q-function directly parameterized by the language model. Experimental results on two math problem-solving datasets, GSM8K and MATH, demonstrate that DQO outperforms previous methods, establishing it as a promising offline reinforcement learning approach for aligning language models.
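For orientation, DQO builds on the standard soft actor-critic relations below, where the state is the prompt plus the tokens generated so far and the action is the next token; the paper's exact losses and notation may differ.

```latex
% Standard soft-Q / SAC relations underlying DQO (generic form; the
% paper's exact objective may differ). Q_theta is parameterized by the
% language model's logits; alpha is the entropy temperature, gamma the
% discount factor.
\[
\begin{aligned}
V(s_t) &= \alpha \log \sum_{a} \exp\big(Q_\theta(s_t, a)/\alpha\big), \\
Q_\theta(s_t, a_t) &\approx r(s_t, a_t) + \gamma\, V(s_{t+1}).
\end{aligned}
\]
```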
arXiv Detail & Related papers (2024-10-11T23:29:20Z)
- Retrieval-Augmented Instruction Tuning for Automated Process Engineering Calculations: A Tool-Chaining Problem-Solving Framework with Attributable Reflection [0.0]
We introduce a novel autonomous agent framework leveraging Retrieval-Augmented Instruction-Tuning (RAIT) to enhance open, customizable small code language models (SLMs).
By combining instruction-tuned code SLMs with Retrieval-Augmented Code Generation (RACG) using external tools, the agent generates, debugs, and optimizes code from natural language specifications.
Our approach addresses the current lack of a foundational AI model for specialized process engineering tasks and offers the benefits of explainability, knowledge editing, and cost-effectiveness.
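A minimal sketch of such a generate-debug-optimize loop, assuming hypothetical `retrieve` and `slm_generate` placeholders rather than the paper's framework:

```python
# Hedged sketch of a retrieval-augmented generate-and-debug loop in the
# spirit of RAIT/RACG: retrieve context, generate code with a small code
# LM, execute it, and feed runtime errors back for repair. `retrieve`
# and `slm_generate` are hypothetical placeholders.
import subprocess
import sys
import tempfile

def retrieve(spec: str) -> str:
    """Placeholder: fetch relevant docs/snippets for the spec."""
    return ""

def slm_generate(prompt: str) -> str:
    """Placeholder for the instruction-tuned small code language model."""
    raise NotImplementedError

def generate_and_debug(spec: str, max_fixes: int = 3) -> str:
    code = slm_generate(f"Context:\n{retrieve(spec)}\nSpec:\n{spec}\n"
                        "Write a Python script.")
    for _ in range(max_fixes):
        with tempfile.NamedTemporaryFile("w", suffix=".py",
                                         delete=False) as f:
            f.write(code)
        result = subprocess.run([sys.executable, f.name],
                                capture_output=True, text=True, timeout=30)
        if result.returncode == 0:
            break  # the script ran cleanly; stop repairing
        # Feed the runtime traceback back to the model for a targeted fix.
        code = slm_generate(f"Spec:\n{spec}\nCode:\n{code}\n"
                            f"Error:\n{result.stderr}\nFix the code.")
    return code
```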
arXiv Detail & Related papers (2024-08-28T15:33:47Z)
- Knowledge Graph Modeling-Driven Large Language Model Operating System (LLM OS) for Task Automation in Process Engineering Problem-Solving [0.0]
We present the Process Engineering Operations Assistant (PEOA), an AI-driven framework designed to solve complex problems in the chemical and process industries.
The framework employs a modular architecture orchestrated by a meta-agent, which serves as the central coordinator.
The results demonstrate the framework's effectiveness in automating calculations, accelerating prototyping, and providing AI-augmented decision support for industrial processes.
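As a rough illustration of a meta-agent coordinating specialized modules, consider the sketch below; the module names and the direct routing rule are invented for illustration, whereas the paper drives coordination with knowledge graph modeling.

```python
# Rough sketch of a meta-agent routing tasks to specialized modules, in
# the spirit of PEOA's central-coordinator architecture. Module names
# and the routing rule are invented for illustration only.
from typing import Callable, Dict

class MetaAgent:
    def __init__(self) -> None:
        self.modules: Dict[str, Callable[[str], str]] = {}

    def register(self, skill: str, handler: Callable[[str], str]) -> None:
        self.modules[skill] = handler

    def route(self, skill: str, task: str) -> str:
        # A real coordinator would classify the task itself (e.g., via a
        # knowledge graph or an LLM) rather than take `skill` directly.
        return self.modules[skill](task)

agent = MetaAgent()
agent.register("unit_conversion", lambda t: f"converted: {t}")
print(agent.route("unit_conversion", "350 K to degrees Celsius"))
```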
arXiv Detail & Related papers (2024-08-23T13:52:47Z)
- Machine Learning Insides OptVerse AI Solver: Design Principles and Applications [74.67495900436728]
We present a comprehensive study on the integration of machine learning (ML) techniques into Huawei Cloud's OptVerse AI solver.
We showcase our methods for generating complex SAT and MILP instances utilizing generative models that mirror the multifaceted structures of real-world problems.
We detail the incorporation of state-of-the-art parameter tuning algorithms which markedly elevate solver performance.
arXiv Detail & Related papers (2024-01-11T15:02:15Z)
- MoTCoder: Elevating Large Language Models with Modular of Thought for Challenging Programming Tasks [50.61968901704187]
We introduce a framework for MoT instruction tuning, designed to promote the decomposition of tasks into logical sub-tasks and sub-modules.
Our investigations reveal that, through the cultivation and utilization of sub-modules, MoTCoder significantly improves both the modularity and correctness of the generated solutions.
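A minimal sketch of the modular-of-thought idea (outline sub-modules, implement each, then compose), assuming a hypothetical `llm` completion function rather than MoTCoder's instruction-tuning pipeline:

```python
# Minimal sketch of modular-of-thought style generation: outline named
# sub-modules first, implement each, then compose them. `llm` is a
# hypothetical completion function; MoTCoder's actual training pipeline
# is not reproduced here.

def llm(prompt: str) -> str:
    """Placeholder for any LLM completion call."""
    raise NotImplementedError

def modular_solve(problem: str) -> str:
    # Step 1: decompose into sub-module signatures, one per line.
    outline = llm("List Python function signatures, one per line, that "
                  f"decompose this problem:\n{problem}")
    # Step 2: implement each sub-module independently.
    helpers = [llm(f"Problem: {problem}\nImplement only:\n{sig}")
               for sig in outline.splitlines() if sig.strip()]
    # Step 3: compose a driver that wires the sub-modules together.
    body = "\n\n".join(helpers)
    driver = llm(f"Problem: {problem}\nUsing these helpers:\n{body}\n"
                 "Write a main() that solves the problem.")
    return body + "\n\n" + driver
```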
arXiv Detail & Related papers (2023-12-26T08:49:57Z)
- Evaluating GPT's Programming Capability through CodeWars' Katas [0.5512295869673147]
This paper presents a novel evaluation of the programming proficiency of Generative Pretrained Transformer (GPT) models.
The experiments reveal a distinct boundary at the 3kyu level, beyond which these GPT models struggle to provide solutions.
The research emphasizes the need for validation and creative thinking capabilities in AI models to better emulate human problem-solving techniques.
arXiv Detail & Related papers (2023-05-31T10:36:16Z)
- CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning [92.36705236706678]
"CodeRL" is a new framework for program synthesis tasks through pretrained LMs and deep reinforcement learning.
During inference, we introduce a new generation procedure with a critical sampling strategy.
For the model backbones, we extend the encoder-decoder architecture of CodeT5 with enhanced learning objectives.
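A simplified sketch of critic-guided sampling with unit-test feedback follows; in CodeRL the critic is a learned model over fine-grained test outcomes rather than the plain pass/fail filter assumed here.

```python
# Hedged sketch of critic-style sampling: draw candidate programs, run
# the example unit tests, and keep only the passing ones. This is a
# simplification; CodeRL's actual critic is a learned model that ranks
# and regenerates programs from fine-grained unit-test outcomes.
from typing import Callable, List

def critic_filter(generate: Callable[[], str],
                  passes_tests: Callable[[str], bool],
                  n: int = 16) -> List[str]:
    candidates = (generate() for _ in range(n))
    return [program for program in candidates if passes_tests(program)]
```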
arXiv Detail & Related papers (2022-07-05T02:42:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.