Patched RTC: evaluating LLMs for diverse software development tasks
- URL: http://arxiv.org/abs/2407.16557v1
- Date: Tue, 23 Jul 2024 15:12:14 GMT
- Title: Patched RTC: evaluating LLMs for diverse software development tasks
- Authors: Asankhaya Sharma,
- Abstract summary: This paper introduces Patched Round-Trip Correctness (Patched RTC), a novel evaluation technique for Large Language Models (LLMs)
Patched RTC offers a self-evaluating framework that measures consistency and robustness of model responses without human intervention.
Experiments comparing GPT-3.5 and GPT-4 models across different software development tasks reveal that Patched RTC effectively distinguishes model performance and task difficulty.
- Score: 1.14219428942199
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces Patched Round-Trip Correctness (Patched RTC), a novel evaluation technique for Large Language Models (LLMs) applied to diverse software development tasks, particularly focusing on "outer loop" activities such as bug fixing, code review, and documentation updates. Patched RTC extends the original Round-Trip Correctness method to work with any LLM and downstream task, offering a self-evaluating framework that measures consistency and robustness of model responses without human intervention. The study demonstrates a correlation between Patched RTC scores and task-specific accuracy metrics, presenting it as an alternative to the LLM-as-Judge paradigm for open-domain task evaluation. We implement Patched RTC in an open-source framework called patchwork, allowing for transparent evaluation during inference across various patchflows. Experiments comparing GPT-3.5 and GPT-4 models across different software development tasks reveal that Patched RTC effectively distinguishes model performance and task difficulty. The paper also explores the impact of consistency prompts on improving model accuracy, suggesting that Patched RTC can guide prompt refinement and model selection for complex software development workflows.
Related papers
- Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute scaling framework that leverages increased inference-time instead of larger models.
Our framework incorporates two complementary strategies: internal TTC and external TTC.
We demonstrate our textbf32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z) - Collab: Controlled Decoding using Mixture of Agents for LLM Alignment [90.6117569025754]
Reinforcement learning from human feedback has emerged as an effective technique to align Large Language models.
Controlled Decoding provides a mechanism for aligning a model at inference time without retraining.
We propose a mixture of agent-based decoding strategies leveraging the existing off-the-shelf aligned LLM policies.
arXiv Detail & Related papers (2025-03-27T17:34:25Z) - AURORA:Automated Training Framework of Universal Process Reward Models via Ensemble Prompting and Reverse Verification [31.463529258956452]
We present AURORA, a novel framework for training universal process reward models (PRMs) using ensemble prompting and reverse verification.
The framework employs a two-phase approach: First, it uses diverse prompting strategies and ensemble methods to perform automated annotation and evaluation of processes.
To assess the framework's performance, we extend beyond the existing ProcessBench benchmark by introducing UniversalBench.
arXiv Detail & Related papers (2025-02-17T07:41:27Z) - Empowering Large Language Models in Wireless Communication: A Novel Dataset and Fine-Tuning Framework [81.29965270493238]
We develop a specialized dataset aimed at enhancing the evaluation and fine-tuning of large language models (LLMs) for wireless communication applications.
The dataset includes a diverse set of multi-hop questions, including true/false and multiple-choice types, spanning varying difficulty levels from easy to hard.
We introduce a Pointwise V-Information (PVI) based fine-tuning method, providing a detailed theoretical analysis and justification for its use in quantifying the information content of training data.
arXiv Detail & Related papers (2025-01-16T16:19:53Z) - Code Review Automation Via Multi-task Federated LLM -- An Empirical Study [4.8342038441006805]
The study explores five simple techniques for multi-task training, including two sequential methods, one parallel method, and two cumulative methods.
The results indicate that sequentially training a federated LLM (FedLLM) for our code review multi-task use case is less efficient in terms of time, computation, and performance metrics, compared to training separate models for each task.
arXiv Detail & Related papers (2024-12-20T08:46:46Z) - Model Contribution Rate Theory: An Empirical Examination [0.0]
The paper presents a systematic methodology for analyzing software developer productivity by refining contribution rate metrics to distinguish meaningful development efforts from anomalies.
The findings provide actionable insights for optimizing team performance and workflow management in modern software engineering practices.
arXiv Detail & Related papers (2024-12-08T15:56:23Z) - Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities.
In-Context Learning (ICL) and.
Efficient Fine-Tuning (PEFT) are currently two mainstream methods for augmenting.
LLMs to downstream tasks.
We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
arXiv Detail & Related papers (2024-09-30T10:48:20Z) - Peer Review as A Multi-Turn and Long-Context Dialogue with Role-Based Interactions [62.0123588983514]
Large Language Models (LLMs) have demonstrated wide-ranging applications across various fields.
We reformulate the peer-review process as a multi-turn, long-context dialogue, incorporating distinct roles for authors, reviewers, and decision makers.
We construct a comprehensive dataset containing over 26,841 papers with 92,017 reviews collected from multiple sources.
arXiv Detail & Related papers (2024-06-09T08:24:17Z) - Heuristic Algorithm-based Action Masking Reinforcement Learning (HAAM-RL) with Ensemble Inference Method [0.0]
This paper presents a novel reinforcement learning approach called HAAMRL (Heuristic ensemble-based Action Masking Reinforcement Learning)
The proposed approach exhibits superior performance and capability generalization, indicating superior effectiveness in optimizing complex manufacturing processes.
arXiv Detail & Related papers (2024-03-21T03:42:39Z) - Unsupervised Evaluation of Code LLMs with Round-Trip Correctness [25.557158930295465]
We introduce round-trip correctness (RTC) as an alternative evaluation method.
RTC rests on the idea that we can ask a model to make a prediction.
We show how to employ RTC to evaluate code synthesis and editing.
arXiv Detail & Related papers (2024-02-13T11:08:08Z) - A Thorough Examination of Decoding Methods in the Era of LLMs [72.65956436513241]
Decoding methods play an indispensable role in converting language models from next-token predictors into practical task solvers.
This paper provides a comprehensive and multifaceted analysis of various decoding methods within the context of large language models.
Our findings reveal that decoding method performance is notably task-dependent and influenced by factors such as alignment, model size, and quantization.
arXiv Detail & Related papers (2024-02-10T11:14:53Z) - Task-Distributionally Robust Data-Free Meta-Learning [99.56612787882334]
Data-Free Meta-Learning (DFML) aims to efficiently learn new tasks by leveraging multiple pre-trained models without requiring their original training data.
For the first time, we reveal two major challenges hindering their practical deployments: Task-Distribution Shift ( TDS) and Task-Distribution Corruption (TDC)
arXiv Detail & Related papers (2023-11-23T15:46:54Z) - Assessing and Improving Syntactic Adversarial Robustness of Pre-trained
Models for Code Translation [19.186392871168064]
CoTR consists of two components: CoTR-A and CoTR-D.
The effectiveness of CoTR is evaluated through experiments on real world Java to Python datasets.
arXiv Detail & Related papers (2023-10-28T04:35:24Z) - Let's reward step by step: Step-Level reward model as the Navigators for
Reasoning [64.27898739929734]
Process-Supervised Reward Model (PRM) furnishes LLMs with step-by-step feedback during the training phase.
We propose a greedy search algorithm that employs the step-level feedback from PRM to optimize the reasoning pathways explored by LLMs.
To explore the versatility of our approach, we develop a novel method to automatically generate step-level reward dataset for coding tasks and observed similar improved performance in the code generation tasks.
arXiv Detail & Related papers (2023-10-16T05:21:50Z) - Benchmarking Test-Time Adaptation against Distribution Shifts in Image
Classification [77.0114672086012]
Test-time adaptation (TTA) is a technique aimed at enhancing the generalization performance of models by leveraging unlabeled samples solely during prediction.
We present a benchmark that systematically evaluates 13 prominent TTA methods and their variants on five widely used image classification datasets.
arXiv Detail & Related papers (2023-07-06T16:59:53Z) - Balancing Average and Worst-case Accuracy in Multitask Learning [39.59582055620513]
We show how to use techniques from the distributionally robust optimization (DRO) literature to improve worst-case performance in multitask learning.
We present an improved method, Lookahead-DRO (L-DRO), which mitigates these issues.
Our empirical results show that L-DRO achieves a better trade-off between average and worst-case accuracy with little computational overhead.
arXiv Detail & Related papers (2021-10-12T09:00:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.