RustEvo^2: An Evolving Benchmark for API Evolution in LLM-based Rust Code Generation
- URL: http://arxiv.org/abs/2503.16922v1
- Date: Fri, 21 Mar 2025 07:33:59 GMT
- Title: RustEvo^2: An Evolving Benchmark for API Evolution in LLM-based Rust Code Generation
- Authors: Linxi Liang, Jing Gong, Mingwei Liu, Chong Wang, Guangsheng Ou, Yanlin Wang, Xin Peng, Zibin Zheng
- Abstract summary: RustEvo is a framework for evaluating the ability of large language models to adapt to evolving Rust APIs. It automates dataset creation by synthesizing 588 API changes into programming tasks mirroring real-world challenges. Experiments on state-of-the-art (SOTA) LLMs reveal significant performance variations.
- Score: 28.156862491709237
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Large Language Models (LLMs) have become pivotal tools for automating code generation in software development. However, these models face significant challenges in producing version-aware code for rapidly evolving languages like Rust, where frequent Application Programming Interfaces (API) changes across versions lead to compatibility issues and correctness errors. Existing benchmarks lack systematic evaluation of how models navigate API transitions, relying on labor-intensive manual curation and offering limited version-specific insights. To address this gap, we present RustEvo, a novel framework for constructing dynamic benchmarks that evaluate the ability of LLMs to adapt to evolving Rust APIs. RustEvo automates dataset creation by synthesizing 588 API changes (380 from Rust standard libraries, 208 from 15 third-party crates) into programming tasks mirroring real-world challenges. These tasks cover four API evolution categories: Stabilizations, Signature Changes, Behavioral Changes, and Deprecations, reflecting their actual distribution in the Rust ecosystem. Experiments on state-of-the-art (SOTA) LLMs reveal significant performance variations: models achieve a 65.8% average success rate on stabilized APIs but only 38.0% on behavioral changes, highlighting difficulties in detecting semantic shifts without signature alterations. Knowledge cutoff dates strongly influence performance, with models scoring 56.1% on before-cutoff APIs versus 32.5% on after-cutoff tasks. Retrieval-Augmented Generation (RAG) mitigates this gap, improving success rates by 13.5% on average for APIs released after model training. Our findings underscore the necessity of our evolution-aware benchmarks to advance the adaptability of LLMs in fast-paced software ecosystems. The framework and the benchmarks are publicly released at https://github.com/SYSUSELab/RustEvo.
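To make the benchmark's change categories concrete, the following is a minimal Rust sketch (our own illustration, not drawn from the RustEvo task set) of two of the four categories: a Deprecation (`str::trim_left` was deprecated in Rust 1.33 in favor of `str::trim_start`) and a Stabilization (`str::split_once` was stabilized in Rust 1.52, so code targeting older toolchains must fall back to `splitn`).

    fn main() {
        let record = "  key=value";

        // Deprecation: `record.trim_left()` still compiles but has warned
        // since Rust 1.33; a version-aware model should emit `trim_start()`.
        let trimmed = record.trim_start();

        // Stabilization: `split_once` is only available on Rust >= 1.52.
        let (key, value) = trimmed.split_once('=').expect("missing '='");

        // Pre-1.52 fallback that a model targeting an older compiler must use.
        let mut parts = trimmed.splitn(2, '=');
        let key_old = parts.next().expect("missing key");
        let value_old = parts.next().expect("missing value");

        assert_eq!((key, value), (key_old, value_old));
        println!("{} = {}", key, value);
    }

Behavioral Changes, by contrast, alter semantics without touching signatures, which is consistent with the abstract's finding that models score lowest (38.0%) on that category.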
Related papers
- EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test [25.703729145091483]
A growing trend in the LLM community is scaling up training data to improve model intelligence without increasing inference costs. We observe that scaling up data provides limited improvements for EAGLE. We introduce EAGLE-3, which abandons feature prediction in favor of direct token prediction and replaces reliance on top-layer features with multi-layer feature fusion.
arXiv Detail & Related papers (2025-03-03T18:59:04Z)
- Learning to Solve and Verify: A Self-Play Framework for Code and Test Generation [69.62857948698436]
Recent advances in large language models (LLMs) have improved their performance on coding benchmarks.
However, improvement is plateauing due to the exhaustion of readily available high-quality data.
We propose Sol-Ver, a self-play solver-verifier framework that jointly improves a single model's code and test generation capacity.
arXiv Detail & Related papers (2025-02-20T18:32:19Z)
- How to Learn a New Language? An Efficient Solution for Self-Supervised Learning Models Unseen Languages Adaption in Low-Resource Scenario [72.02391485962127]
Speech Self-Supervised Learning (SSL) models achieve impressive performance on Automatic Speech Recognition (ASR).
In low-resource language ASR, they encounter the domain mismatch problem between pre-trained and low-resource languages.
We extend a conventional efficient fine-tuning scheme based on the adapter to handle these issues.
arXiv Detail & Related papers (2024-11-27T10:51:00Z)
- COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement [80.18490952057125]
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks.
We propose Context-Wise Order-Agnostic Language Modeling (COrAL) to overcome these challenges.
Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally.
arXiv Detail & Related papers (2024-10-12T23:56:19Z)
- Revisiting VerilogEval: A Year of Improvements in Large-Language Models for Hardware Code Generation [6.463959200930805]
We evaluate new commercial and open models since the release of the open-source VerilogEval benchmark. We find measurable improvements in state-of-the-art models. We find that prompt engineering remains crucial for achieving good pass rates.
arXiv Detail & Related papers (2024-08-20T17:58:56Z)
- Mitigating the Impact of Malware Evolution on API Sequence-based Windows Malware Detector [5.953199557879621]
Methods based on API sequences play a crucial role in malware prevention.
Evolved malware samples often use the API sequences of the pre-evolution samples to achieve similar malicious behaviors.
We propose MME, a framework that can enhance existing API sequence-based malware detectors.
arXiv Detail & Related papers (2024-08-03T04:21:24Z)
- Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models [84.8919069953397]
Self-TAught Recognizer (STAR) is an unsupervised adaptation framework for speech recognition systems.
We show that STAR achieves an average of 13.5% relative reduction in word error rate across 14 target domains.
STAR exhibits high data efficiency, requiring less than one hour of unlabeled data.
arXiv Detail & Related papers (2024-05-23T04:27:11Z)
- Octopus: On-device language model for function calling of software APIs [9.78611123915888]
Large Language Models (LLMs) play a crucial role due to their advanced text processing and generation abilities.
This study introduces a new strategy aimed at harnessing on-device LLMs in invoking software APIs.
arXiv Detail & Related papers (2024-04-02T01:29:28Z)
- StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models [74.88844320554284]
We introduce StableToolBench, a benchmark evolving from ToolBench. The virtual API server contains a caching system and API simulators, which together mitigate changes in API status. The stable evaluation system designs solvable pass and win rates, using GPT-4 as the automatic evaluator, to eliminate randomness during evaluation.
arXiv Detail & Related papers (2024-03-12T14:57:40Z)
- GEVO-ML: Optimizing Machine Learning Code with Evolutionary Computation [6.525197444717069]
GEVO-ML is a tool for discovering optimization opportunities and tuning the performance of Machine Learning kernels.
We demonstrate GEVO-ML on two different ML workloads for both model training and prediction.
GEVO-ML finds significant improvements for these models, achieving 90.43% performance improvement when model accuracy is relaxed by 2%.
arXiv Detail & Related papers (2023-10-16T09:24:20Z)
- Multi-Granularity Detector for Vulnerability Fixes [13.653249890867222]
We propose MiDas (Multi-Granularity Detector for Vulnerability Fixes) to identify vulnerability-fixing commits.
MiDas constructs different neural networks for each level of code change granularity, corresponding to commit-level, file-level, hunk-level, and line-level.
MiDas outperforms the current state-of-the-art baseline in terms of AUC by 4.9% and 13.7% on Java and Python-based datasets.
arXiv Detail & Related papers (2023-05-23T10:06:28Z)
- Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute (a potential speedup of up to $\times 3$) while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.