Generating Unseen Code Tests In Infinitum
- URL: http://arxiv.org/abs/2407.19772v1
- Date: Mon, 29 Jul 2024 08:11:20 GMT
- Title: Generating Unseen Code Tests In Infinitum
- Authors: Marcel Zalmanovici, Orna Raz, Eitan Farchi, Iftach Freund
- Abstract summary: We present a method for creating benchmark variations that generalize across coding tasks and programming languages.
We implement one benchmark, called auto-regression, for the task of text-to-code generation in Python.
- Score: 1.0674604700001968
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are used for many tasks, including those related to coding. An important aspect of being able to utilize LLMs is the ability to assess their fitness for specific usages. The common practice is to evaluate LLMs against a set of benchmarks. While benchmarks provide a sound foundation for evaluation and comparison of alternatives, they suffer from the well-known weakness of leaking into the training data [Xu2024Benchmarking]. We present a method for creating benchmark variations that generalize across coding tasks and programming languages, and may also be applied to in-house code bases. Our approach enables ongoing generation of test data, thus mitigating the training-data leakage issue. We implement one benchmark, called auto-regression, for the task of text-to-code generation in Python. Auto-regression is specifically created to aid in debugging and in tracking model generation changes as part of the LLM regression testing process.
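The core mechanism, regenerating benchmark items on demand so they cannot already sit in a training corpus, can be illustrated with a minimal sketch. The template, task family, and checker below are placeholders for illustration only, not the paper's actual auto-regression benchmark.

```python
import random

def make_variation(seed: int):
    """Produce one fresh text-to-code task from a parameterized template.

    Illustrative placeholder only -- not the paper's auto-regression benchmark.
    """
    rng = random.Random(seed)
    a, b = rng.randint(2, 9), rng.randint(-10, 10)
    prompt = f"Write a Python function f(x) that returns {a} * x + {b}."

    def reference(x):
        return a * x + b

    inputs = [rng.randint(-100, 100) for _ in range(5)]
    return prompt, reference, inputs

def passes(candidate, reference, inputs) -> bool:
    """Check a model-generated function against the reference on sampled inputs."""
    return all(candidate(x) == reference(x) for x in inputs)

# Every new seed yields a previously unseen variation, so test data can be
# regenerated indefinitely, e.g. for each round of LLM regression testing.
prompt, reference, inputs = make_variation(seed=2024)
```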
Related papers
- Training on the Benchmark Is Not All You Need [52.01920740114261]
We propose a simple and effective data leakage detection method based on the contents of multiple-choice options.
Our method is able to work under black-box conditions without access to model training data or weights.
We evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets.
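One way such a black-box probe can work, sketched here under the assumption of a generic log-likelihood scoring callable (hypothetical, not the paper's exact detector), is to test whether the model prefers the benchmark's published option ordering over shuffles of the same option contents:

```python
import random

def order_preference_score(log_likelihood, question, options, n_shuffles=20, seed=0):
    """Black-box leakage probe: a model that memorized a benchmark item tends to
    assign higher likelihood to the original option ordering than to random
    shufflings of the same option contents.

    `log_likelihood` is a hypothetical callable mapping a prompt string to a
    log-probability under the model being audited.
    """
    def render(opts):
        letters = "ABCDEFGH"
        return question + "\n" + "\n".join(
            f"{letter}. {opt}" for letter, opt in zip(letters, opts)
        )

    rng = random.Random(seed)
    original = log_likelihood(render(options))
    wins = 0
    for _ in range(n_shuffles):
        shuffled = options[:]
        rng.shuffle(shuffled)
        wins += original > log_likelihood(render(shuffled))
    # Values near 1.0 suggest the item, in its published form, was seen in training.
    return wins / n_shuffles
```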
arXiv Detail & Related papers (2024-09-03T11:09:44Z)
- DOMAINEVAL: An Auto-Constructed Benchmark for Multi-Domain Code Generation [48.11754113512047]
This study includes a code generation benchmark dataset DOMAINEVAL, encompassing six popular domains.
Our pipeline works in a fully automated manner, enabling a push-button construction from code repositories into formatted subjects under study.
The contributions include the DOMAINEVAL dataset, a fully automated pipeline for constructing code benchmarks, and an analysis of the limitations of LLMs in code generation based on their performance on DOMAINEVAL.
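A push-button pipeline of this kind can be pictured as walking a repository, pairing functions with their tests, and emitting formatted subjects. The sketch below assumes a simple test_<name> naming convention and is a stand-in, not the DOMAINEVAL pipeline itself.

```python
import ast
from pathlib import Path

def collect_subjects(repo_root: str):
    """Pair top-level functions in a repository with same-named tests.

    A simplified stand-in for an automated benchmark-construction pipeline;
    a real pipeline would also classify domains, deduplicate, and validate
    that every test actually runs.
    """
    functions, tests = {}, {}
    for path in Path(repo_root).rglob("*.py"):
        source = path.read_text(encoding="utf-8")
        try:
            tree = ast.parse(source, filename=str(path))
        except SyntaxError:
            continue  # skip files that do not parse
        for node in tree.body:
            if isinstance(node, ast.FunctionDef):
                bucket = tests if node.name.startswith("test_") else functions
                bucket[node.name.removeprefix("test_")] = ast.get_source_segment(source, node)
    # A subject is only emitted when the function comes with an executable test.
    return [
        {"name": name, "code": code, "test": tests[name]}
        for name, code in functions.items()
        if name in tests
    ]
```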
arXiv Detail & Related papers (2024-08-23T16:33:58Z)
- SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents [10.730852617039451]
We investigate the capability of LLM-based Code Agents to formalize user issues into test cases.
We propose a novel benchmark based on popular GitHub repositories, containing real-world issues, ground-truth bug-fixes, and golden tests.
We find that LLMs generally perform surprisingly well at generating relevant test cases, with Code Agents designed for code repair exceeding the performance of systems designed for test generation.
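The natural success criterion for issue-to-test generation is fail-to-pass: a generated test should fail on the buggy code and pass once the ground-truth fix is applied. A sketch under that assumption (the benchmark's actual harness may differ), where apply_fix is a hypothetical callable that applies the known-good patch:

```python
import subprocess

def fail_to_pass(test_file: str, repo: str, apply_fix) -> bool:
    """Judge a generated test: it must fail before the ground-truth fix
    and pass after it. `apply_fix` is a hypothetical callable that applies
    the known-good patch inside `repo`.
    """
    def run() -> bool:
        result = subprocess.run(
            ["python", "-m", "pytest", test_file, "-x", "-q"],
            cwd=repo, capture_output=True,
        )
        return result.returncode == 0  # True means the test passed

    fails_on_buggy_code = not run()
    apply_fix(repo)
    passes_on_fixed_code = run()
    return fails_on_buggy_code and passes_on_fixed_code
```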
arXiv Detail & Related papers (2024-06-18T14:54:37Z)
- TESTEVAL: Benchmarking Large Language Models for Test Case Generation [15.343859279282848]
We propose TESTEVAL, a novel benchmark for test case generation with large language models (LLMs).
We collect 210 Python programs from an online programming platform, LeetCode, and design three different tasks: overall coverage, targeted line/branch coverage, and targeted path coverage.
We find that generating test cases to cover specific program lines/branches/paths is still challenging for current LLMs.
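Targeted line coverage can be checked with a lightweight tracer; the sketch below is a generic stand-in for a coverage tool, not TESTEVAL's own harness.

```python
import sys

def lines_executed(fn, filename, *args, **kwargs):
    """Run `fn` and return the set of line numbers executed in `filename`.

    A minimal tracer used as a stand-in for a real coverage tool.
    """
    hit = set()

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code.co_filename == filename:
            hit.add(frame.f_lineno)
        return tracer

    sys.settrace(tracer)
    try:
        fn(*args, **kwargs)
    finally:
        sys.settrace(None)
    return hit

def covers_targets(test_fn, program_file, target_lines) -> bool:
    """Does a (model-generated) test reach every targeted line of the program?"""
    return set(target_lines) <= lines_executed(test_fn, program_file)
```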
arXiv Detail & Related papers (2024-06-06T22:07:50Z)
- CodecLM: Aligning Language Models with Tailored Synthetic Data [51.59223474427153]
We introduce CodecLM, a framework for adaptively generating high-quality synthetic data for instruction-following abilities.
We first encode seed instructions into metadata, which are concise keywords generated on-the-fly to capture the target instruction distribution.
We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples.
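The encode-decode loop can be sketched with a generic llm(prompt) completion callable (hypothetical); Self-Rubrics and Contrastive Filtering are collapsed here into a single judge score, a deliberate simplification of the paper's procedure.

```python
def encode_to_metadata(llm, seed_instruction: str) -> str:
    """Encode a seed instruction into concise keywords (use case, required skills)."""
    return llm(
        "Summarize the following instruction as a few keywords describing "
        f"its use case and required skills:\n{seed_instruction}"
    )

def decode_to_instruction(llm, metadata: str) -> str:
    """Decode metadata back into a new, tailored synthetic instruction."""
    return llm(f"Write one challenging instruction matching these keywords:\n{metadata}")

def generate_synthetic_data(llm, judge, seeds, keep_threshold=0.7):
    """Encode seeds to metadata, decode tailored instructions, then keep only
    samples a judge scores highly -- a rough stand-in for Self-Rubrics plus
    Contrastive Filtering. `llm` and `judge` are hypothetical callables."""
    kept = []
    for seed in seeds:
        instruction = decode_to_instruction(llm, encode_to_metadata(llm, seed))
        if judge(instruction) >= keep_threshold:
            kept.append(instruction)
    return kept
```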
arXiv Detail & Related papers (2024-04-08T21:15:36Z)
- Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM [13.324171480106715]
EvoEval is a program synthesis benchmark suite created by evolving existing benchmarks into different targeted domains.
Our study shows a significant drop in performance on EvoEval compared to the high scores obtained on standard benchmarks like HumanEval.
We showcase various insights, including the brittleness of instruction-following models when encountering rewording or subtle changes.
arXiv Detail & Related papers (2024-03-28T03:10:39Z)
- RLTF: Reinforcement Learning from Unit Test Feedback [17.35361167578498]
Reinforcement Learning from Unit Test Feedback (RLTF) is a novel online RL framework that uses multi-granularity unit test feedback to refine code LLMs.
Our approach generates data in real-time during training and simultaneously utilizes fine-grained feedback signals to guide the model towards producing higher-quality code.
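The multi-granularity signal can be illustrated as a reward over unit-test outcomes, where a coarse pass/fail verdict is refined with partial credit and a penalty for crashes; a simplified sketch, not the paper's exact reward shaping.

```python
def unit_test_reward(results) -> float:
    """Map unit-test outcomes to a scalar reward for RL fine-tuning.

    `results` is a list of per-test outcomes: "pass", "fail", or "error"
    (syntax/runtime crash). Coarse signal: did everything pass?
    Finer signal: partial credit per passing test, strongest penalty for
    programs that do not even run. A simplified stand-in for RLTF's
    multi-granularity feedback.
    """
    if not results:
        return -1.0
    if any(r == "error" for r in results):      # program crashed: strongest penalty
        return -1.0
    passed = sum(r == "pass" for r in results)
    if passed == len(results):                  # coarse-grained: full pass
        return 1.0
    return -0.3 + 0.3 * passed / len(results)   # fine-grained partial credit

# Example: 3 of 4 tests pass, none crash.
reward = unit_test_reward(["pass", "pass", "fail", "pass"])  # -0.075
```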
arXiv Detail & Related papers (2023-07-10T05:18:18Z)
- Coarse-Tuning Models of Code with Reinforcement Learning Feedback [0.0]
Large Language Models (LLMs) pre-trained on code have emerged as the dominant approach to program synthesis.
We propose RLCF, which further trains a pre-trained LLM via reinforcement learning, using feedback from a grounding function that scores the quality of the generated code.
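One plausible shape for such a grounding function combines a cheap static check with a learned score; the sketch below uses Python's compile() plus a hypothetical discriminator callable and only approximates the idea, not RLCF's actual grounding function.

```python
def grounding_score(code: str, prompt: str, discriminator) -> float:
    """Score generated code for use as an RL reward.

    Combines a static well-formedness check (does the code parse?) with a
    learned judgment of whether the code matches the prompt.
    `discriminator(prompt, code)` is a hypothetical callable returning a
    value in [0, 1]; this is an illustration, not RLCF's grounding function.
    """
    try:
        compile(code, "<generated>", "exec")
    except SyntaxError:
        return 0.0                      # unparsable code gets no reward
    return discriminator(prompt, code)  # semantic quality from a learned model
```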
arXiv Detail & Related papers (2023-05-25T22:09:08Z)
- LeTI: Learning to Generate from Textual Interactions [60.425769582343506]
We explore LMs' potential to learn from textual interactions (LETI) that not only check their correctness with binary labels but also pinpoint and explain errors in their outputs through textual feedback.
Our focus is the code generation task, where the model produces code based on natural language instructions.
LETI iteratively fine-tunes the model, using the LM objective, on a concatenation of natural language instructions, LM-generated programs, and textual feedback.
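Because the training example is a plain concatenation optimized with the ordinary LM objective, it can be sketched directly; the delimiter strings below are illustrative placeholders, not LETI's actual format.

```python
def build_training_example(instruction: str, generated_program: str,
                           binary_label: str, textual_feedback: str) -> str:
    """Concatenate instruction, the LM's own program, and feedback into one
    sequence for standard LM fine-tuning. Delimiters are illustrative
    placeholders, not LETI's actual special tokens.
    """
    return (
        f"<instruction>\n{instruction}\n"
        f"<program>\n{generated_program}\n"
        f"<label> {binary_label}\n"            # e.g. "success" / "failure"
        f"<feedback>\n{textual_feedback}\n"    # e.g. an interpreter error trace
    )

example = build_training_example(
    "Return the sum of a list of integers.",
    "def total(xs):\n    return sum(xs)",
    "success",
    "All tests passed.",
)
```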
arXiv Detail & Related papers (2023-05-17T15:53:31Z)
- LEVER: Learning to Verify Language-to-Code Generation with Execution [64.36459105535]
We propose LEVER, a simple approach to improve language-to-code generation by learning to verify the generated programs with their execution results.
Specifically, we train verifiers to determine whether a program sampled from the LLMs is correct or not based on the natural language input, the program itself and its execution results.
LEVER consistently improves over the base code LLMs (4.6% to 10.9% with code-davinci) and achieves new state-of-the-art results on all of the evaluated datasets.
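At inference time the verifier is used to rescore sampled programs; the sketch below combines the generator's probability with the verifier's correctness estimate for each (input, program, execution result) triple, omitting any aggregation over candidates with identical results. Here, execute and verifier are hypothetical callables.

```python
def rerank_with_verifier(nl_query, candidates, execute, verifier):
    """Rerank sampled programs by generation probability times the verifier's
    estimated probability of correctness given the query, the program, and
    its execution result.

    `candidates` is a list of (program, generation_probability) pairs;
    `execute` and `verifier` are hypothetical callables. This simplification
    skips aggregating candidates that share the same execution result.
    """
    scored = []
    for program, gen_prob in candidates:
        result = execute(program)                        # run in a sandbox
        p_correct = verifier(nl_query, program, result)  # value in [0, 1]
        scored.append((gen_prob * p_correct, program, result))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[0][1]  # best program under the joint score
```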
arXiv Detail & Related papers (2023-02-16T18:23:22Z)
- T5Score: Discriminative Fine-tuning of Generative Evaluation Metrics [94.69907794006826]
We present a framework that combines the best of both worlds, using both supervised and unsupervised signals from whatever data we have available.
We operationalize this idea by training T5Score, a metric that uses these training signals with mT5 as the backbone.
T5Score achieves the best performance on all datasets against existing top-scoring metrics at the segment level.
arXiv Detail & Related papers (2022-12-12T06:29:04Z)