CPP-UT-Bench: Can LLMs Write Complex Unit Tests in C++?
- URL: http://arxiv.org/abs/2412.02735v1
- Date: Tue, 03 Dec 2024 18:35:24 GMT
- Title: CPP-UT-Bench: Can LLMs Write Complex Unit Tests in C++?
- Authors: Vaishnavi Bhargava, Rajat Ghosh, Debojyoti Dutta
- Abstract summary: CPP-UT-Bench is a benchmark dataset to measure the C++ unit test generation capability of large language models (LLMs).
The dataset includes 2,653 {code, unit test} pairs drawn from 14 different open-source C++ codebases.
- Abstract: We introduce CPP-UT-Bench, a benchmark dataset to measure the C++ unit test generation capability of large language models (LLMs). CPP-UT-Bench aims to reflect a broad and diverse set of C++ codebases found in the real world. The dataset includes 2,653 {code, unit test} pairs drawn from 14 different open-source C++ codebases spanning nine diverse domains, including machine learning, software testing, parsing, standard input-output, data engineering, logging, complete expression evaluation, key-value storage, and server protocols. We demonstrated the effectiveness of CPP-UT-Bench as a benchmark dataset through extensive experiments in in-context learning, parameter-efficient fine-tuning (PEFT), and full-parameter fine-tuning. We also discussed the challenges of dataset compilation and the insights we learned from the in-context learning and fine-tuning experiments. Besides the CPP-UT-Bench dataset and the data compilation code, we are also offering the fine-tuned model weights for further research. In nine out of ten experiments, our fine-tuned LLMs outperformed the corresponding base models by an average of more than 70%.
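For illustration, a {code, unit test} pair of the kind CPP-UT-Bench collects might look like the minimal sketch below; the function, file names, and the use of GoogleTest are assumptions made for this example, not a sample from the dataset.

```cpp
// clamp_utils.h -- hypothetical library code (the "code" half of a pair)
#pragma once

// Clamps value into the inclusive range [lo, hi].
inline int ClampToRange(int value, int lo, int hi) {
    if (value < lo) return lo;
    if (value > hi) return hi;
    return value;
}

// clamp_utils_test.cc -- the corresponding "unit test" half, here written
// with GoogleTest (framework choice is an assumption for this sketch).
#include <gtest/gtest.h>

TEST(ClampToRangeTest, ReturnsValueWhenInsideRange) {
    EXPECT_EQ(ClampToRange(5, 0, 10), 5);
}

TEST(ClampToRangeTest, ClampsToLowerBound) {
    EXPECT_EQ(ClampToRange(-3, 0, 10), 0);
}

TEST(ClampToRangeTest, ClampsToUpperBound) {
    EXPECT_EQ(ClampToRange(42, 0, 10), 10);
}
```

Such a test file would typically be compiled against the GoogleTest library and linked with gtest_main to produce a runnable test binary.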
Related papers
- UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance [65.01483640267885]
Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge.
We introduce UnitCoder, a systematic pipeline leveraging model-generated unit tests to guide and validate the code generation process.
Our work presents a scalable approach that leverages model-generated unit tests to guide the synthesis of high-quality code data from pre-training corpora.
arXiv Detail & Related papers (2025-02-17T05:37:02Z)
- CITYWALK: Enhancing LLM-Based C++ Unit Test Generation via Project-Dependency Awareness and Language-Specific Knowledge [13.592814106490724]
CITYWALK is a novel framework for C++ unit test generation.
It provides a comprehensive understanding of the dependency relationships within the project under test via program analysis.
It incorporates language-specific knowledge about C++ derived from project documentation and empirical observations.
arXiv Detail & Related papers (2025-01-27T15:49:24Z)
- A Large Language Model Approach to Identify Flakiness in C++ Projects [3.549578374095042]
Flaky tests introduce non-deterministic behaviour and undermine the reliability of regression testing results.
We propose an LLM-based approach for identifying the root cause of flaky tests in C++ projects at the code level.
We fine-tune the Mistral-7B, Llama2-7B, and CodeLlama-7B models on the C++ dataset and an existing Java dataset, and evaluate their performance in terms of precision, recall, accuracy, and F1 score.
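As an illustrative sketch (not drawn from the paper), a C++ test becomes flaky when its outcome depends on factors such as wall-clock timing:

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical example of a timing-dependent (flaky) test: it passes when the
// simulated work finishes quickly, but can fail under load or on a slow CI
// machine, even though the code under test has not changed.
void TestBackgroundWorkFinishesQuickly() {
    auto start = std::chrono::steady_clock::now();
    std::this_thread::sleep_for(std::chrono::milliseconds(50));  // stands in for real work
    auto elapsed = std::chrono::steady_clock::now() - start;
    // Asserting on a hard real-time bound makes the outcome non-deterministic.
    assert(elapsed < std::chrono::milliseconds(60));
}

int main() {
    TestBackgroundWorkFinishesQuickly();
    return 0;
}
```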
arXiv Detail & Related papers (2024-12-16T20:20:45Z)
- ExecRepoBench: Multi-level Executable Code Completion Evaluation [45.963424627710765]
We introduce a novel framework for enhancing code completion in software development through the creation of a repository-level benchmark ExecRepoBench.
We present a multi-level grammar-based completion methodology conditioned on the abstract syntax tree to mask code fragments at various logical units.
Then, we fine-tune an open-source 7B-parameter LLM on Repo-Instruct to produce a strong code completion baseline model, Qwen2.5-Coder-Instruct-C.
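As a rough illustration of masking at a logical unit of the syntax tree, a statement-level completion instance might look like the following sketch; the `<MASK>` token and the example function are assumptions for this example, not ExecRepoBench's actual format.

```cpp
// Hypothetical repository-level completion instance: one statement-level
// AST node has been masked out and the model must fill it in.
#include <vector>

int SumOfSquares(const std::vector<int>& xs) {
    int total = 0;
    for (int x : xs) {
        // <MASK>   <-- masked logical unit the model must complete
    }
    return total;
}

// Reference (ground-truth) completion for the masked statement:
//     total += x * x;
```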
arXiv Detail & Related papers (2024-12-16T17:14:35Z)
- MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models [66.64809260956312]
We propose a multi-granularity tool-use benchmark for large language models called MTU-Bench.
Our MTU-Bench is collected by transforming existing high-quality datasets to simulate real-world tool usage scenarios.
Comprehensive experimental results demonstrate the effectiveness of our MTU-Bench.
arXiv Detail & Related papers (2024-10-15T15:46:17Z)
- Adapting Vision-Language Models to Open Classes via Test-Time Prompt Tuning [50.26965628047682]
Adapting pre-trained models to open classes is a challenging problem in machine learning.
In this paper, we consider combining the advantages of both and come up with a test-time prompt tuning approach.
Our proposed method outperforms all comparison methods on average considering both base and new classes.
arXiv Detail & Related papers (2024-08-29T12:34:01Z)
- Creating a Dataset for High-Performance Computing Code Translation using LLMs: A Bridge Between OpenMP Fortran and C++ [7.872005563259838]
The effectiveness of our dataset is assessed using both quantitative (CodeBLEU) and qualitative (human evaluation) methods.
Models without prior coding knowledge experienced a boost of $\mathbf{\times 5.1}$ in CodeBLEU scores.
Models with some coding familiarity saw an impressive $\mathbf{\times 9.9}$-fold increase.
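As a hedged illustration of the C++/OpenMP side of such Fortran-to-C++ pairs (this snippet is an assumption made for this example, not a sample from the dataset), a translated parallel loop might look like:

```cpp
#include <cstdio>
#include <vector>

// Hypothetical C++/OpenMP counterpart of a Fortran "!$omp parallel do" loop:
// a SAXPY-style update parallelized across iterations.
int main() {
    const int n = 1'000'000;
    const double a = 2.5;
    std::vector<double> x(n, 1.0), y(n, 0.5);

    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }

    std::printf("y[0] = %f\n", y[0]);
    return 0;
}
```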
arXiv Detail & Related papers (2023-07-15T02:35:51Z)
- LeTI: Learning to Generate from Textual Interactions [60.425769582343506]
We explore LMs' potential to learn from textual interactions (LETI) that not only check their correctness with binary labels but also pinpoint and explain errors in their outputs through textual feedback.
Our focus is the code generation task, where the model produces code based on natural language instructions.
LETI iteratively fine-tunes the model, using the LM objective, on a concatenation of natural language instructions, LM-generated programs, and textual feedback.
arXiv Detail & Related papers (2023-05-17T15:53:31Z)
- DataComp: In search of the next generation of multimodal datasets [179.79323076587255]
DataComp is a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl.
Our benchmark consists of multiple compute scales spanning four orders of magnitude.
In particular, our best baseline, DataComp-1B, enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet.
arXiv Detail & Related papers (2023-04-27T11:37:18Z)
- GEMv2: Multilingual NLG Benchmarking in a Single Line of Code [161.1761414080574]
GEMv2, the new version of the Generation, Evaluation, and Metrics Benchmark, introduces a modular infrastructure for dataset, model, and metric developers.
GEMv2 supports 40 documented datasets in 51 languages.
Models for all datasets can be evaluated online and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark.
arXiv Detail & Related papers (2022-06-22T17:52:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.