Related papers: CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution

URL: http://arxiv.org/abs/2401.03065v1
Date: Fri, 5 Jan 2024 20:53:51 GMT
Title: CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
Authors: Alex Gu, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, Sida I. Wang
Abstract summary: We present a benchmark consisting of 800 Python functions (3-13 lines). Each function comes with an input-output pair, leading to two natural tasks: input prediction and output prediction. We show that simple CoT and fine-tuning schemes can improve performance on our benchmark but remain far from solving it.
Score: 36.30158138035512
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present CRUXEval (Code Reasoning, Understanding, and eXecution Evaluation), a benchmark consisting of 800 Python functions (3-13 lines). Each function comes with an input-output pair, leading to two natural tasks: input prediction and output prediction. First, we propose a generic recipe for generating our execution benchmark which can be used to create future variations of the benchmark. Second, we evaluate twenty code models on our benchmark and discover that many recent high-scoring models on HumanEval do not show the same improvements on our benchmark. Third, we show that simple CoT and fine-tuning schemes can improve performance on our benchmark but remain far from solving it. The best setup, GPT-4 with chain of thought (CoT), achieves a pass@1 of 75% and 81% on input and output prediction, respectively. In contrast, Code Llama 34B achieves a pass@1 of 50% and 46% on input and output prediction, highlighting the gap between open and closed source models. As no model is close to acing CRUXEval, we provide examples of consistent GPT-4 failures on simple programs as a lens into its code reasoning capabilities and areas for improvement.
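
Illustration: the two tasks named in the abstract can be sketched in a few lines of Python. This is a minimal sketch, not the benchmark's actual harness. The function `f` is a made-up example in the style the abstract describes (a short Python function), and the helper names `check_output_prediction`, `check_input_prediction`, and `pass_at_1` are hypothetical. The sketch assumes that an input prediction counts as correct whenever running the function on it reproduces the given output, and that exactly one prediction is made per problem, in which case pass@1 is simply the fraction of problems solved.

```python
# Hypothetical example function (NOT taken from the benchmark itself).
def f(nums):
    # Sum of the even numbers in the list.
    return sum(n for n in nums if n % 2 == 0)


def check_output_prediction(func, given_input, predicted_output):
    """Output prediction: the model sees the function and an input and
    must predict the output. Correct if execution agrees."""
    return func(given_input) == predicted_output


def check_input_prediction(func, predicted_input, given_output):
    """Input prediction: the model sees the function and an output and
    must propose an input. Assumption: any input that reproduces the
    output is accepted, not only the original one."""
    return func(predicted_input) == given_output


def pass_at_1(results):
    """Fraction of problems solved, assuming one prediction per problem."""
    return sum(results) / len(results)


# One output-prediction check and two input-prediction checks for output 6.
results = [
    check_output_prediction(f, [1, 2, 4], 6),  # 2 + 4 == 6
    check_input_prediction(f, [6], 6),         # a different input, same output
    check_input_prediction(f, [1, 3], 6),      # no even numbers: sum is 0
]
score = pass_at_1(results)
```

Because several inputs can map to the same output, checking input prediction by execution rather than by string match avoids penalizing a valid answer that differs from the reference input.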

Related papers

CodeReasoner: Enhancing the Code Reasoning Ability with Reinforcement Learning [8.197518276987989]
Code reasoning is a fundamental capability for large language models (LLMs) in the code domain. Prior approaches mainly rely on supervised fine-tuning to improve performance in code reasoning tasks. We argue this is due to two core issues: the low quality of training data and the limitations of supervised fine-tuning. We propose CodeReasoner, a framework that spans both dataset construction and a two-stage training process.
arXiv Detail & Related papers (2025-07-23T14:26:58Z)
Value-Guided Search for Efficient Chain-of-Thought Reasoning [43.99559903458839]
We train a 1.5B token-level value model and apply it to DeepSeek models for improved performance with test-time compute scaling. With an inference budget of 64 generations, VGS with DeepSeek-R1-Distill-1.5B achieves an average accuracy of 45.7% across four competition math benchmarks.
arXiv Detail & Related papers (2025-05-23T01:05:07Z)
Output Scaling: YingLong-Delayed Chain of Thought in a Large Pretrained Time Series Forecasting Model [55.25659103706409]
This framework achieves state-of-the-art performance for our designed foundation model, YingLong. YingLong is a non-causal, bidirectional attention encoder-only transformer trained through masked token recovery. We release four foundation models ranging from 6M to 300M parameters, demonstrating superior results in zero-shot tasks.
arXiv Detail & Related papers (2025-05-20T14:31:06Z)
S*: Test Time Scaling for Code Generation [55.11863577956177]
We propose S*, the first hybrid test-time scaling framework for code generation. S* substantially improves the coverage and selection accuracy of generated code.
arXiv Detail & Related papers (2025-02-20T09:18:53Z)
EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking [54.354203142828084]
We present the task of equivalence checking as a new way to evaluate the code reasoning abilities of large language models. We introduce EquiBench, a dataset of 2400 program pairs spanning four programming languages and six equivalence categories. Our evaluation of 17 state-of-the-art LLMs shows that OpenAI o3-mini achieves the highest overall accuracy of 78.0%.
arXiv Detail & Related papers (2025-02-18T02:54:25Z)
Entropy Adaptive Decoding: Dynamic Model Switching for Efficient Inference [0.0]
We present Entropy Adaptive Decoding (EAD), a novel approach for efficient language model inference. EAD switches between different-sized models based on prediction uncertainty. We show remarkable efficiency gains across different model families.
arXiv Detail & Related papers (2025-02-05T22:15:21Z)
Preference Optimization for Reasoning with Pseudo Feedback [100.62603571434167]
We introduce a novel approach to generate pseudo feedback for reasoning tasks by framing the labeling of solutions as an evaluation against associated test cases. We conduct experiments on both mathematical reasoning and coding tasks using pseudo feedback for preference optimization, and observe improvements across both tasks.
arXiv Detail & Related papers (2024-11-25T12:44:02Z)
QLoRA: Efficient Finetuning of Quantized LLMs [66.58009990713134]
We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU. QLoRA backpropagates through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA). Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark.
arXiv Detail & Related papers (2023-05-23T17:50:33Z)
GROOT: Corrective Reward Optimization for Generative Sequential Labeling [10.306943706927004]
We propose GROOT -- a framework for Generative Reward Optimization Of Text sequences. GROOT works by training a generative sequential labeling model to match the decoder output distribution with that of the (black-box) reward function. As demonstrated via extensive experiments on four public benchmarks, GROOT significantly improves all reward metrics.
arXiv Detail & Related papers (2022-09-29T11:35:47Z)
When Liebig's Barrel Meets Facial Landmark Detection: A Practical Model [87.25037167380522]
We propose a model that is accurate, robust, efficient, generalizable, and end-to-end trainable. To achieve better accuracy, we propose two lightweight modules. DQInit dynamically initializes the queries of the decoder from the inputs, enabling the model to achieve accuracy as good as that of models with multiple decoder layers. QAMem is designed to enhance the discriminative ability of queries on low-resolution feature maps by assigning separate memory values to each query rather than a shared one.
arXiv Detail & Related papers (2021-05-27T13:51:42Z)
Out-of-Vocabulary Entities in Link Prediction [1.9036571490366496]
Link prediction is often used as a proxy to evaluate the quality of embeddings. As benchmarks are crucial for the fair comparison of algorithms, ensuring their quality is tantamount to providing a solid ground for developing better solutions. We provide an implementation of an approach for spotting and removing such entities and provide corrected versions of the datasets WN18RR, FB15K-237, and YAGO3-10.
arXiv Detail & Related papers (2021-05-26T12:58:18Z)
Towards More Fine-grained and Reliable NLP Performance Prediction [85.78131503006193]
We make two contributions to improving performance prediction for NLP tasks. First, we examine performance predictors for holistic measures of accuracy like F1 or BLEU. Second, we propose methods to understand the reliability of a performance prediction model from two angles: confidence intervals and calibration.
arXiv Detail & Related papers (2021-02-10T15:23:20Z)
Heuristic Semi-Supervised Learning for Graph Generation Inspired by Electoral College [80.67842220664231]
We propose a novel pre-processing technique, namely ELectoral COllege (ELCO), which automatically expands new nodes and edges to refine the label similarity within a dense subgraph. In all setups tested, our method boosts the average score of base models by a large margin of 4.7 points and consistently outperforms the state-of-the-art.
arXiv Detail & Related papers (2020-06-10T14:48:48Z)

This list is automatically generated from the titles and abstracts of the papers on this site.