Vibe Checker: Aligning Code Evaluation with Human Preference
- URL: http://arxiv.org/abs/2510.07315v1
- Date: Wed, 08 Oct 2025 17:59:19 GMT
- Title: Vibe Checker: Aligning Code Evaluation with Human Preference
- Authors: Ming Zhong, Xiang Zhou, Ting-Yun Chang, Qingze Wang, Nan Xu, Xiance Si, Dan Garrette, Shyam Upadhyay, Jeremiah Liu, Jiawei Han, Benoit Schillings, Jiao Sun
- Abstract summary: We present VeriCode, a taxonomy of 30 verifiable code instructions together with corresponding deterministic verifiers. We show that even the strongest models struggle to comply with multiple instructions and exhibit clear functional regression. Our work identifies core factors of the vibe check, providing a concrete path for benchmarking and developing models.
- Score: 35.939058895669895
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have catalyzed vibe coding, where users leverage LLMs to generate and iteratively refine code through natural language interactions until it passes their vibe check. Vibe check is tied to real-world human preference and goes beyond functionality: the solution should feel right, read cleanly, preserve intent, and remain correct. However, current code evaluation remains anchored to pass@k and captures only functional correctness, overlooking the non-functional instructions that users routinely apply. In this paper, we hypothesize that instruction following is the missing piece underlying vibe check that represents human preference in coding besides functional correctness. To quantify models' code instruction following capabilities with measurable signals, we present VeriCode, a taxonomy of 30 verifiable code instructions together with corresponding deterministic verifiers. We use the taxonomy to augment established evaluation suites, resulting in Vibe Checker, a testbed to assess both code instruction following and functional correctness. Upon evaluating 31 leading LLMs, we show that even the strongest models struggle to comply with multiple instructions and exhibit clear functional regression. Most importantly, a composite score of functional correctness and instruction following correlates the best with human preference, with the latter emerging as the primary differentiator on real-world programming tasks. Our work identifies core factors of the vibe check, providing a concrete path for benchmarking and developing models that better align with user preferences in coding.
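The abstract describes deterministic verifiers for non-functional instructions and a composite score over functional correctness and instruction following, but shows no implementation. Below is a minimal sketch of what one such verifier and composite might look like, assuming a hypothetical instruction ("all function names use snake_case") and an assumed mixing weight `alpha`; neither reflects the paper's actual taxonomy or formula.

```python
import ast
import re

SNAKE_CASE = re.compile(r"^[a-z_][a-z0-9_]*$")

def verify_snake_case_functions(source: str) -> bool:
    """Deterministic verifier for a hypothetical instruction:
    every function definition must use a snake_case name."""
    tree = ast.parse(source)
    return all(
        SNAKE_CASE.match(node.name) is not None
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    )

def composite_score(pass_rate: float, if_rate: float, alpha: float = 0.5) -> float:
    """Blend functional correctness (e.g., test pass rate) with the
    fraction of instruction verifiers that pass. alpha is an assumed
    weight; the paper only reports that a composite of the two signals
    correlates best with human preference."""
    return alpha * pass_rate + (1 - alpha) * if_rate

# A completion that passes its tests but violates the naming instruction.
candidate = "def ParseInput(x):\n    return x.strip()\n"
print(verify_snake_case_functions(candidate))       # False
print(composite_score(pass_rate=1.0, if_rate=0.0))  # 0.5
```

Under this shape, each of the taxonomy's 30 instructions would pair with its own deterministic checker, and instruction following becomes the fraction of applicable verifiers that pass.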
Related papers
- CodeCircuit: Toward Inferring LLM-Generated Code Correctness via Attribution Graphs [13.488544043942495]
We investigate whether the model's neural dynamics encode internally decodable signals that are predictive of logical validity during code generation. By decomposing complex residual flows, we identify the structural signatures that distinguish sound reasoning from logical failure. Analysis across Python, C++, and Java confirms that intrinsic correctness signals are robust across diverse syntaxes.
arXiv Detail & Related papers (2026-02-06T03:49:15Z)
- Do Large Language Models Respect Contracts? Evaluating and Enforcing Contract-Adherence in Code Generation [11.445615378917578]
PACT is a program assessment and contract-adherence evaluation framework. It provides a comprehensive test-suite corpus focused on contract violations. It enables a systematic analysis of code generation under varied prompting conditions.
arXiv Detail & Related papers (2025-10-14T01:12:37Z)
- CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types. We also introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources and augmented through manual analysis of meta-error patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z)
- SolBench: A Dataset and Benchmark for Evaluating Functional Correctness in Solidity Code Completion and Repair [51.0686873716938]
We introduce SolBench, a benchmark for evaluating the functional correctness of Solidity smart contracts generated by code completion models. We also propose a Retrieval-Augmented Code Repair framework to verify the functional correctness of smart contracts. Results show that code repair and retrieval techniques effectively enhance the correctness of smart contract completion while reducing computational costs.
arXiv Detail & Related papers (2025-03-03T01:55:20Z)
- Learning Code Preference via Synthetic Evolution [20.897742297490275]
We propose CodeFavor, a framework for training pairwise code preference models from synthetic evolution data.
Our evaluation shows that CodeFavor holistically improves the accuracy of model-based code preferences by up to 28.8%.
CodeFavor models can match the performance of models with 6-9x more parameters while being 34x more cost-effective.
arXiv Detail & Related papers (2024-10-04T18:05:22Z)
- On the Limitations of Embedding Based Methods for Measuring Functional Correctness for Code Generation [4.065344017083881]
We analyze the ability of embedding-based metrics like CodeBERTScore to measure functional correctness and other helpful constructs like editing effort.
Our results show that while they have a weak correlation with functional correctness (0.16), they are strongly correlated (0.72) with editing effort.
arXiv Detail & Related papers (2024-04-26T15:54:39Z)
- NoFunEval: Funny How Code LMs Falter on Requirements Beyond Functional Correctness [10.502272765892908]
Existing evaluation benchmarks of language models of code (code LMs) focus almost exclusively on whether the LMs can generate functionally-correct code.
We propose NoFunEval, a new benchmark that evaluates code LMs on non-functional requirements, along with simple classification instances covering both functional and non-functional requirements.
Our finding is that LMs generally falter when tested on our benchmark, hinting at fundamental blindspots in their training setups.
arXiv Detail & Related papers (2024-01-29T08:47:31Z)
- Enriching Source Code with Contextual Data for Code Completion Models: An Empirical Study [4.438873396405334]
We aim to answer whether making code easier to understand by adding contextual data improves the performance of pre-trained code language models on the task of code completion.
For comments, we find that the models perform better in the presence of multi-line comments.
arXiv Detail & Related papers (2023-04-24T17:09:14Z)
- ReCode: Robustness Evaluation of Code Generation Models [90.10436771217243]
We propose ReCode, a comprehensive robustness evaluation benchmark for code generation models.
We customize over 30 transformations specifically for code on docstrings, function and variable names, code syntax, and code format.
With human annotators, we verified that over 90% of the perturbed prompts do not alter the semantic meaning of the original prompt.
arXiv Detail & Related papers (2022-12-20T14:11:31Z)
- Interactive Code Generation via Test-Driven User-Intent Formalization [60.90035204567797]
Large language models (LLMs) produce code from informal natural language (NL) intent.
It is hard to define a notion of correctness since natural language can be ambiguous and lacks a formal semantics.
We describe a language-agnostic abstract algorithm and a concrete implementation TiCoder.
arXiv Detail & Related papers (2022-08-11T17:41:08Z)
- CodeRetriever: Unimodal and Bimodal Contrastive Learning [128.06072658302165]
We propose the CodeRetriever model, which combines the unimodal and bimodal contrastive learning to train function-level code semantic representations.
For unimodal contrastive learning, we design a semantic-guided method to build positive code pairs based on the documentation and function name.
For bimodal contrastive learning, we leverage the documentation and in-line comments of code to build text-code pairs.
arXiv Detail & Related papers (2022-01-26T10:54:30Z)
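CodeRetriever's contrastive objectives are described above only at a high level. Below is a minimal sketch of a generic symmetric InfoNCE loss over a batch of (text, code) embedding pairs, assuming PyTorch; the batch size, embedding dimension, and temperature are placeholder assumptions, and matched in-batch rows only approximate the paper's positive pairs built from documentation, function names, and in-line comments.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(text_emb: torch.Tensor, code_emb: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of (text, code) pairs: the matched
    row is the positive, every other in-batch row is a negative."""
    text_emb = F.normalize(text_emb, dim=-1)
    code_emb = F.normalize(code_emb, dim=-1)
    logits = text_emb @ code_emb.T / temperature  # (B, B) cosine similarities
    targets = torch.arange(logits.size(0))        # diagonal entries are positives
    # Contrast in both directions: text -> code and code -> text.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

# Usage with random tensors standing in for encoder outputs.
text = torch.randn(8, 256)  # e.g., documentation / comment embeddings
code = torch.randn(8, 256)  # e.g., function body embeddings
print(info_nce_loss(text, code).item())
```

A unimodal variant would apply the same loss to code-code pairs, with positives built from shared documentation and function names as the entry above describes.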