UniCode: A Framework for Generating High Quality Competitive Coding Problems
- URL: http://arxiv.org/abs/2510.17868v1
- Date: Thu, 16 Oct 2025 05:07:12 GMT
- Title: UniCode: A Framework for Generating High Quality Competitive Coding Problems
- Authors: Xinyue Zheng, Haowei Lin, Shaofei Cai, Zilong Zheng, Yitao Liang,
- Abstract summary: UniCode is a novel framework that automatically generates high-quality algorithmic problems alongside robust, contamination-resistant test cases.<n>We show that UniCode is highly challenging and discriminative, with the top-performing model, o4-mini, achieving a pass rate of only 70.3%.
- Score: 41.66698149759178
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The reliance of competitive coding benchmarks on static, human-authored problems creates significant challenges, including data contamination and limited scalability. To address these issues, we introduce UniCode, a novel framework that automatically generates high-quality algorithmic problems alongside robust, contamination-resistant test cases. Inspired by biological evolution that creates better and diverse offspring, our framework leverages Large Language Models (LLMs) to systematically diversify problems through three strategies: single problem extension, same-type fusion, and cross-type fusion. A key innovation is our stress-driven test case synthesis pipeline, which generates reliable test suites without requiring a canonical ground-truth solution. This pipeline combines brute-force grounding for small-scale inputs with a consensus-based validation mechanism for large-scale inputs to ensure high correctness and coverage. We demonstrate effectiveness of our framework by curating a benchmark of 492 problems and evaluating 19 state-of-the-art LLMs. The results reveal that UniCode is highly challenging and discriminative, with the top-performing model, o4-mini, achieving a pass rate of only 70.3%. Our framework provides a scalable and reliable solution for generating dynamic evaluation datasets in coding domain.
Related papers
- Scaling Agentic Verifier for Competitive Coding [66.11758166379092]
Large language models (LLMs) have demonstrated strong coding capabilities but still struggle to solve competitive programming problems correctly in a single attempt.<n>Execution-based re-ranking offers a promising test-time scaling strategy, yet existing methods are constrained by either difficult test case generation or inefficient random input sampling.<n>We propose Agentic Verifier, an execution-based agent that actively reasons about program behaviors and searches for highly discriminative test inputs.
arXiv Detail & Related papers (2026-02-04T06:30:40Z) - BOSQTGEN: Breaking the Sound Barrier in Test Generation [3.052470294814771]
We introduce BOSQTGEN, a novel black-box and tool for API test generation.<n> BOSQTGEN utilizes a novel approach for decomposing API specifications into primitives, using LLMs to suggest coherent interactions for them, and employing testing to efficiently sample over these values.<n>The resulting BOSQTGEN system achieves an average of 82% of critical code coverage on benchmarks, often a 20% or more increase over prior state-of-the-art systems.
arXiv Detail & Related papers (2025-10-22T17:11:30Z) - QueST: Incentivizing LLMs to Generate Difficult Problems [77.75835742350644]
Large Language Models have achieved strong performance on reasoning tasks, solving competition-level coding and math problems.<n>Existing competitive coding datasets contain only thousands to tens of thousands of problems.<n>We propose QueST, a novel framework which combines difficulty-aware graph sampling and difficulty-aware rejection fine-tuning.
arXiv Detail & Related papers (2025-10-20T16:29:53Z) - An Agentic Framework with LLMs for Solving Complex Vehicle Routing Problems [66.60904891478687]
We propose an Agentic Framework with LLMs (AFL) for solving complex vehicle routing problems.<n>AFL directly extracts knowledge from raw inputs and enables self-contained code generation.<n>We show that AFL substantially outperforms existing LLM-based baselines in both code reliability and solution feasibility.
arXiv Detail & Related papers (2025-10-19T03:59:25Z) - AutoCode: LLMs as Problem Setters for Competitive Programming [94.71566758494787]
We introduce AutoCode, which uses multiple rounds of validation to yield competition-grade problem statements and test cases.<n>On held-out problems, AutoCode test suites approach 99% consistency with official judgments.
arXiv Detail & Related papers (2025-09-29T17:59:03Z) - AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions [37.21656149034477]
Competitive programming has emerged as a critical benchmark for evaluating the reasoning and coding capabilities of Large Language Models (LLMs)<n>We argue that current evaluations overstate model proficiency, masking a substantial gap between LLMs and elite human programmers.<n>We present AetherCode, a new benchmark that draws problems from premier programming competitions such as IOI and I CPC.
arXiv Detail & Related papers (2025-08-22T14:04:55Z) - rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset [13.309261291558146]
rStar-Coder is a large-scale, verified dataset of 418K code problems, 580K long-reasoning solutions, and rich test cases of varying difficulty.<n>On LiveCodeBench, rStar-Coder improves Qwen2.5-7B from 17.4% to an impressive 57.3%, and Qwen2.5-14B from 23.3% to 62.5%, surpassing o3-mini (low) by3.1%.
arXiv Detail & Related papers (2025-05-27T15:00:57Z) - KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding [49.56049319037421]
KodCode is a synthetic dataset that addresses the persistent challenge of acquiring high-quality, verifiable training data.<n>It comprises question-solution-test triplets that are systematically validated via a self-verification procedure.<n>This pipeline yields a large-scale, robust and diverse coding dataset.
arXiv Detail & Related papers (2025-03-04T19:17:36Z) - How to Get Your LLM to Generate Challenging Problems for Evaluation [33.625052642068624]
CHASE is a unified framework to synthetically generate challenging problems using Large Language Models.<n>We implement CHASE to create evaluation benchmarks across three diverse domains.<n>The performance of state-of-the-art LLMs on these synthetic benchmarks lies in the range of 40-60% accuracy.
arXiv Detail & Related papers (2025-02-20T16:09:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.