Related papers: UniCode: A Framework for Generating High Quality Competitive Coding Problems

URL: http://arxiv.org/abs/2510.17868v1
Date: Thu, 16 Oct 2025 05:07:12 GMT
Title: UniCode: A Framework for Generating High Quality Competitive Coding Problems
Authors: Xinyue Zheng, Haowei Lin, Shaofei Cai, Zilong Zheng, Yitao Liang
Abstract summary: UniCode is a novel framework that automatically generates high-quality algorithmic problems alongside robust, contamination-resistant test cases. We show that UniCode is highly challenging and discriminative, with the top-performing model, o4-mini, achieving a pass rate of only 70.3%.
Score: 41.66698149759178
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The reliance of competitive coding benchmarks on static, human-authored problems creates significant challenges, including data contamination and limited scalability. To address these issues, we introduce UniCode, a novel framework that automatically generates high-quality algorithmic problems alongside robust, contamination-resistant test cases. Inspired by biological evolution that creates better and diverse offspring, our framework leverages Large Language Models (LLMs) to systematically diversify problems through three strategies: single problem extension, same-type fusion, and cross-type fusion. A key innovation is our stress-driven test case synthesis pipeline, which generates reliable test suites without requiring a canonical ground-truth solution. This pipeline combines brute-force grounding for small-scale inputs with a consensus-based validation mechanism for large-scale inputs to ensure high correctness and coverage. We demonstrate the effectiveness of our framework by curating a benchmark of 492 problems and evaluating 19 state-of-the-art LLMs. The results reveal that UniCode is highly challenging and discriminative, with the top-performing model, o4-mini, achieving a pass rate of only 70.3%. Our framework provides a scalable and reliable solution for generating dynamic evaluation datasets in the coding domain.
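The two-stage labeling idea in the abstract (brute-force grounding for small inputs, consensus among solutions for large inputs) can be illustrated with a minimal sketch. This is not the authors' pipeline: the example problem (maximum subarray sum), the function names, the candidate solutions, and the 0.6 agreement threshold are all assumptions chosen for illustration.

```python
from collections import Counter

def brute_force(nums):
    """Exhaustive reference; only practical for small inputs."""
    n = len(nums)
    return max(sum(nums[i:j]) for i in range(n) for j in range(i + 1, n + 1))

def kadane(nums):
    """Candidate 1: linear-time scan (correct)."""
    best = cur = nums[0]
    for x in nums[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best

def prefix_min(nums):
    """Candidate 2: prefix sums with a running minimum (correct)."""
    best, prefix, min_prefix = nums[0], 0, 0
    for x in nums:
        prefix += x
        best = max(best, prefix - min_prefix)
        min_prefix = min(min_prefix, prefix)
    return best

def buggy(nums):
    """Candidate 3: deliberately wrong, returns 0 on all-negative input."""
    best = cur = 0
    for x in nums:
        cur = max(0, cur + x)
        best = max(best, cur)
    return best

def label_small(nums):
    # Small inputs: the brute-force output is taken as the expected answer.
    return brute_force(nums)

def label_large(nums, candidates, min_agreement=0.6):
    # Large inputs: accept the majority output only if enough candidates
    # agree (hypothetical threshold); otherwise discard the test case.
    outputs = [solve(nums) for solve in candidates]
    answer, votes = Counter(outputs).most_common(1)[0]
    return answer if votes / len(outputs) >= min_agreement else None

if __name__ == "__main__":
    candidates = [kadane, prefix_min, buggy]
    print(label_small([1, -2, 3, 4]))
    print(label_large([-5] * 50 + [-1], candidates))
```

In this sketch a test case is kept only when it has a label: small cases always get one from the exhaustive solver, while large cases without sufficient agreement return None and would be dropped.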

Related papers

Scaling Agentic Verifier for Competitive Coding [66.11758166379092]
Large language models (LLMs) have demonstrated strong coding capabilities but still struggle to solve competitive programming problems correctly in a single attempt. Execution-based re-ranking offers a promising test-time scaling strategy, yet existing methods are constrained by either difficult test case generation or inefficient random input sampling. We propose Agentic Verifier, an execution-based agent that actively reasons about program behaviors and searches for highly discriminative test inputs.
arXiv Detail & Related papers (2026-02-04T06:30:40Z)
BOSQTGEN: Breaking the Sound Barrier in Test Generation [3.052470294814771]
We introduce BOSQTGEN, a novel black-box tool for API test generation. BOSQTGEN utilizes a novel approach for decomposing API specifications into primitives, using LLMs to suggest coherent interactions for them, and employing testing to efficiently sample over these values. The resulting BOSQTGEN system achieves an average of 82% of critical code coverage on benchmarks, often a 20% or more increase over prior state-of-the-art systems.
arXiv Detail & Related papers (2025-10-22T17:11:30Z)
QueST: Incentivizing LLMs to Generate Difficult Problems [77.75835742350644]
Large Language Models have achieved strong performance on reasoning tasks, solving competition-level coding and math problems. Existing competitive coding datasets contain only thousands to tens of thousands of problems. We propose QueST, a novel framework which combines difficulty-aware graph sampling and difficulty-aware rejection fine-tuning.
arXiv Detail & Related papers (2025-10-20T16:29:53Z)
An Agentic Framework with LLMs for Solving Complex Vehicle Routing Problems [66.60904891478687]
We propose an Agentic Framework with LLMs (AFL) for solving complex vehicle routing problems. AFL directly extracts knowledge from raw inputs and enables self-contained code generation. We show that AFL substantially outperforms existing LLM-based baselines in both code reliability and solution feasibility.
arXiv Detail & Related papers (2025-10-19T03:59:25Z)
AutoCode: LLMs as Problem Setters for Competitive Programming [94.71566758494787]
We introduce AutoCode, which uses multiple rounds of validation to yield competition-grade problem statements and test cases. On held-out problems, AutoCode test suites approach 99% consistency with official judgments.
arXiv Detail & Related papers (2025-09-29T17:59:03Z)
AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions [37.21656149034477]
Competitive programming has emerged as a critical benchmark for evaluating the reasoning and coding capabilities of Large Language Models (LLMs). We argue that current evaluations overstate model proficiency, masking a substantial gap between LLMs and elite human programmers. We present AetherCode, a new benchmark that draws problems from premier programming competitions such as IOI and ICPC.
arXiv Detail & Related papers (2025-08-22T14:04:55Z)
rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset [13.309261291558146]
rStar-Coder is a large-scale, verified dataset of 418K code problems, 580K long-reasoning solutions, and rich test cases of varying difficulty. On LiveCodeBench, rStar-Coder improves Qwen2.5-7B from 17.4% to an impressive 57.3%, and Qwen2.5-14B from 23.3% to 62.5%, surpassing o3-mini (low) by 3.1%.
arXiv Detail & Related papers (2025-05-27T15:00:57Z)
KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding [49.56049319037421]
KodCode is a synthetic dataset that addresses the persistent challenge of acquiring high-quality, verifiable training data. It comprises question-solution-test triplets that are systematically validated via a self-verification procedure. This pipeline yields a large-scale, robust and diverse coding dataset.
arXiv Detail & Related papers (2025-03-04T19:17:36Z)
How to Get Your LLM to Generate Challenging Problems for Evaluation [33.625052642068624]
CHASE is a unified framework to synthetically generate challenging problems using Large Language Models. We implement CHASE to create evaluation benchmarks across three diverse domains. The performance of state-of-the-art LLMs on these synthetic benchmarks lies in the range of 40-60% accuracy.
arXiv Detail & Related papers (2025-02-20T16:09:55Z)

This list is automatically generated from the titles and abstracts of the papers on this site.