CodeHacker: Automated Test Case Generation for Detecting Vulnerabilities in Competitive Programming Solutions
- URL: http://arxiv.org/abs/2602.20213v1
- Date: Mon, 23 Feb 2026 05:59:30 GMT
- Title: CodeHacker: Automated Test Case Generation for Detecting Vulnerabilities in Competitive Programming Solutions
- Authors: Jingwei Shi, Xinxiang Yin, Jing Huang, Jinman Zhao, Shengyu Tao,
- Abstract summary: Existing benchmarks often lack coverage for subtle corner cases, allowing incorrect solutions to pass.<n>CodeHacker generates adversarial test cases that expose latent vulnerabilities in program submissions.<n>Experiments demonstrate that CodeHacker significantly improves the True Negative Rate (TNR) of existing datasets.
- Score: 8.163435280190027
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The evaluation of Large Language Models (LLMs) for code generation relies heavily on the quality and robustness of test cases. However, existing benchmarks often lack coverage for subtle corner cases, allowing incorrect solutions to pass. To bridge this gap, we propose CodeHacker, an automated agent framework dedicated to generating targeted adversarial test cases that expose latent vulnerabilities in program submissions. Mimicking the hack mechanism in competitive programming, CodeHacker employs a multi-strategy approach, including stress testing, anti-hash attacks, and logic-specific targeting to break specific code submissions. To ensure the validity and reliability of these attacks, we introduce a Calibration Phase, where the agent iteratively refines its own Validator and Checker via self-generated adversarial probes before evaluating contestant code.Experiments demonstrate that CodeHacker significantly improves the True Negative Rate (TNR) of existing datasets, effectively filtering out incorrect solutions that were previously accepted. Furthermore, generated adversarial cases prove to be superior training data, boosting the performance of RL-trained models on benchmarks like LiveCodeBench.
Related papers
- Scaling Agentic Verifier for Competitive Coding [66.11758166379092]
Large language models (LLMs) have demonstrated strong coding capabilities but still struggle to solve competitive programming problems correctly in a single attempt.<n>Execution-based re-ranking offers a promising test-time scaling strategy, yet existing methods are constrained by either difficult test case generation or inefficient random input sampling.<n>We propose Agentic Verifier, an execution-based agent that actively reasons about program behaviors and searches for highly discriminative test inputs.
arXiv Detail & Related papers (2026-02-04T06:30:40Z) - ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack [52.17935054046577]
We present ReasAlign, a model-level solution to improve safety alignment against indirect prompt injection attacks.<n>ReasAlign incorporates structured reasoning steps to analyze user queries, detect conflicting instructions, and preserve the continuity of the user's intended tasks.
arXiv Detail & Related papers (2026-01-15T08:23:38Z) - Klear-CodeTest: Scalable Test Case Generation for Code Reinforcement Learning [43.30900834053253]
We present Klear-CodeTest, a comprehensive test case synthesis framework featuring rigorous verification to ensure quality and reliability of test cases.<n>The proposed G-V framework generates comprehensive test cases including both regular and corner cases, enhancing test coverage and discriminative power for solution correctness assessment.<n>In addition, we design a multi-layered security sandbox system optimized for online verification platforms, guaranteeing safe and reliable code execution.
arXiv Detail & Related papers (2025-08-07T07:36:01Z) - Training Language Models to Generate Quality Code with Program Analysis Feedback [66.0854002147103]
Code generation with large language models (LLMs) is increasingly adopted in production but fails to ensure code quality.<n>We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code.
arXiv Detail & Related papers (2025-05-28T17:57:47Z) - LogiCase: Effective Test Case Generation from Logical Description in Competitive Programming [7.726159311658054]
We introduce Context-Free Grammars with Counters (CCFGs), a formalism that captures both syntactic and semantic structures in input specifications.<n>Using a fine-tuned CodeT5 model, we translate natural language input specifications into CCFGs, enabling the systematic generation of high-quality test cases.
arXiv Detail & Related papers (2025-05-21T02:48:01Z) - CodeDPO: Aligning Code Models with Self Generated and Verified Source Code [52.70310361822519]
We propose CodeDPO, a framework that integrates preference learning into code generation to improve two key code preference factors: code correctness and efficiency.<n>CodeDPO employs a novel dataset construction method, utilizing a self-generation-and-validation mechanism that simultaneously generates and evaluates code and test cases.
arXiv Detail & Related papers (2024-10-08T01:36:15Z) - SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents [10.730852617039451]
We investigate the capability of LLM-based Code Agents to formalize user issues into test cases.<n>We propose a novel benchmark based on popular GitHub repositories, containing real-world issues, ground-truth bug-fixes, and golden tests.<n>We find that LLMs generally perform surprisingly well at generating relevant test cases, with Code Agents designed for code repair exceeding the performance of systems designed for test generation.
arXiv Detail & Related papers (2024-06-18T14:54:37Z) - An LLM-Assisted Easy-to-Trigger Backdoor Attack on Code Completion Models: Injecting Disguised Vulnerabilities against Strong Detection [17.948513691133037]
We introduce CodeBreaker, a pioneering LLM-assisted backdoor attack framework on code completion models.
By integrating malicious payloads directly into the source code with minimal transformation, CodeBreaker challenges current security measures.
arXiv Detail & Related papers (2024-06-10T22:10:05Z) - CodeLMSec Benchmark: Systematically Evaluating and Finding Security
Vulnerabilities in Black-Box Code Language Models [58.27254444280376]
Large language models (LLMs) for automatic code generation have achieved breakthroughs in several programming tasks.
Training data for these models is usually collected from the Internet (e.g., from open-source repositories) and is likely to contain faults and security vulnerabilities.
This unsanitized training data can cause the language models to learn these vulnerabilities and propagate them during the code generation procedure.
arXiv Detail & Related papers (2023-02-08T11:54:07Z) - Certified Robustness to Label-Flipping Attacks via Randomized Smoothing [105.91827623768724]
Machine learning algorithms are susceptible to data poisoning attacks.
We present a unifying view of randomized smoothing over arbitrary functions.
We propose a new strategy for building classifiers that are pointwise-certifiably robust to general data poisoning attacks.
arXiv Detail & Related papers (2020-02-07T21:28:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.