LogiCase: Effective Test Case Generation from Logical Description in Competitive Programming
- URL: http://arxiv.org/abs/2505.15039v1
- Date: Wed, 21 May 2025 02:48:01 GMT
- Title: LogiCase: Effective Test Case Generation from Logical Description in Competitive Programming
- Authors: Sicheol Sung, Aditi, Dogyu Kim, Yo-Sub Han, Sang-Ki Ko
- Abstract summary: We introduce Context-Free Grammars with Counters (CCFGs), a formalism that captures both syntactic and semantic structures in input specifications. Using a fine-tuned CodeT5 model, we translate natural language input specifications into CCFGs, enabling the systematic generation of high-quality test cases.
- Score: 7.726159311658054
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Automated Test Case Generation (ATCG) is crucial for evaluating software reliability, particularly in competitive programming where robust algorithm assessments depend on diverse and accurate test cases. However, existing ATCG methods often fail to meet complex specifications or generate effective corner cases, limiting their utility. In this work, we introduce Context-Free Grammars with Counters (CCFGs), a formalism that captures both syntactic and semantic structures in input specifications. Using a fine-tuned CodeT5 model, we translate natural language input specifications into CCFGs, enabling the systematic generation of high-quality test cases. Experiments on the CodeContests dataset demonstrate that CCFG-based test cases outperform baseline methods in identifying incorrect algorithms, achieving significant gains in validity and effectiveness. Our approach provides a scalable and reliable grammar-driven framework for enhancing automated competitive programming evaluations.
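To make the grammar-with-counters idea concrete, here is a minimal sketch in Python. It assumes a toy specification ("line 1: an integer n with 1 <= n <= 100; line 2: n space-separated integers, each at most 10^9"); the grammar, bounds, and function name are illustrative assumptions, not the paper's exact formalism.

```python
import random

# A minimal sketch of counter-based test case generation. This is NOT the
# paper's exact CCFG formalism; the toy grammar below is an assumption:
#   <case>   -> <n> "\n" <ints_n> "\n"
#   <n>      -> an integer in [1, 100], stored as the counter n
#   <ints_n> -> exactly n integers in [1, 10^9], reusing the stored counter
def generate_test_case(rng: random.Random) -> str:
    counters = {}
    # Derive the counter nonterminal: sample n and remember its value.
    counters["n"] = rng.randint(1, 100)
    line1 = str(counters["n"])
    # Reuse the stored counter so the case is semantically valid:
    # the second line contains exactly n integers.
    line2 = " ".join(str(rng.randint(1, 10**9)) for _ in range(counters["n"]))
    return f"{line1}\n{line2}\n"

if __name__ == "__main__":
    print(generate_test_case(random.Random(0)), end="")
```

Because the counter ties the two lines together, every generated case satisfies a semantic constraint that a plain context-free grammar cannot enforce, and corner cases can be targeted by biasing the counter toward its bounds (e.g., n = 1 or n = 100).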
Related papers
- Specification-Guided Repair of Arithmetic Errors in Dafny Programs using LLMs [84.30534714651093]
We present an innovative APR tool for Dafny, a verification-aware programming language. We localize faults through a series of steps, which include using Hoare Logic to determine the state of each statement within the program. We evaluate our approach using DafnyBench, a benchmark of real-world Dafny programs.
arXiv Detail & Related papers (2025-07-04T15:36:12Z)
- SAGE: Specification-Aware Grammar Extraction for Automated Test Case Generation with LLMs [8.576669309128382]
We explore the use of open-source large language models (LLMs) to induce Context-Free Grammars with Counters (CCFGs). CCFGs represent specifications with logical constraints by storing and reusing counter values during derivation. Our approach achieves stronger generalization and outperforms 17 open and closed-source LLMs in both grammar quality and test effectiveness.
arXiv Detail & Related papers (2025-06-04T08:44:25Z)
- Training Language Models to Generate Quality Code with Program Analysis Feedback [66.0854002147103]
Code generation with large language models (LLMs) is increasingly adopted in production but fails to ensure code quality. We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code.
arXiv Detail & Related papers (2025-05-28T17:57:47Z)
- EquiBench: Benchmarking Large Language Models' Understanding of Program Semantics via Equivalence Checking [55.81461218284736]
EquiBench is a new benchmark for evaluating large language models (LLMs). It determines whether two programs produce identical outputs for all possible inputs. We evaluate 19 state-of-the-art LLMs and find that the best accuracies are 63.8% and 76.2%, only modestly above the 50% random baseline.
arXiv Detail & Related papers (2025-02-18T02:54:25Z)
- StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs [78.84060166851805]
StructTest is a novel benchmark that evaluates large language models (LLMs) on their ability to follow compositional instructions and generate structured outputs. Assessments are conducted deterministically using a rule-based evaluator, which can be easily extended to new tasks and datasets. We demonstrate that StructTest remains challenging even for top-performing models like DeepSeek-V3/R1 and GPT-4o.
arXiv Detail & Related papers (2024-12-23T22:08:40Z)
- CodeDPO: Aligning Code Models with Self Generated and Verified Source Code [52.70310361822519]
We propose CodeDPO, a framework that integrates preference learning into code generation to improve two key code preference factors: code correctness and efficiency.
CodeDPO employs a novel dataset construction method, utilizing a self-generation-and-validation mechanism that simultaneously generates and evaluates code and test cases.
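The self-generation-and-validation mechanism can be illustrated with a small cross-scoring sketch; the scoring rule, candidate functions, and the name `cross_score` below are illustrative assumptions, not CodeDPO's exact procedure:

```python
from typing import Callable, List, Tuple

# A hedged sketch of mutual validation between candidate solutions and
# generated tests (illustrative, not CodeDPO's exact algorithm): a solution
# is preferred when it passes more of the generated tests.
def cross_score(
    solutions: List[Callable[[int], int]],
    tests: List[Tuple[int, int]],  # (input, expected_output) pairs
) -> List[float]:
    scores = []
    for solve in solutions:
        passed = 0
        for x, expected in tests:
            try:
                if solve(x) == expected:
                    passed += 1
            except Exception:
                pass  # a crashing candidate simply fails that test
        scores.append(passed / len(tests) if tests else 0.0)
    return scores

# Usage: rank a correct and a buggy "square" candidate by generated tests.
if __name__ == "__main__":
    candidates = [lambda x: x * x, lambda x: x + 1]  # correct vs. buggy
    generated_tests = [(2, 4), (3, 9), (0, 0)]
    print(cross_score(candidates, generated_tests))  # [1.0, 0.0]
```

High-scoring and low-scoring candidates can then form the preference pairs used for preference learning over code correctness.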
arXiv Detail & Related papers (2024-10-08T01:36:15Z)
- Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z)
- TestBench: Evaluating Class-Level Test Case Generation Capability of Large Language Models [8.22619177301814]
We introduce TestBench, a benchmark for class-level LLM-based test case generation.
We construct a dataset of 108 Java programs from 9 real-world, large-scale projects on GitHub.
We propose a fine-grained evaluation framework that considers five aspects of test cases: syntactic correctness, compilation correctness, test correctness, code coverage rate, and defect detection rate.
arXiv Detail & Related papers (2024-09-26T06:18:06Z)
- Automatic Generation of Behavioral Test Cases For Natural Language Processing Using Clustering and Prompting [6.938766764201549]
This paper introduces an automated approach to develop test cases by exploiting the power of large language models and statistical techniques.
We analyze the behavioral test profiles across four different classification algorithms and discuss the limitations and strengths of those models.
arXiv Detail & Related papers (2024-07-31T21:12:21Z)
- CELA: Cost-Efficient Language Model Alignment for CTR Prediction [70.65910069412944]
Click-Through Rate (CTR) prediction holds a paramount position in recommender systems. Recent efforts have sought to mitigate these challenges by integrating Pre-trained Language Models (PLMs). We propose Cost-Efficient Language Model Alignment (CELA) for CTR prediction.
arXiv Detail & Related papers (2024-05-17T07:43:25Z)
- Fuzzy Inference System for Test Case Prioritization in Software Testing [0.0]
Test case prioritization (TCP) is a vital strategy to enhance testing efficiency.
This paper introduces a novel fuzzy logic-based approach to automate TCP.
arXiv Detail & Related papers (2024-04-25T08:08:54Z)
- Automated Test Case Repair Using Language Models [0.5708902722746041]
Unrepaired broken test cases can degrade test suite quality and disrupt the software development process. We present TaRGET, a novel approach leveraging pre-trained code language models for automated test case repair. TaRGET treats test repair as a language translation task, employing a two-step process to fine-tune a language model.
arXiv Detail & Related papers (2024-01-12T18:56:57Z)