Related papers: Synergizing Code Coverage and Gameplay Intent: Coverage-Aware Game Playtesting with LLM-Guided Reinforcement Learning


URL: http://arxiv.org/abs/2512.12706v1
Date: Sun, 14 Dec 2025 14:18:18 GMT
Title: Synergizing Code Coverage and Gameplay Intent: Coverage-Aware Game Playtesting with LLM-Guided Reinforcement Learning
Authors: Enhong Mu, Minami Yoda, Yan Zhang, Mingyue Zhang, Yutaka Matsuno, Jialong Li
Abstract summary: The "Games as a Service" model requires frequent content updates. Code-centric methods focus on structural coverage without understanding gameplay context. We propose SMART, a novel framework that synergizes structural verification and functional validation for game update testing.
Score: 4.3706127838450035
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The widespread adoption of the "Games as a Service" model necessitates frequent content updates, placing immense pressure on quality assurance. In response, automated game testing has been viewed as a promising solution to cope with this demanding release cadence. However, existing automated testing approaches typically create a dichotomy: code-centric methods focus on structural coverage without understanding gameplay context, while player-centric agents validate high-level intent but often fail to cover specific underlying code changes. To bridge this gap, we propose SMART (Structural Mapping for Augmented Reinforcement Testing), a novel framework that synergizes structural verification and functional validation for game update testing. SMART leverages large language models (LLMs) to interpret abstract syntax tree (AST) differences and extract functional intent, constructing a context-aware hybrid reward mechanism. This mechanism guides reinforcement learning agents to sequentially fulfill gameplay goals while adaptively exploring modified code branches. We evaluate SMART on two environments, Overcooked and Minecraft. The results demonstrate that SMART significantly outperforms state-of-the-art baselines; it achieves over 94% branch coverage of modified code, nearly double that of traditional reinforcement learning methods, while maintaining a 98% task completion rate, effectively balancing structural comprehensiveness with functional correctness.

Related papers

RealSec-bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories [58.32028251925354]
Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, but their proficiency in producing secure code remains a critical, under-explored area. We introduce RealSec-bench, a new benchmark for secure code generation meticulously constructed from real-world, high-risk Java repositories.
arXiv Detail & Related papers (2026-01-30T08:29:01Z)
From Completion to Editing: Unlocking Context-Aware Code Infilling via Search-and-Replace Instruction Tuning [81.97788535387286]
We propose a framework that internalizes the agentic verification-and-editing mechanism into a unified, single-pass inference process. With minimal data, SRI-Coder enables Chat models to surpass the completion performance of their Base counterparts. Unlike FIM-style tuning, SRI preserves general coding competencies and maintains inference latency comparable to standard FIM.
arXiv Detail & Related papers (2026-01-19T20:33:53Z)
AJAR: Adaptive Jailbreak Architecture for Red-teaming [1.356919241968803]
AJAR is a proof-of-concept framework designed to bridge the gap between "red-teaming" and "action security". AJAR decouples adversarial logic from the execution loop, encapsulating state-of-the-art algorithms like X-Teaming as standardized, plug-and-play services. AJAR is open-sourced to facilitate the standardized, environment-aware evaluation of this emerging attack surface.
arXiv Detail & Related papers (2026-01-16T03:30:40Z)
SAGE: Semantic-Aware Gray-Box Game Regression Testing with Large Language Models [12.705802209782506]
SAGE is a semantic-aware regression testing framework for gray-box game environments. It addresses the core challenges of test generation, maintenance, and selection. It achieves superior bug detection with significantly lower execution cost, while demonstrating strong adaptability to version updates.
arXiv Detail & Related papers (2025-11-29T17:09:18Z)
Knowledge Graph-enhanced Large Language Model for Incremental Game PlayTesting [10.112811020571774]
This paper proposes a KLPEG framework to conduct precise and efficient testing tailored for incremental game updates. The framework constructs and maintains a Knowledge Graph (KG) to systematically model game elements, task dependencies, and causal relationships. Experiments in two representative game environments, Overcooked and Minecraft, demonstrate that KLPEG can more accurately locate functionalities affected by updates.
arXiv Detail & Related papers (2025-11-04T12:40:46Z)
WebSeer: Training Deeper Search Agents through Reinforcement Learning with Self-Reflection [51.10348385624784]
We present WebSeer, a more intelligent search agent trained via reinforcement learning enhanced with a self-reflection mechanism. Our approach substantially extends tool-use chains and improves answer accuracy.
arXiv Detail & Related papers (2025-10-21T16:52:00Z)
FeatBench: Evaluating Coding Agents on Feature Implementation for Vibe Coding [11.846768103642583]
FeatBench is a novel benchmark for vibe coding that focuses on feature implementation. FeatBench is built on a multi-level filtering pipeline to ensure quality and a fully automated pipeline to evolve the benchmark. Our evaluation reveals that feature implementation within the vibe coding paradigm is a significant challenge, with the highest success rate of only 29.94%.
arXiv Detail & Related papers (2025-09-26T11:47:50Z)
SAEL: Leveraging Large Language Models with Adaptive Mixture-of-Experts for Smart Contract Vulnerability Detection [14.581402965011117]
We propose SAEL, an LLM-based framework for smart contract vulnerability detection. We first design targeted prompts to guide LLMs in identifying vulnerabilities and generating explanations. Next, we apply prompt-tuning on CodeT5 and T5 to process contract code and explanations, enhancing task-specific performance.
arXiv Detail & Related papers (2025-07-30T04:28:00Z)
Repeton: Structured Bug Repair with ReAct-Guided Patch-and-Test Cycles [1.387448620257867]
Large Language Models (LLMs) have shown strong capabilities in code generation and comprehension, yet their application to complex software engineering tasks often suffers from low precision and limited interpretability. We present Repeton, a fully open-source framework that leverages LLMs for precise and automated code manipulation in real-world Git.
arXiv Detail & Related papers (2025-06-09T19:36:40Z)
Training Language Models to Generate Quality Code with Program Analysis Feedback [66.0854002147103]
Code generation with large language models (LLMs) is increasingly adopted in production but fails to ensure code quality. We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code.
arXiv Detail & Related papers (2025-05-28T17:57:47Z)
Semi-supervised Semantic Segmentation with Multi-Constraint Consistency Learning [81.02648336552421]
We propose a Multi-Constraint Consistency Learning approach to facilitate the staged enhancement of the encoder and decoder. Self-adaptive feature masking and noise injection are designed in an instance-specific manner to perturb the features for robust learning of the decoder. Experimental results on Pascal VOC2012 and Cityscapes datasets demonstrate that our proposed MCCL achieves new state-of-the-art performance.
arXiv Detail & Related papers (2025-03-23T03:21:33Z)
SolBench: A Dataset and Benchmark for Evaluating Functional Correctness in Solidity Code Completion and Repair [51.0686873716938]
We introduce SolBench, a benchmark for evaluating the functional correctness of Solidity smart contracts generated by code completion models. We propose a Retrieval-Augmented Code Repair framework to verify functional correctness of smart contracts. Results show that code repair and retrieval techniques effectively enhance the correctness of smart contract completion while reducing computational costs.
arXiv Detail & Related papers (2025-03-03T01:55:20Z)
USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval [115.28586222748478]
Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality. Existing approaches typically suffer from two major limitations.
arXiv Detail & Related papers (2023-01-17T12:42:58Z)
Deep Policy Networks for NPC Behaviors that Adapt to Changing Design Parameters in Roguelike Games [137.86426963572214]
Turn-based strategy games like Roguelikes, for example, present unique challenges to Deep Reinforcement Learning (DRL). We propose two network architectures to better handle complex categorical state spaces and to mitigate the need for retraining forced by design decisions.
arXiv Detail & Related papers (2020-12-07T08:47:25Z)

This list is automatically generated from the titles and abstracts of the papers on this site.