The Double Life of Code World Models: Provably Unmasking Malicious Behavior Through Execution Traces
- URL: http://arxiv.org/abs/2512.13821v1
- Date: Mon, 15 Dec 2025 19:05:37 GMT
- Title: The Double Life of Code World Models: Provably Unmasking Malicious Behavior Through Execution Traces
- Authors: Subramanyam Sahoo, Jared Junkin
- Abstract summary: Large language models (LLMs) increasingly generate code with minimal human oversight. We present a novel AI control framework that verifies untrusted code-generating models through semantic analysis.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) increasingly generate code with minimal human oversight, raising critical concerns about backdoor injection and malicious behavior. We present Cross-Trace Verification Protocol (CTVP), a novel AI control framework that verifies untrusted code-generating models through semantic orbit analysis. Rather than directly executing potentially malicious code, CTVP leverages the model's own predictions of execution traces across semantically equivalent program transformations. By analyzing consistency patterns in these predicted traces, we detect behavioral anomalies indicative of backdoors. Our approach introduces the Adversarial Robustness Quotient (ARQ), which quantifies the computational cost of verification relative to baseline generation, demonstrating exponential growth with orbit size. Theoretical analysis establishes information-theoretic bounds showing non-gamifiability -- adversaries cannot improve through training due to fundamental space complexity constraints. This work demonstrates that semantic orbit analysis provides a scalable, theoretically grounded approach to AI control for code generation tasks.
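The consistency check described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `predict_trace` is a hypothetical stand-in for querying the untrusted model for a predicted execution trace, the exact-match comparison and the `threshold` value are assumptions made for illustration, and the paper may use a different similarity measure and decision rule.

```python
from itertools import combinations
from typing import Callable, Sequence


def trace_consistency(traces: Sequence[Sequence[str]]) -> float:
    """Return the fraction of variant pairs whose predicted traces match exactly.

    Exact matching is an illustrative simplification, not the paper's metric.
    """
    pairs = list(combinations(traces, 2))
    if not pairs:
        # Zero or one trace: there is nothing to disagree with.
        return 1.0
    agreeing = sum(1 for a, b in pairs if list(a) == list(b))
    return agreeing / len(pairs)


def flag_anomalous(
    variants: Sequence[str],
    predict_trace: Callable[[str], Sequence[str]],  # hypothetical model query
    threshold: float = 0.8,  # assumed cutoff, not taken from the paper
) -> bool:
    """Flag a program when predicted traces disagree across its variants.

    `variants` are semantically equivalent rewrites of one program (its
    "semantic orbit"); no variant is executed here, only predicted.
    """
    traces = [predict_trace(variant) for variant in variants]
    return trace_consistency(traces) < threshold
```

Under these assumptions, a program whose equivalent rewrites all receive the same predicted trace scores 1.0 and passes, while a program for which one rewrite receives a diverging prediction scores lower and is flagged for review.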
Related papers
- Beyond Input Guardrails: Reconstructing Cross-Agent Semantic Flows for Execution-Aware Attack Detection [32.301679396929536]
We propose SysName, a framework that shifts the defensive paradigm from static input filtering to execution-aware analysis. SysName synthesizes fragmented operational primitives into contiguous behavioral trajectories, enabling a holistic view of system activity. Empirical evaluations demonstrate that SysName effectively detects over ten distinct compound attack vectors, achieving F1-scores of 85.3% and 66.7% for node-level and path-level end-to-end attack detection, respectively.
arXiv Detail & Related papers (2026-03-04T01:59:16Z)
- TraceGuard: Process-Guided Firewall against Reasoning Backdoors in Large Language Models [19.148124494194317]
We propose TraceGuard, a process-guided security framework that transforms small-scale models into robust reasoning firewalls. Our approach treats the reasoning trace as an untrusted payload and establishes a defense-in-depth strategy. We demonstrate robustness against adaptive adversaries in a grey-box setting, establishing TraceGuard as a viable, low-latency security primitive.
arXiv Detail & Related papers (2026-03-02T22:19:13Z)
- CodeCircuit: Toward Inferring LLM-Generated Code Correctness via Attribution Graphs [13.488544043942495]
We aim to investigate whether the model's neural dynamics encode internally decodable signals that are predictive of logical validity during code generation. By decomposing complex residual flows, we aim to identify the structural signatures that distinguish sound reasoning from logical failure. Analysis across Python, C++, and Java confirms that intrinsic correctness signals are robust across diverse syntaxes.
arXiv Detail & Related papers (2026-02-06T03:49:15Z)
- Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process [66.38541693477181]
We propose an unsupervised framework for discovering reasoning vectors, which we define as directions in the activation space that encode distinct reasoning behaviors. By segmenting chain-of-thought traces into sentence-level 'steps', we uncover disentangled features corresponding to interpretable behaviors such as reflection and backtracking. We demonstrate the ability to control response confidence by identifying confidence-related vectors in the SAE decoder space.
arXiv Detail & Related papers (2025-12-30T05:09:11Z)
- Taming Imperfect Process Verifiers: A Sampling Perspective on Backtracking [54.43083499412643]
Test-time algorithms that combine the generative power of language models with process verifiers offer a promising lever for eliciting new reasoning capabilities. We introduce a new process-guided test-time sampling algorithm, VGB, which uses theoretically grounded backtracking to achieve provably better robustness to verifier errors.
arXiv Detail & Related papers (2025-10-03T16:21:14Z)
- TraceRAG: A LLM-Based Framework for Explainable Android Malware Detection and Behavior Analysis [8.977634735108895]
We introduce TraceRAG, a retrieval-augmented generation (RAG) framework to deliver explainable malware detection and analysis. First, TraceRAG generates summaries of method-level code snippets, which are indexed in a vector database. At query time, behavior-focused questions retrieve the most semantically relevant snippets for deeper inspection. Finally, based on the multi-turn analysis results, TraceRAG produces human-readable reports that present the identified malicious behaviors and their corresponding code implementations.
arXiv Detail & Related papers (2025-09-10T06:07:12Z)
- MirGuard: Towards a Robust Provenance-based Intrusion Detection System Against Graph Manipulation Attacks [13.92935628832727]
MirGuard is an anomaly detection framework that combines logic-aware multi-view augmentation with contrastive representation learning. MirGuard significantly outperforms state-of-the-art detectors in robustness against various graph manipulation attacks.
arXiv Detail & Related papers (2025-08-14T13:35:51Z)
- When LLMs Copy to Think: Uncovering Copy-Guided Attacks in Reasoning LLMs [30.532439965854767]
Large Language Models (LLMs) have become integral to automated code analysis, enabling tasks such as vulnerability detection and code comprehension. In this paper, we identify and investigate a new class of prompt-based attacks, termed Copy-Guided Attacks (CGA). We show that CGA reliably induces infinite loops, premature termination, false refusals, and semantic distortions in code analysis tasks.
arXiv Detail & Related papers (2025-07-22T17:21:36Z)
- Reformulation is All You Need: Addressing Malicious Text Features in DNNs [53.45564571192014]
We propose a unified and adaptive defense framework that is effective against both adversarial and backdoor attacks. Our framework outperforms existing sample-oriented defense baselines across a diverse range of malicious textual features.
arXiv Detail & Related papers (2025-02-02T03:39:43Z)
- Measuring Impacts of Poisoning on Model Parameters and Embeddings for Large Language Models of Code [4.305373051747465]
Large language models (LLMs) have revolutionized software development practices, yet concerns about their safety have arisen.
Backdoor attacks involve the insertion of triggers into training data, allowing attackers to manipulate the behavior of the model maliciously.
In this paper, we focus on analyzing the model parameters to detect potential backdoor signals in code models.
arXiv Detail & Related papers (2024-05-19T06:53:20Z)
- Analyzing Adversarial Inputs in Deep Reinforcement Learning [53.3760591018817]
We present a comprehensive analysis of the characterization of adversarial inputs, through the lens of formal verification.
We introduce a novel metric, the Adversarial Rate, to classify models based on their susceptibility to such perturbations.
Our analysis empirically demonstrates how adversarial inputs can affect the safety of a given DRL system with respect to such perturbations.
arXiv Detail & Related papers (2024-02-07T21:58:40Z)
- Software Vulnerability Detection via Deep Learning over Disaggregated Code Graph Representation [57.92972327649165]
This work explores a deep learning approach to automatically learn the insecure patterns from code corpora.
Because code naturally admits graph structures with parsing, we develop a novel graph neural network (GNN) to exploit both the semantic context and structural regularity of a program.
arXiv Detail & Related papers (2021-09-07T21:24:36Z)
- Unsupervised Controllable Generation with Self-Training [90.04287577605723]
Controllable generation with GANs remains a challenging research problem.
We propose an unsupervised framework to learn a distribution of latent codes that control the generator through self-training.
Our framework exhibits better disentanglement compared to other variants such as the variational autoencoder.
arXiv Detail & Related papers (2020-07-17T21:50:35Z)
- Graph Backdoor [53.70971502299977]
We present GTA, the first backdoor attack on graph neural networks (GNNs).
GTA departs in significant ways: it defines triggers as specific subgraphs, including both topological structures and descriptive features.
It can be instantiated for both transductive (e.g., node classification) and inductive (e.g., graph classification) tasks.
arXiv Detail & Related papers (2020-06-21T19:45:30Z)