Capture the Flags: Family-Based Evaluation of Agentic LLMs via Semantics-Preserving Transformations
- URL: http://arxiv.org/abs/2602.05523v1
- Date: Thu, 05 Feb 2026 10:30:57 GMT
- Title: Capture the Flags: Family-Based Evaluation of Agentic LLMs via Semantics-Preserving Transformations
- Authors: Shahin Honarvar, Amber Gorzynski, James Lee-Jones, Harry Coppock, Marek Rei, Joseph Ryan, Alastair F. Donaldson
- Abstract summary: Agentic large language models (LLMs) are increasingly evaluated on cybersecurity tasks using capture-the-flag benchmarks. We introduce CTF challenge families, whereby a single CTF is used as the basis for generating a family of semantically-equivalent challenges. We introduce a new tool, Evolve-CTF, that generates CTF families from Python challenges using a range of transformations.
- Score: 9.234598988803407
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Agentic large language models (LLMs) are increasingly evaluated on cybersecurity tasks using capture-the-flag (CTF) benchmarks. However, existing pointwise benchmarks have limited ability to shed light on the robustness and generalisation abilities of agents across alternative versions of the source code. We introduce CTF challenge families, whereby a single CTF is used as the basis for generating a family of semantically-equivalent challenges via semantics-preserving program transformations. This enables controlled evaluation of agent robustness to source code transformations while keeping the underlying exploit strategy fixed. We introduce a new tool, Evolve-CTF, that generates CTF families from Python challenges using a range of transformations. Using Evolve-CTF to derive families from Cybench and Intercode challenges, we evaluate 13 agentic LLM configurations with tool access. We find that models are remarkably robust to intrusive renaming and code insertion-based transformations, but that composed transformations and deeper obfuscation affect performance by requiring more sophisticated use of tools. We also find that enabling explicit reasoning has little effect on solution success rates across challenge families. Our work contributes a valuable technique and tool for future LLM evaluations, and a large dataset characterising the capabilities of current state-of-the-art models in this domain.
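The abstract does not spell out how Evolve-CTF implements its transformations. As an illustrative sketch only (not the paper's implementation), an intrusive renaming transformation over Python source can be built on the standard-library `ast` module; the challenge source and the `RenameIdentifiers` name below are invented for the example:

```python
import ast

class RenameIdentifiers(ast.NodeTransformer):
    """Rename variables according to a mapping. Only names are
    rewritten, so the program's behaviour is unchanged (a
    semantics-preserving transformation). This sketch handles plain
    variable names; function parameters would need visit_arg too."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # ast.Name covers both loads and stores of a variable.
        node.id = self.mapping.get(node.id, node.id)
        return node

source = '''
secret = "flag{example}"
guess = input()
print("correct" if guess == secret else "wrong")
'''

tree = ast.parse(source)
tree = RenameIdentifiers({"secret": "qx_1", "guess": "qx_2"}).visit(tree)
transformed = ast.unparse(tree)  # ast.unparse requires Python 3.9+
print(transformed)
```

The transformed challenge still hides the same flag behind the same exploit strategy, which is what lets a family of such variants probe an agent's robustness rather than its memorisation of one surface form.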
Related papers
- Fool Me If You Can: On the Robustness of Binary Code Similarity Detection Models against Semantics-preserving Transformations [7.222996408214315]
We evaluate the robustness of deep learning models for the task of binary code similarity detection. We construct a dataset of 9,565 binary variants from 620 baseline samples.
arXiv Detail & Related papers (2026-02-13T07:23:15Z)
- Finding the Translation Switch: Discovering and Exploiting the Task-Initiation Features in LLMs [69.28193153685893]
Large Language Models (LLMs) frequently exhibit strong translation abilities, even without task-specific fine-tuning. To demystify this process, we leverage Sparse Autoencoders (SAEs) and introduce a novel framework for identifying task-specific features. Our work not only decodes a core component of the translation mechanism in LLMs but also provides a blueprint for using internal model mechanisms to create more robust and efficient models.
arXiv Detail & Related papers (2026-01-16T06:29:07Z)
- Attribution-Guided Decoding [24.52258081219335]
We introduce Attribution-Guided Decoding (AGD), an interpretability-based decoding strategy. Instead of directly manipulating model activations, AGD considers a set of high-probability output token candidates. We demonstrate AGD's efficacy across three challenging domains.
arXiv Detail & Related papers (2025-09-30T14:21:40Z)
- Improving LLM Agents with Reinforcement Learning on Cryptographic CTF Challenges [0.0]
'Random-Crypto' is a procedurally generated cryptographic dataset designed to unlock the potential of Reinforcement Learning. We fine-tune a Python tool-augmented Llama-3.1-8B via Group Relative Policy Optimization. The resulting agent achieves a significant improvement in Pass@8 on previously unseen challenges.
arXiv Detail & Related papers (2025-06-01T01:59:52Z)
- Leveraging LLM Inconsistency to Boost Pass@k Performance [3.797421474324735]
Large language models (LLMs) achieve impressive abilities in numerous domains, but exhibit inconsistent performance in response to minor input changes. We introduce a novel method for leveraging models' inconsistency to boost Pass@k performance. Specifically, we present a "Variator" agent that generates k variants of a given task and submits one candidate solution for each one.
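The Pass@k metric referenced in this summary is standard; a commonly used unbiased estimator (from the HumanEval evaluation methodology, not specific to this paper) gives the probability that at least one of k samples drawn from n attempts with c successes is correct:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: 1 - C(n-c, k) / C(n, k),
    i.e. one minus the probability that all k drawn samples fail,
    given n total attempts of which c succeeded."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so any draw of k succeeds
    return 1.0 - comb(n - c, k) / comb(n, k)

# A "Variator"-style approach aims to raise this metric by turning one
# task into k variants, so that at least one variant elicits a correct
# solution. With 3 successes out of 10 attempts:
print(pass_at_k(10, 3, 5))
```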
arXiv Detail & Related papers (2025-05-19T10:22:04Z)
- Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger [49.81945268343162]
We propose MeCo, an adaptive decision-making strategy for external tool use. MeCo quantifies metacognitive scores by capturing high-level cognitive signals in the representation space. MeCo is fine-tuning-free and incurs minimal cost.
arXiv Detail & Related papers (2025-02-18T15:45:01Z)
- EnIGMA: Interactive Tools Substantially Assist LM Agents in Finding Security Vulnerabilities [46.34031902647788]
We present EnIGMA, an LM agent for autonomously solving Capture The Flag (CTF) challenges. We introduce new tools and interfaces to improve the agent's ability to find and exploit security vulnerabilities. Empirical analysis on 390 CTF challenges demonstrates that these new tools and interfaces substantially improve our agent's performance.
arXiv Detail & Related papers (2024-09-24T15:06:01Z)
- CLIP with Generative Latent Replay: a Strong Baseline for Incremental Learning [17.614980614656407]
We propose Continual Generative training for Incremental prompt-Learning.
We exploit Variational Autoencoders to learn class-conditioned distributions.
We show that such a generative replay approach can adapt to new tasks while improving zero-shot capabilities.
arXiv Detail & Related papers (2024-07-22T16:51:28Z)
- Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z)
- Bayesian Transformer Language Models for Speech Recognition [59.235405107295655]
State-of-the-art neural language models (LMs) represented by Transformers are highly complex.
This paper proposes a full Bayesian learning framework for Transformer LM estimation.
arXiv Detail & Related papers (2021-02-09T10:55:27Z)
- FLAT: Few-Shot Learning via Autoencoding Transformation Regularizers [67.46036826589467]
We present a novel regularization mechanism by learning the change of feature representations induced by a distribution of transformations without using the labels of data examples.
It could minimize the risk of overfitting to base categories by inspecting transformation-augmented variations at the encoded feature level.
Experimental results show performance superior to current state-of-the-art methods in the literature.
arXiv Detail & Related papers (2019-12-29T15:26:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.