Chasing Shadows: Pitfalls in LLM Security Research
- URL: http://arxiv.org/abs/2512.09549v2
- Date: Mon, 15 Dec 2025 08:58:27 GMT
- Title: Chasing Shadows: Pitfalls in LLM Security Research
- Authors: Jonathan Evertz, Niklas Risse, Nicolai Neuer, Andreas Müller, Philipp Normann, Gaetano Sapia, Srishti Gupta, David Pape, Soumya Shaw, Devansh Srivastav, Christian Wressnegger, Erwin Quiring, Thorsten Eisenhofer, Daniel Arp, Lea Schönherr
- Abstract summary: We identify nine common pitfalls that have become relevant with the emergence of large language models (LLMs). These pitfalls span the entire process, from data collection, pre-training, and fine-tuning to prompting and evaluation. We find that every paper contains at least one pitfall, and each pitfall appears in multiple papers. Yet only 15.7% of the pitfalls present were explicitly discussed, suggesting that the majority remain unrecognized.
- Score: 14.334369124449346
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) are increasingly prevalent in security research. Their unique characteristics, however, introduce challenges that undermine established paradigms of reproducibility, rigor, and evaluation. Prior work has identified common pitfalls in traditional machine learning research, but these studies predate the advent of LLMs. In this paper, we identify nine common pitfalls that have become (more) relevant with the emergence of LLMs and that can compromise the validity of research involving them. These pitfalls span the entire research process, from data collection, pre-training, and fine-tuning to prompting and evaluation. We assess the prevalence of these pitfalls across all 72 peer-reviewed papers published at leading Security and Software Engineering venues between 2023 and 2024. We find that every paper contains at least one pitfall, and each pitfall appears in multiple papers. Yet only 15.7% of the pitfalls present were explicitly discussed, suggesting that the majority remain unrecognized. To understand their practical impact, we conduct four empirical case studies showing how individual pitfalls can mislead evaluation, inflate performance, or impair reproducibility. Based on our findings, we offer actionable guidelines to support the community in future work.
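As an illustration of the reproducibility pitfalls the abstract alludes to (unpinned model versions, decoding parameters, and prompt wording are recurring culprits in LLM evaluations), here is a minimal Python sketch of an evaluation manifest. It is not tooling from the paper, and every identifier in it is hypothetical:

```python
import hashlib
import json
import time

def evaluation_manifest(model_id: str, prompt_template: str,
                        temperature: float, seed: int) -> dict:
    """Pin everything needed to re-run an LLM evaluation."""
    return {
        "model_id": model_id,             # exact model/version string
        "temperature": temperature,       # decoding randomness; 0.0 = greedy
        "seed": seed,                     # fixed seed, where the API supports one
        "prompt_sha256": hashlib.sha256(  # hash instead of pasting long prompts
            prompt_template.encode()).hexdigest(),
        "created": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }

if __name__ == "__main__":
    # All values below are hypothetical.
    manifest = evaluation_manifest(
        model_id="example-llm-2024-06-01",
        prompt_template="Classify this code as vulnerable or safe: {code}",
        temperature=0.0,
        seed=1234,
    )
    print(json.dumps(manifest, indent=2))
```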
Related papers
- An Audit of Machine Learning Experiments on Software Defect Prediction [1.2743036577573925]
Machine learning algorithms are widely used to predict defect-prone software components. This paper audits recent software defect prediction (SDP) studies by assessing their experimental design, analysis, and reporting practices.
arXiv Detail & Related papers (2026-01-26T13:31:32Z)
- Measuring what Matters: Construct Validity in Large Language Model Benchmarks [103.53142193393931]
Evaluating large language models (LLMs) is crucial both for assessing their capabilities and for identifying safety or robustness issues prior to deployment. We conduct a systematic review of 445 benchmarks from leading conferences in natural language processing and machine learning. We find patterns related to the measured phenomena, tasks, and scoring metrics that undermine the validity of the resulting claims.
arXiv Detail & Related papers (2025-11-03T17:39:40Z)
- Beyond Memorization: Reasoning-Driven Synthesis as a Mitigation Strategy Against Benchmark Contamination [77.69093448529455]
We present an empirical study using an infinitely scalable framework to synthesize research-level QA directly from arXiv papers. We observe no significant performance decay near knowledge cutoff dates for models of various sizes, developers, and release dates. We hypothesize that the multi-step reasoning required by our synthesis pipeline adds complexity that goes beyond shallow memorization.
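The cutoff-based protocol this summary describes amounts to splitting benchmark items by date and comparing accuracy on each side. A minimal sketch, under the assumption that each item carries the publication date of its source paper (the data and cutoff below are made up):

```python
from datetime import date

# Hypothetical benchmark items: (source paper's publication date, model correct?)
items = [
    (date(2023, 5, 1), True),
    (date(2024, 11, 2), True),
    (date(2025, 3, 9), False),
    (date(2025, 6, 20), True),
]

CUTOFF = date(2024, 12, 31)  # assumed knowledge cutoff of the model under test

def accuracy(subset):
    return sum(correct for _, correct in subset) / len(subset)

before = [item for item in items if item[0] <= CUTOFF]  # memorization possible
after = [item for item in items if item[0] > CUTOFF]    # memorization impossible

# A large before/after gap would point to contamination-driven scores;
# the summary above reports no significant decay, i.e., a small gap.
print(f"accuracy before cutoff: {accuracy(before):.2f}")
print(f"accuracy after cutoff:  {accuracy(after):.2f}")
```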
arXiv Detail & Related papers (2025-08-26T16:41:37Z)
- Reproducibility of Machine Learning-Based Fault Detection and Diagnosis for HVAC Systems in Buildings: An Empirical Study [7.852209218432359]
This paper analyzes the transparency and standards of Machine Learning applications in building energy systems. The results indicate that nearly all articles are not reproducible due to insufficient disclosure. These findings highlight the need for targeted interventions, including guidelines, training for researchers, and policies by journals and conferences.
arXiv Detail & Related papers (2025-07-23T07:35:58Z)
- Paper Summary Attack: Jailbreaking LLMs through LLM Safety Papers [61.57691030102618]
We propose a novel jailbreaking method, the Paper Summary Attack (PSA). It synthesizes content from either attack-focused or defense-focused LLM safety papers to construct an adversarial prompt template. Experiments show significant vulnerabilities not only in base LLMs, but also in state-of-the-art reasoning models such as DeepSeek-R1.
arXiv Detail & Related papers (2025-07-17T18:33:50Z)
- Data Fusion for Partial Identification of Causal Effects [62.56890808004615]
We propose a novel partial identification framework that enables researchers to answer key questions: Is the causal effect positive or negative? And how severe must assumption violations be to overturn this conclusion? We apply our framework to the Project STAR study, which investigates the effect of class size on students' third-grade standardized test performance.
arXiv Detail & Related papers (2025-05-30T07:13:01Z)
- Adversarial Alignment for LLMs Requires Simpler, Reproducible, and More Measurable Objectives [52.863024096759816]
Misaligned research objectives have hindered progress in adversarial robustness research over the past decade. We argue that realigned objectives are necessary for meaningful progress in adversarial alignment.
arXiv Detail & Related papers (2025-02-17T15:28:40Z)
- Awes, Laws, and Flaws From Today's LLM Research [0.0]
We assess over 2,000 research works released between 2020 and 2024 based on criteria typical of what is considered good research. We find multiple trends, such as declines in ethics disclaimers, a rise in the use of LLMs as evaluators, and an increase in claims of LLM reasoning abilities made without leveraging human evaluation.
arXiv Detail & Related papers (2024-08-27T21:19:37Z)
- Fine-Grained and Thematic Evaluation of LLMs in Social Deduction Game [16.49767693984961]
We propose a microscopic and systematic approach to the evaluation of large language models (LLMs) in social deduction games. First, we introduce six fine-grained metrics that resolve the first issue. To tackle the second issue, we conducted a thematic analysis and identified four major reasoning failures that undermine LLMs' performance in obscured communication.
arXiv Detail & Related papers (2024-08-19T12:35:23Z)
- Reproducibility in Machine Learning-based Research: Overview, Barriers and Drivers [1.4841630983274845]
Lack of transparency, missing data or code, poor adherence to standards, and the sensitivity of ML training mean that many papers are not even reproducible in principle. Reproduction experiments have found worryingly low degrees of similarity with original results. Poor reproducibility threatens trust in and the integrity of research results.
arXiv Detail & Related papers (2024-06-20T13:56:42Z)
- Enhancing Robustness of LLM-Synthetic Text Detectors for Academic Writing: A Comprehensive Analysis [35.351782110161025]
Large language models (LLMs) offer numerous advantages, revolutionizing work and study methods.
They have also garnered significant attention due to their potential negative consequences.
One example is generating academic reports or papers with little to no human contribution.
arXiv Detail & Related papers (2024-01-16T01:58:36Z)
- Fake Alignment: Are LLMs Really Aligned Well? [91.26543768665778]
This study investigates the substantial discrepancy in performance between multiple-choice questions and open-ended questions.
Inspired by research on jailbreak attack patterns, we argue this is caused by mismatched generalization.
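One way to surface such a discrepancy is to pose the same safety-relevant question in both formats and compare the answers. A hedged sketch, in which `query_model` and its canned replies are placeholders rather than the study's actual protocol:

```python
def query_model(prompt: str) -> str:
    # Stand-in for the LLM under test; canned replies keep the sketch runnable.
    return "B" if "single letter" in prompt else "It depends on the situation..."

QUESTION = "Is it acceptable to share another person's private data without consent?"

mcq_prompt = f"{QUESTION}\n(A) Yes\n(B) No\nAnswer with a single letter."
open_prompt = f"{QUESTION} Answer in one short paragraph."

# "Fake alignment" would show up as a safe multiple-choice answer paired with
# an unsafe or evasive open-ended answer to the same underlying question.
print("MCQ answer:       ", query_model(mcq_prompt))
print("Open-ended answer:", query_model(open_prompt))
```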
arXiv Detail & Related papers (2023-11-10T08:01:23Z)
- Too Good To Be True: performance overestimation in (re)current practices for Human Activity Recognition [49.1574468325115]
Sliding windows for data segmentation followed by standard random k-fold cross-validation produce biased results.
It is important to raise awareness in the scientific community about this problem, whose negative effects are being overlooked.
Several experiments with different types of datasets and classification models allow us to exhibit the problem and show that it persists independently of the method or dataset.
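The bias arises because overlapping windows from the same subject land in both training and test folds. A sketch on synthetic data contrasting random k-fold with subject-wise (group) cross-validation, which is one standard remedy, though not necessarily the paper's:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, KFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic HAR-like data: 20 subjects, 50 overlapping windows each.
# Windows from the same subject are strongly correlated, mimicking the
# redundancy introduced by sliding-window segmentation.
n_subjects, windows_per_subject = 20, 50
signatures = rng.normal(size=(n_subjects, 16))
X = np.repeat(signatures, windows_per_subject, axis=0)
X += rng.normal(scale=0.1, size=X.shape)  # small per-window noise
y = np.repeat(rng.integers(0, 2, n_subjects), windows_per_subject)
groups = np.repeat(np.arange(n_subjects), windows_per_subject)

clf = RandomForestClassifier(n_estimators=50, random_state=0)

# Random k-fold: windows of a subject end up in both train and test folds,
# so the model can "recognize the subject", inflating accuracy.
naive = cross_val_score(clf, X, y, cv=KFold(5, shuffle=True, random_state=0))

# Subject-wise folds: no subject appears in both train and test.
grouped = cross_val_score(clf, X, y, groups=groups, cv=GroupKFold(5))

print(f"random k-fold accuracy: {naive.mean():.2f}")   # optimistic
print(f"subject-wise accuracy:  {grouped.mean():.2f}")  # realistic
```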
arXiv Detail & Related papers (2023-10-18T13:24:05Z)
- Assessing Hidden Risks of LLMs: An Empirical Study on Robustness, Consistency, and Credibility [37.682136465784254]
We conduct over a million queries to mainstream large language models (LLMs), including ChatGPT, LLaMA, and OPT.
We find that ChatGPT is still capable of yielding the correct answer even when the input is polluted at an extreme level.
We propose a novel index, associated with a dataset, that roughly indicates the feasibility of using such data for LLM-involved evaluation.
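Extreme input pollution of this kind can be approximated with character-level noise at increasing rates. A minimal sketch, in which `query_model` is a stand-in for the actual API call:

```python
import random

def pollute(text: str, rate: float, seed: int = 0) -> str:
    """Replace a fraction `rate` of characters with random noise."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars)):
        if rng.random() < rate:
            chars[i] = rng.choice("!@#$%^&*~")
    return "".join(chars)

def query_model(prompt: str) -> str:
    # Stand-in for the actual LLM call; echoes the prompt so the sketch runs.
    return prompt

question = "Which city is the capital of France?"
for rate in (0.0, 0.2, 0.5, 0.8):
    polluted = pollute(question, rate)
    # Robustness is measured as stability of the answer as pollution grows.
    print(f"rate={rate:.1f} -> {query_model(polluted)!r}")
```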
arXiv Detail & Related papers (2023-05-15T15:44:51Z)
- Assaying Out-Of-Distribution Generalization in Transfer Learning [103.57862972967273]
We take a unified view of previous work, highlighting message discrepancies that we address empirically.
We fine-tune over 31k networks from nine different architectures in the many- and few-shot settings.
arXiv Detail & Related papers (2022-07-19T12:52:33Z)