Securing LLM-Generated Embedded Firmware through AI Agent-Driven Validation and Patching
- URL: http://arxiv.org/abs/2509.09970v1
- Date: Fri, 12 Sep 2025 05:15:35 GMT
- Title: Securing LLM-Generated Embedded Firmware through AI Agent-Driven Validation and Patching
- Authors: Seyed Moein Abtahi, Akramul Azim
- Abstract summary: Large Language Models (LLMs) show promise in generating firmware for embedded systems, but often introduce security flaws and fail to meet real-time performance constraints. This paper proposes a three-phase methodology that combines LLM-based firmware generation with automated security validation and iterative refinement.
- Score: 0.9582466286528458
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) show promise in generating firmware for embedded systems, but often introduce security flaws and fail to meet real-time performance constraints. This paper proposes a three-phase methodology that combines LLM-based firmware generation with automated security validation and iterative refinement in a virtualized environment. Using structured prompts, models like GPT-4 generate firmware for networking and control tasks, deployed on FreeRTOS via QEMU. These implementations are tested using fuzzing, static analysis, and runtime monitoring to detect vulnerabilities such as buffer overflows (CWE-120), race conditions (CWE-362), and denial-of-service threats (CWE-400). Specialized AI agents for Threat Detection, Performance Optimization, and Compliance Verification collaborate to improve detection and remediation. Identified issues are categorized using CWE, then used to prompt targeted LLM-generated patches in an iterative loop. Experiments show a 92.4% Vulnerability Remediation Rate (37.3% improvement), 95.8% Threat Model Compliance, and 0.87 Security Coverage Index. Real-time metrics include 8.6 ms worst-case execution time and 195 µs jitter. This process enhances firmware security and performance while contributing an open-source dataset for future research.
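The abstract describes a generate-validate-patch cycle that is easy to picture in code. Below is a minimal, hypothetical sketch of that loop; every name (secure_firmware, Finding, the toy stubs) is an illustrative assumption, and the paper's actual tooling (GPT-4 prompts, QEMU/FreeRTOS deployment, real fuzzers and analyzers) is not modeled.

```python
# Hypothetical sketch of the paper's generate -> validate -> patch loop.
# All names here are illustrative stand-ins, not the authors' code.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Finding:
    cwe: str     # e.g. "CWE-120" (classic buffer overflow)
    detail: str  # message from the analyzer, fuzzer, or runtime monitor

def validate(firmware: str, checks: List[Callable[[str], List[Finding]]]) -> List[Finding]:
    """Run every check (static analysis, fuzzing, runtime monitoring) and merge findings."""
    findings: List[Finding] = []
    for check in checks:
        findings.extend(check(firmware))
    return findings

def build_patch_prompt(firmware: str, findings: List[Finding]) -> str:
    """Turn CWE-categorized findings into a targeted repair prompt."""
    issues = "\n".join(f"- {f.cwe}: {f.detail}" for f in findings)
    return (
        "The following firmware failed security validation.\n"
        f"Findings:\n{issues}\n"
        "Rewrite the firmware to remediate each CWE while preserving behavior:\n"
        f"{firmware}"
    )

def secure_firmware(task_prompt: str,
                    generate: Callable[[str], str],
                    checks: List[Callable[[str], List[Finding]]],
                    max_rounds: int = 5) -> str:
    """Iterate generate -> validate -> patch until clean or the round budget is spent."""
    firmware = generate(task_prompt)
    for _ in range(max_rounds):
        findings = validate(firmware, checks)
        if not findings:
            break  # all checks passed
        firmware = generate(build_patch_prompt(firmware, findings))
    return firmware

# Toy usage with stub components (no real LLM or analyzer is invoked):
if __name__ == "__main__":
    def toy_llm(prompt: str) -> str:
        # A real system would call GPT-4 here; this stub "fixes" on retry.
        return "strlcpy(buf, in, sizeof buf);" if "failed" in prompt else "strcpy(buf, in);"

    def toy_static_check(fw: str) -> List[Finding]:
        return [Finding("CWE-120", "unbounded strcpy")] if "strcpy(" in fw else []

    print(secure_firmware("UART echo task", toy_llm, [toy_static_check]))
```

The paper's specialized agents (Threat Detection, Performance Optimization, Compliance Verification) would plausibly slot in as additional entries in `checks`, each contributing its own findings to the shared patch prompt.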
Related papers
- Execution-State-Aware LLM Reasoning for Automated Proof-of-Vulnerability Generation [36.950993500170014]
We present DrillAgent, an agentic framework that reformulates PoV generation as an iterative hypothesis-verification-refinement process. We evaluate DrillAgent on SEC-bench, a large-scale benchmark of real-world C/C++ vulnerabilities.
arXiv Detail & Related papers (2026-02-14T03:17:27Z)
- RealSec-bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories [58.32028251925354]
Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, but their proficiency in producing secure code remains a critical, under-explored area. We introduce RealSec-bench, a new benchmark for secure code generation meticulously constructed from real-world, high-risk Java repositories.
arXiv Detail & Related papers (2026-01-30T08:29:01Z)
- ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack [52.17935054046577]
We present ReasAlign, a model-level solution to improve safety alignment against indirect prompt injection attacks. ReasAlign incorporates structured reasoning steps to analyze user queries, detect conflicting instructions, and preserve the continuity of the user's intended tasks.
arXiv Detail & Related papers (2026-01-15T08:23:38Z)
- Towards Verifiably Safe Tool Use for LLM Agents [53.55621104327779]
Large language model (LLM)-based AI agents extend capabilities by enabling access to tools such as data sources, APIs, search engines, code sandboxes, and even other agents. LLMs may invoke unintended tool interactions and introduce risks, such as leaking sensitive data or overwriting critical records. Current approaches to mitigate these risks, such as model-based safeguards, enhance agents' reliability but cannot guarantee system safety.
arXiv Detail & Related papers (2026-01-12T21:31:38Z)
- CHASE: LLM Agents for Dissecting Malicious PyPI Packages [2.384873896423002]
Large Language Models (LLMs) offer promising capabilities for automated code analysis. Their application to security-critical malware detection faces fundamental challenges, including hallucination and context confusion. We present CHASE, a high-reliability multi-agent architecture that addresses these limitations.
arXiv Detail & Related papers (2026-01-11T10:06:14Z)
- Improving LLM-Assisted Secure Code Generation through Retrieval-Augmented-Generation and Multi-Tool Feedback [1.1017250479834206]
Large Language Models (LLMs) can generate code but often introduce security vulnerabilities, logical inconsistencies, and compilation errors. We propose a retrieval-augmented, multi-tool repair workflow in which a single code-generating LLM iteratively refines its outputs.
arXiv Detail & Related papers (2026-01-01T23:34:00Z)
- LLMs-Powered Real-Time Fault Injection: An Approach Toward Intelligent Fault Test Cases Generation [1.9435397960631864]
A novel Large Language Models (LLMs)-assisted fault test cases (TCs) generation approach is proposed in this paper. The proposed approach exhibits high performance in terms of FSR classification and fault TC generation, with F1-scores of 88% and 97.5%, respectively.
arXiv Detail & Related papers (2025-11-24T13:57:31Z)
- ParaVul: A Parallel Large Language Model and Retrieval-Augmented Framework for Smart Contract Vulnerability Detection [43.41293570032631]
ParaVul is a retrieval-augmented framework to improve the reliability and accuracy of smart contract vulnerability detection. We develop Sparse Low-Rank Adaptation (SLoRA) for LLM fine-tuning. We construct a vulnerability contract dataset and develop a hybrid Retrieval-Augmented Generation (RAG) system.
arXiv Detail & Related papers (2025-10-20T03:23:41Z)
- Mind the Gap: Time-of-Check to Time-of-Use Vulnerabilities in LLM-Enabled Agents [4.303444472156151]
Large Language Model (LLM)-enabled agents are rapidly emerging across a wide range of applications. This work presents the first study of time-of-check to time-of-use (TOCTOU) vulnerabilities in LLM-enabled agents. We introduce TOCTOU-Bench, a benchmark with 66 realistic user tasks designed to evaluate this class of vulnerabilities.
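Since this entry centers on the TOCTOU pattern, here is a minimal, self-contained illustration of the bug class in plain file-handling code; the function names and sandbox framing are hypothetical and not taken from the paper.

```python
# Illustrative TOCTOU (time-of-check to time-of-use) sketch: the property
# verified at check time is not guaranteed to hold at use time.
import os

def unsafe_read(path: str) -> str:
    # CHECK at time T0: path looks like an ordinary, non-symlinked file.
    if not os.path.isfile(path) or os.path.islink(path):
        raise PermissionError("rejected")
    # USE at time T1: between T0 and T1, another process (or a concurrent
    # agent step) can swap `path` for a symlink to a secret; the earlier
    # check says nothing about what this open() actually reaches.
    with open(path) as f:
        return f.read()

def safer_read(path: str) -> str:
    # Shrink the gap: enforce the checked property at open time itself.
    # O_NOFOLLOW (POSIX) makes the open fail if the final path component
    # is a symlink, so check and use happen in one atomic syscall.
    fd = os.open(path, os.O_RDONLY | os.O_NOFOLLOW)
    with os.fdopen(fd) as f:
        return f.read()
```

In the agent setting studied by the paper, the analogous race sits between an agent's validation of a tool call and the later execution of that call.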
arXiv Detail & Related papers (2025-08-23T22:41:49Z)
- AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security [74.22452069013289]
AegisLLM is a cooperative multi-agent defense against adversarial attacks and information leakage. We show that scaling agentic reasoning systems at test time substantially enhances robustness without compromising model utility. Comprehensive evaluations across key threat scenarios, including unlearning and jailbreaking, demonstrate the effectiveness of AegisLLM.
arXiv Detail & Related papers (2025-04-29T17:36:05Z)
- Benchmarking LLMs and LLM-based Agents in Practical Vulnerability Detection for Code Repositories [8.583591493627276]
We introduce JitVul, a vulnerability detection benchmark linking each function to its vulnerability-introducing and fixing commits. We show that ReAct Agents, leveraging thought-action-observation and interprocedural context, perform better than LLMs in distinguishing vulnerable from benign code.
arXiv Detail & Related papers (2025-03-05T15:22:24Z)
- LLM4CVE: Enabling Iterative Automated Vulnerability Repair with Large Language Models [9.946058168276744]
Large Language Models (LLMs) have opened up the possibility for many software defects to be patched automatically. We propose an iterative pipeline that robustly fixes vulnerable functions in real-world code with high accuracy. We achieve a human-verified quality score of 8.51/10 and a 20% increase in ground-truth code similarity with Llama 3 70B.
arXiv Detail & Related papers (2025-01-07T00:21:42Z)
- Can LLM Prompting Serve as a Proxy for Static Analysis in Vulnerability Detection [9.269926508651091]
Large language models (LLMs) have shown limited ability on safety-critical code tasks such as vulnerability detection. We propose prompting strategies that integrate natural language instructions of vulnerabilities with contrastive chain-of-thought reasoning. Our findings demonstrate that security-aware prompting techniques can be effective alternatives to the laborious, hand-crafted rules of static analyzers.
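To make "contrastive chain-of-thought" concrete, here is a hypothetical sketch of how such a prompt could be assembled: a vulnerable/fixed pair with brief reasoning precedes the code under review. The CWE hint, example snippets, and wording are illustrative assumptions, not the authors' actual prompts.

```python
# Hypothetical contrastive chain-of-thought prompt builder for
# vulnerability detection; all strings are illustrative, not the paper's.
CWE_HINT = "CWE-120: copying input into a fixed buffer without a bound check."

VULNERABLE = "strcpy(dst, src);              // no length check"
FIXED = "strlcpy(dst, src, sizeof dst);  // bounded copy"

def build_prompt(candidate: str) -> str:
    return f"""You are a security reviewer. Vulnerability class: {CWE_HINT}

Contrastive example:
  VULNERABLE: {VULNERABLE}
  Reasoning: the destination size is never consulted, so long input overflows.
  FIXED: {FIXED}
  Reasoning: the copy is capped at the destination buffer's capacity.

Analyze the code below step by step, then answer VULNERABLE or SAFE:
{candidate}"""

print(build_prompt("memcpy(buf, pkt, pkt_len);"))
```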
arXiv Detail & Related papers (2024-12-16T18:08:14Z)
- AutoPT: How Far Are We from the End2End Automated Web Penetration Testing? [54.65079443902714]
We introduce AutoPT, an automated penetration testing agent based on the principle of PSM driven by LLMs.
Our results show that AutoPT outperforms the baseline framework ReAct on the GPT-4o mini model.
arXiv Detail & Related papers (2024-11-02T13:24:30Z)
- ADVLLM: Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities [63.603861880022954]
We introduce ADV-LLM, an iterative self-tuning process that crafts adversarial LLMs with enhanced jailbreak ability. Our framework significantly reduces the computational cost of generating adversarial suffixes while achieving nearly 100% ASR on various open-source LLMs. It exhibits strong attack transferability to closed-source models, achieving 99% ASR on GPT-3.5 and 49% ASR on GPT-4, despite being optimized solely on Llama3.
arXiv Detail & Related papers (2024-10-24T06:36:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.