Trojan Detection in Large Language Models: Insights from The Trojan Detection Challenge
- URL: http://arxiv.org/abs/2404.13660v1
- Date: Sun, 21 Apr 2024 13:31:16 GMT
- Title: Trojan Detection in Large Language Models: Insights from The Trojan Detection Challenge
- Authors: Narek Maloyan, Ekansh Verma, Bulat Nutfullin, Bislan Ashinov
- Abstract summary: Large Language Models (LLMs) have demonstrated remarkable capabilities in various domains, but their vulnerability to trojan or backdoor attacks poses significant security risks.
This paper explores the challenges and insights gained from the Trojan Detection Competition 2023 (TDC2023).
We investigate the difficulty of distinguishing between intended and unintended triggers, as well as the feasibility of reverse engineering trojans in real-world scenarios.
- Score: 0.056247917037481096
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in various domains, but their vulnerability to trojan or backdoor attacks poses significant security risks. This paper explores the challenges and insights gained from the Trojan Detection Competition 2023 (TDC2023), which focused on identifying and evaluating trojan attacks on LLMs. We investigate the difficulty of distinguishing between intended and unintended triggers, as well as the feasibility of reverse engineering trojans in real-world scenarios. Our comparative analysis of various trojan detection methods reveals that achieving high Recall scores is significantly more challenging than obtaining high Reverse-Engineering Attack Success Rate (REASR) scores. The top-performing methods in the competition achieved Recall scores around 0.16, comparable to a simple baseline of randomly sampling sentences from a distribution similar to the given training prefixes. This finding raises questions about the detectability and recoverability of trojans inserted into the model, given only the harmful targets. Despite the inability to fully solve the problem, the competition has led to interesting observations about the viability of trojan detection and improved techniques for optimizing LLM input prompts. The phenomenon of unintended triggers and the difficulty in distinguishing them from intended triggers highlights the need for further research into the robustness and interpretability of LLMs. The TDC2023 has provided valuable insights into the challenges and opportunities associated with trojan detection in LLMs, laying the groundwork for future research in this area to ensure their safety and reliability in real-world applications.
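To make the two metrics concrete, a minimal sketch follows (in Python; this is not the official TDC2023 scorer). The n-gram overlap used for the Recall-style score and the `generate` callable used for the REASR-style score are illustrative assumptions:

```python
# Hedged sketch of Recall- and REASR-style scoring for guessed triggers.
# The similarity function and the generate() interface are assumptions,
# not the competition's exact definitions.
from collections import Counter

def ngram_recall(guess: str, truth: str, n: int = 1) -> float:
    """Fraction of the true trigger's n-grams recovered by the guess."""
    g = Counter(zip(*[guess.split()[i:] for i in range(n)]))
    t = Counter(zip(*[truth.split()[i:] for i in range(n)]))
    return sum((g & t).values()) / max(sum(t.values()), 1)

def recall_score(guesses, true_triggers):
    # Credit each ground-truth trigger with its best-matching guess.
    return sum(max(ngram_recall(g, t) for g in guesses)
               for t in true_triggers) / len(true_triggers)

def reasr(guesses, generate, target):
    # Fraction of guessed prompts that actually elicit the harmful target.
    return sum(generate(g).startswith(target) for g in guesses) / len(guesses)
```

Under this kind of scoring, randomly sampled sentences from a prefix-like distribution can already pick up incidental n-gram overlap, which is consistent with the roughly 0.16 Recall baseline the abstract describes.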
Related papers
- Trojans in Artificial Intelligence (TrojAI) Final Report [52.6138928911574]
TrojAI was launched to confront an emerging vulnerability in modern artificial intelligence: the threat of AI Trojans. TrojAI helped to map out the complex nature of the threat and pioneered foundational detection methods. The report concludes with lessons learned and recommendations for advancing AI security research.
arXiv Detail & Related papers (2026-02-06T19:52:14Z)
- DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models [50.21378052667732]
We conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics. We propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach.
arXiv Detail & Related papers (2025-09-29T05:17:10Z)
- TrojanTO: Action-Level Backdoor Attacks against Trajectory Optimization Models [67.06525001375722]
TrojanTO is the first action-level backdoor attack against TO models. It implants backdoors across diverse tasks and attack objectives with a low attack budget. TrojanTO exhibits broad applicability to DT, GDT, and DC.
arXiv Detail & Related papers (2025-06-15T11:27:49Z)
- Pushing the Limits of Safety: A Technical Report on the ATLAS Challenge 2025 [167.94680155673046]
This report presents findings from the Adversarial Testing & Large-model Alignment Safety Grand Challenge (ATLAS) 2025. The competition involved 86 teams testing MLLM vulnerabilities via adversarial image-text attacks in two phases: white-box and black-box evaluations. The challenge establishes new benchmarks for MLLM safety evaluation and lays groundwork for advancing safer AI systems.
arXiv Detail & Related papers (2025-06-14T10:03:17Z)
- Lie Detector: Unified Backdoor Detection via Cross-Examination Framework [68.45399098884364]
We propose a unified backdoor detection framework in the semi-honest setting.
Our method achieves superior detection performance, improving accuracy by 5.4%, 1.6%, and 11.9% over SoTA baselines.
Notably, it is the first to effectively detect backdoors in multimodal large language models.
arXiv Detail & Related papers (2025-03-21T06:12:06Z)
- Scanning Trojaned Models Using Out-of-Distribution Samples [8.701370432442216]
Scanning for trojans (backdoors) in deep neural networks is crucial given their significant real-world applications.
We introduce a novel scanning method named TRODO (TROjan scanning by Detection of adversarial shifts in Out-of-distribution samples).
TRODO is both trojan and label-mapping agnostic, effective even against adversarially trained trojaned classifiers.
arXiv Detail & Related papers (2025-01-28T18:53:14Z)
- Trojan Detection Through Pattern Recognition for Large Language Models [0.8571111167616167]
Trojan backdoors can be injected into large language models at various stages.
We propose a multistage framework for detecting Trojan triggers in large language models.
arXiv Detail & Related papers (2025-01-20T17:36:04Z)
- Uncertainty-Aware Hardware Trojan Detection Using Multimodal Deep Learning [3.118371710802894]
The risk of hardware Trojans being inserted at various stages of chip production has increased in a zero-trust fabless era.
We propose a multimodal deep learning approach to detect hardware Trojans and evaluate the results from both early fusion and late fusion strategies.
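As a rough illustration of the two fusion strategies compared in that paper, the PyTorch sketch below contrasts an early-fusion head (concatenate features, classify jointly) with a late-fusion head (classify per modality, then average logits). The feature dimensions and module layout are invented for the example, not taken from the paper:

```python
# Illustrative early- vs late-fusion classifiers over two feature
# modalities; all dimensions here are assumptions for the sketch.
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    def __init__(self, d_a: int = 64, d_b: int = 32, n_classes: int = 2):
        super().__init__()
        # Fuse raw features first, then classify jointly.
        self.head = nn.Sequential(
            nn.Linear(d_a + d_b, 128), nn.ReLU(), nn.Linear(128, n_classes))

    def forward(self, a, b):
        return self.head(torch.cat([a, b], dim=-1))

class LateFusion(nn.Module):
    def __init__(self, d_a: int = 64, d_b: int = 32, n_classes: int = 2):
        super().__init__()
        # Classify each modality separately, then average the logits.
        self.head_a = nn.Linear(d_a, n_classes)
        self.head_b = nn.Linear(d_b, n_classes)

    def forward(self, a, b):
        return 0.5 * (self.head_a(a) + self.head_b(b))
```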
arXiv Detail & Related papers (2024-01-15T05:45:51Z)
- Poisoning Retrieval Corpora by Injecting Adversarial Passages [79.14287273842878]
We propose a novel attack for dense retrieval systems in which a malicious user generates a small number of adversarial passages.
When these adversarial passages are inserted into a large retrieval corpus, we show that this attack is highly effective in fooling these systems.
We also benchmark and compare a range of state-of-the-art dense retrievers, both unsupervised and supervised.
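A toy numpy illustration of why one injected passage can dominate retrieval for a cluster of related queries (the paper's actual attack optimizes discrete passage tokens with gradients; here the adversarial embedding is simply aimed at the query centroid):

```python
# Toy model of corpus poisoning in embedding space; sizes are arbitrary.
import numpy as np

rng = np.random.default_rng(0)

# A cluster of related query embeddings and an unrelated passage corpus,
# all unit-normalized so the dot product equals cosine similarity.
center = rng.normal(size=64)
center /= np.linalg.norm(center)
queries = center + 0.1 * rng.normal(size=(100, 64))
queries /= np.linalg.norm(queries, axis=1, keepdims=True)
corpus = rng.normal(size=(10_000, 64))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

# Adversarial passage: aim straight at the centroid of the target queries.
adv = queries.mean(axis=0)
adv /= np.linalg.norm(adv)
poisoned = np.vstack([corpus, adv])

# Fraction of queries whose top-1 retrieved passage is the injected one.
top1 = (poisoned @ queries.T).argmax(axis=0)
print("queries fooled:", (top1 == len(corpus)).mean())
```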
arXiv Detail & Related papers (2023-10-29T21:13:31Z)
- Risk-Aware and Explainable Framework for Ensuring Guaranteed Coverage in Evolving Hardware Trojan Detection [2.6396287656676733]
In high-risk and sensitive domains, even a small misclassification is unacceptable.
In this paper, we generate evolving hardware Trojans using our proposed novel conformalized generative adversarial networks.
The proposed approach has been validated on both synthetic and real chip-level benchmarks.
arXiv Detail & Related papers (2023-10-14T03:30:21Z)
- Defending Pre-trained Language Models as Few-shot Learners against Backdoor Attacks [72.03945355787776]
We advocate MDP, a lightweight, pluggable, and effective defense for PLMs as few-shot learners.
We show analytically that MDP creates an interesting dilemma for the attacker to choose between attack effectiveness and detection evasiveness.
arXiv Detail & Related papers (2023-09-23T04:41:55Z)
- Game of Trojans: A Submodular Byzantine Approach [9.512062990461212]
We provide an analytical characterization of adversarial capability and strategic interactions between the adversary and detection mechanism.
We propose a Submodular Trojan algorithm to determine the minimal fraction of samples to inject a Trojan trigger.
We show that the adversary wins the game with probability one, thus bypassing detection.
arXiv Detail & Related papers (2022-07-13T03:12:26Z)
- Quarantine: Sparsity Can Uncover the Trojan Attack Trigger for Free [126.15842954405929]
Trojan attacks threaten deep neural networks (DNNs) by poisoning them to behave normally on most samples, yet to produce manipulated results for inputs attached with a trigger.
We propose a novel Trojan network detection regime: first locating a "winning Trojan lottery ticket" which preserves nearly full Trojan information yet only chance-level performance on clean inputs; then recovering the trigger embedded in this already isolated subnetwork.
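A minimal PyTorch sketch of the first step, magnitude pruning toward a sparse "ticket" subnetwork; the keep ratio is an assumed hyperparameter, and the paper's search over sparsity levels and its trigger-recovery stage are omitted:

```python
# Hedged sketch: keep only the largest-magnitude weights, in place.
# Quarantine would then look for the sparsity at which clean accuracy
# collapses to chance while the trigger behavior survives, and run
# trigger recovery on that isolated subnetwork.
import torch
import torch.nn as nn

def magnitude_prune(model: nn.Module, keep_ratio: float = 0.05) -> None:
    for module in model.modules():
        if isinstance(module, (nn.Linear, nn.Conv2d)):
            w = module.weight.data
            k = max(1, int(keep_ratio * w.numel()))
            # Threshold at the k-th largest absolute weight.
            thresh = w.abs().flatten().topk(k).values.min()
            w.mul_((w.abs() >= thresh).float())
```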
arXiv Detail & Related papers (2022-05-24T06:33:31Z)
- Trigger Hunting with a Topological Prior for Trojan Detection [16.376009231934884]
This paper tackles the problem of Trojan detection, namely, identifying Trojaned models.
One popular approach is reverse engineering, recovering the triggers on a clean image by manipulating the model's prediction.
One major challenge of the reverse-engineering approach is the enormous search space of triggers.
We propose innovative priors such as diversity and topological simplicity to not only increase the chances of finding the appropriate triggers but also improve the quality of the found triggers.
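The sketch below illustrates this style of trigger reverse-engineering: several candidate (mask, pattern) pairs are optimized jointly to flip clean images to a target label, with a cosine-similarity penalty standing in for the diversity prior (the topological prior is omitted, and all hyperparameters are illustrative):

```python
# Hedged sketch of multi-candidate trigger recovery; not the paper's code.
import torch
import torch.nn.functional as F

def hunt_triggers(model, clean_x, target_label, n=4, steps=200, lr=0.1):
    _, c, h, w = clean_x.shape
    mask_logits = torch.zeros(n, 1, h, w, requires_grad=True)
    patterns = torch.rand(n, c, h, w, requires_grad=True)
    opt = torch.optim.Adam([mask_logits, patterns], lr=lr)
    tgt = torch.full((clean_x.size(0),), target_label, dtype=torch.long)
    for _ in range(steps):
        masks = torch.sigmoid(mask_logits)
        loss = 0.0
        for i in range(n):
            # Stamp candidate i onto the clean batch, push it toward tgt.
            stamped = (1 - masks[i]) * clean_x + masks[i] * patterns[i]
            loss = loss + F.cross_entropy(model(stamped), tgt)
            loss = loss + 1e-3 * masks[i].sum()  # favor small triggers
        # Diversity term: penalize candidate masks for resembling each other.
        flat = masks.view(n, -1)
        norms = flat.norm(dim=1, keepdim=True) + 1e-8
        sims = (flat @ flat.t()) / (norms @ norms.t())
        loss = loss + 0.1 * (sims.sum() - sims.diag().sum()) / (n * (n - 1))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.sigmoid(mask_logits).detach(), patterns.detach()
```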
arXiv Detail & Related papers (2021-10-15T19:47:00Z)
- Cassandra: Detecting Trojaned Networks from Adversarial Perturbations [92.43879594465422]
In many cases, pre-trained models are sourced from vendors who may have disrupted the training pipeline to insert Trojan behaviors into the models.
We propose a method to verify if a pre-trained model is Trojaned or benign.
Our method captures fingerprints of neural networks in the form of adversarial perturbations learned from the network gradients.
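A simplified sketch of that idea: a few steps of sign-gradient ascent yield a shared perturbation, and its summary statistics could feed a separate Trojaned-vs-benign meta-classifier (omitted here). This is an assumption-laden reduction of the paper's method, not its actual procedure:

```python
# Hedged sketch: extract a universal perturbation and summarize it.
import torch
import torch.nn.functional as F

def perturbation_fingerprint(model, images, labels, steps=10, eps=0.05):
    delta = torch.zeros_like(images[:1], requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(images + delta), labels)
        loss.backward()
        with torch.no_grad():
            delta += eps * delta.grad.sign()  # ascend the loss (untargeted)
            delta.clamp_(-0.1, 0.1)           # keep the perturbation small
        delta.grad.zero_()
    d = delta.detach()
    # Crude summary statistics standing in for the learned fingerprint.
    return torch.stack([d.abs().mean(), d.std(), d.norm()])
```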
arXiv Detail & Related papers (2020-07-28T19:00:40Z)
- Odyssey: Creation, Analysis and Detection of Trojan Models [91.13959405645959]
Trojan attacks interfere with the training pipeline by inserting triggers into some of the training samples and training the model to act maliciously only for samples that contain the trigger.
Existing Trojan detectors make strong assumptions about the types of triggers and attacks.
We propose a detector based on the analysis of intrinsic properties that are affected by the Trojaning process.
arXiv Detail & Related papers (2020-07-16T06:55:00Z)
- Scalable Backdoor Detection in Neural Networks [61.39635364047679]
Deep learning models are vulnerable to Trojan attacks, where an attacker can install a backdoor during training time to make the resultant model misidentify samples contaminated with a small trigger patch.
We propose a novel trigger reverse-engineering based approach whose computational complexity does not scale with the number of labels, and is based on a measure that is both interpretable and universal across different network and patch types.
In experiments, we observe that our method achieves a perfect score in separating Trojaned models from pure models, which is an improvement over the current state-of-the-art method.
arXiv Detail & Related papers (2020-06-10T04:12:53Z)