Concept Enhancement Engineering: A Lightweight and Efficient Robust Defense Against Jailbreak Attacks in Embodied AI
- URL: http://arxiv.org/abs/2504.13201v1
- Date: Tue, 15 Apr 2025 03:50:04 GMT
- Title: Concept Enhancement Engineering: A Lightweight and Efficient Robust Defense Against Jailbreak Attacks in Embodied AI
- Authors: Jirui Yang, Zheyu Lin, Shuhan Yang, Zhihui Lu, Xin Du,
- Abstract summary: Embodied Intelligence (EI) systems integrated with large language models (LLMs) face significant security risks.<n>Traditional defense strategies, such as input filtering and output monitoring, often introduce high computational overhead.<n>We propose Concept Enhancement Engineering (CEE) to enhance the safety of embodied LLMs by dynamically steering their internal activations.
- Score: 19.094809384824064
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Embodied Intelligence (EI) systems integrated with large language models (LLMs) face significant security risks, particularly from jailbreak attacks that manipulate models into generating harmful outputs or executing unsafe physical actions. Traditional defense strategies, such as input filtering and output monitoring, often introduce high computational overhead or interfere with task performance in real-time embodied scenarios. To address these challenges, we propose Concept Enhancement Engineering (CEE), a novel defense framework that leverages representation engineering to enhance the safety of embodied LLMs by dynamically steering their internal activations. CEE operates by (1) extracting multilingual safety patterns from model activations, (2) constructing control directions based on safety-aligned concept subspaces, and (3) applying subspace concept rotation to reinforce safe behavior during inference. Our experiments demonstrate that CEE effectively mitigates jailbreak attacks while maintaining task performance, outperforming existing defense methods in both robustness and efficiency. This work contributes a scalable and interpretable safety mechanism for embodied AI, bridging the gap between theoretical representation engineering and practical security applications. Our findings highlight the potential of latent-space interventions as a viable defense paradigm against emerging adversarial threats in physically grounded AI systems.
Related papers
- An Approach to Technical AGI Safety and Security [72.83728459135101]
We develop an approach to address the risk of harms consequential enough to significantly harm humanity.<n>We focus on technical approaches to misuse and misalignment.<n>We briefly outline how these ingredients could be combined to produce safety cases for AGI systems.
arXiv Detail & Related papers (2025-04-02T15:59:31Z) - Tit-for-Tat: Safeguarding Large Vision-Language Models Against Jailbreak Attacks via Adversarial Defense [90.71884758066042]
Large vision-language models (LVLMs) introduce a unique vulnerability: susceptibility to malicious attacks via visual inputs.<n>We propose ESIII (Embedding Security Instructions Into Images), a novel methodology for transforming the visual space from a source of vulnerability into an active defense mechanism.
arXiv Detail & Related papers (2025-03-14T17:39:45Z) - How Jailbreak Defenses Work and Ensemble? A Mechanistic Investigation [39.44000290664494]
Jailbreak attacks, where harmful prompts bypass generative models' built-in safety, raise serious concerns about model vulnerability.<n>This paper systematically examines jailbreak defenses by reframing the standard generation task as a binary classification problem.<n>We identify two key defense mechanisms: safety shift, which increases refusal rates across all queries, and harmfulness discrimination, which improves the model's ability to distinguish between harmful and benign inputs.
arXiv Detail & Related papers (2025-02-20T12:07:40Z) - Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking [26.812138599896997]
We propose Reasoning-to-Defend (R2D), a novel training paradigm that integrates safety reflections of queries and responses into LLM's generation process.<n>R2D effectively mitigates various attacks and improves overall safety, highlighting the substantial potential of safety-aware reasoning in strengthening LLMs' robustness against jailbreaks.
arXiv Detail & Related papers (2025-02-18T15:48:46Z) - Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks [25.212057612342218]
Large language models (LLMs) are susceptible to jailbreak attacks, which exploit system vulnerabilities to bypass safety measures and generate harmful outputs.<n>We propose a Latent-space Adversarial Training with Post-aware framework to address this problem.
arXiv Detail & Related papers (2025-01-18T02:57:12Z) - In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models [104.94706600050557]
Text-to-image (T2I) models have shown remarkable progress, but their potential to generate harmful content remains a critical concern in the ML community.<n>We propose ICER, a novel red-teaming framework that generates interpretable and semantic meaningful problematic prompts.<n>Our work provides crucial insights for developing more robust safety mechanisms in T2I systems.
arXiv Detail & Related papers (2024-11-25T04:17:24Z) - Exploring the Adversarial Vulnerabilities of Vision-Language-Action Models in Robotics [68.36528819227641]
This paper systematically quantifies the robustness of VLA-based robotic systems.<n>We introduce two untargeted attack objectives that leverage spatial foundations to destabilize robotic actions, and a targeted attack objective that manipulates the robotic trajectory.<n>We design an adversarial patch generation approach that places a small, colorful patch within the camera's view, effectively executing the attack in both digital and physical environments.
arXiv Detail & Related papers (2024-11-18T01:52:20Z) - Defining and Evaluating Physical Safety for Large Language Models [62.4971588282174]
Large Language Models (LLMs) are increasingly used to control robotic systems such as drones.
Their risks of causing physical threats and harm in real-world applications remain unexplored.
We classify the physical safety risks of drones into four categories: (1) human-targeted threats, (2) object-targeted threats, (3) infrastructure attacks, and (4) regulatory violations.
arXiv Detail & Related papers (2024-11-04T17:41:25Z) - Can We Trust Embodied Agents? Exploring Backdoor Attacks against Embodied LLM-based Decision-Making Systems [27.316115171846953]
Large Language Models (LLMs) have shown significant promise in real-world decision-making tasks for embodied AI.
LLMs are fine-tuned to leverage their inherent common sense and reasoning abilities while being tailored to specific applications.
This fine-tuning process introduces considerable safety and security vulnerabilities, especially in safety-critical cyber-physical systems.
arXiv Detail & Related papers (2024-05-27T17:59:43Z) - Attention-Based Real-Time Defenses for Physical Adversarial Attacks in
Vision Applications [58.06882713631082]
Deep neural networks exhibit excellent performance in computer vision tasks, but their vulnerability to real-world adversarial attacks raises serious security concerns.
This paper proposes an efficient attention-based defense mechanism that exploits adversarial channel-attention to quickly identify and track malicious objects in shallow network layers.
It also introduces an efficient multi-frame defense framework, validating its efficacy through extensive experiments aimed at evaluating both defense performance and computational cost.
arXiv Detail & Related papers (2023-11-19T00:47:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.