On the Adversarial Robustness of 3D Large Vision-Language Models
- URL: http://arxiv.org/abs/2601.06464v1
- Date: Sat, 10 Jan 2026 07:17:29 GMT
- Title: On the Adversarial Robustness of 3D Large Vision-Language Models
- Authors: Chao Liu, Ngai-Man Cheung
- Abstract summary: 3D Vision-Language Models (VLMs) have shown strong reasoning and generalization abilities in 3D understanding tasks. Prior work in 2D VLMs has shown that the integration of visual inputs significantly increases vulnerability to adversarial attacks. We present the first systematic study of adversarial robustness in point-based 3D VLMs.
- Score: 23.749171815087774
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: 3D Vision-Language Models (VLMs), such as PointLLM and GPT4Point, have shown strong reasoning and generalization abilities in 3D understanding tasks. However, their adversarial robustness remains largely unexplored. Prior work in 2D VLMs has shown that the integration of visual inputs significantly increases vulnerability to adversarial attacks, making these models easier to manipulate into generating toxic or misleading outputs. In this paper, we investigate whether incorporating 3D vision similarly compromises the robustness of 3D VLMs. To this end, we present the first systematic study of adversarial robustness in point-based 3D VLMs. We propose two complementary attack strategies: *Vision Attack*, which perturbs the visual token features produced by the 3D encoder and projector to assess the robustness of vision-language alignment; and *Caption Attack*, which directly manipulates output token sequences to evaluate end-to-end system robustness. Each attack includes both untargeted and targeted variants to measure general vulnerability and susceptibility to controlled manipulation. Our experiments reveal that 3D VLMs exhibit significant adversarial vulnerabilities under untargeted attacks, while demonstrating greater resilience against targeted attacks aimed at forcing specific harmful outputs, compared to their 2D counterparts. These findings highlight the importance of improving the adversarial robustness of 3D VLMs, especially as they are deployed in safety-critical applications.
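Since the abstract only describes the two attacks at a high level, the sketch below illustrates what an untargeted Vision Attack of this kind could look like in practice: a PGD-style perturbation of the input point cloud that pushes the visual token features away from their clean values. The `vision_tower` module, its interface, and all hyperparameters are assumptions made for illustration, not the paper's actual implementation.

```python
# Minimal sketch (not the authors' released code): an untargeted, PGD-style "Vision Attack"
# against a point-based 3D VLM. `vision_tower` is a hypothetical encoder+projector module
# (PointLLM/GPT4Point-style) that maps a point cloud to visual token features; the paper's
# exact objective, perturbation budget, and optimizer may differ.
import torch
import torch.nn.functional as F

def untargeted_vision_attack(vision_tower, points, eps=0.05, alpha=0.01, steps=10):
    """Perturb a point cloud `points` (N, 3) inside an L-infinity ball of radius `eps`
    so that its visual token features drift away from the clean features."""
    with torch.no_grad():
        clean_tokens = vision_tower(points)            # clean reference features

    delta = torch.zeros_like(points, requires_grad=True)
    for _ in range(steps):
        adv_tokens = vision_tower(points + delta)
        # Untargeted objective: maximize the distance to the clean token features.
        # (A targeted variant would instead minimize distance to the features of a chosen
        #  target object; a "Caption Attack" would place the loss on the LLM's output logits.)
        loss = F.mse_loss(adv_tokens, clean_tokens)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()         # gradient *ascent* on the feature distance
            delta.clamp_(-eps, eps)                    # stay inside the perturbation budget
        delta.grad.zero_()
    return (points + delta).detach()
```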
Related papers
- Abstract 3D Perception for Spatial Intelligence in Vision-Language Models [100.13033631690114]
Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding. We introduce SandboxVLM, a framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLMs. Our approach consistently improves spatial intelligence, achieving an 8.3% gain on SAT Real compared with baseline methods.
arXiv Detail & Related papers (2025-11-14T04:16:09Z) - Universal Camouflage Attack on Vision-Language Models for Autonomous Driving [67.34987318443761]
Visual language modeling for automated driving is emerging as a promising research direction. VLM-AD remains vulnerable to serious security threats from adversarial attacks. We propose the first Universal Camouflage Attack framework for VLM-AD.
arXiv Detail & Related papers (2025-09-24T14:52:01Z) - Reinforced Embodied Active Defense: Exploiting Adaptive Interaction for Robust Visual Perception in Adversarial 3D Environments [26.37868865624549]
Adversarial attacks in 3D environments have emerged as a critical threat to the reliability of visual perception systems. We introduce Reinforced Embodied Active Defense (Rein-EAD), a proactive defense framework that leverages adaptive exploration and interaction with the environment. Rein-EAD exhibits robust generalization to unseen and adaptive attacks, making it suitable for real-world complex tasks.
arXiv Detail & Related papers (2025-07-24T14:56:21Z) - Tit-for-Tat: Safeguarding Large Vision-Language Models Against Jailbreak Attacks via Adversarial Defense [90.71884758066042]
Large vision-language models (LVLMs) introduce a unique vulnerability: susceptibility to malicious attacks via visual inputs. We propose ESIII (Embedding Security Instructions Into Images), a novel methodology for transforming the visual space from a source of vulnerability into an active defense mechanism.
arXiv Detail & Related papers (2025-03-14T17:39:45Z) - AdvMono3D: Advanced Monocular 3D Object Detection with Depth-Aware Robust Adversarial Training [64.14759275211115]
We propose a depth-aware robust adversarial training method for monocular 3D object detection, dubbed DART3D.
Our adversarial training approach capitalizes on the inherent uncertainty, enabling the model to significantly improve its robustness against adversarial attacks.
arXiv Detail & Related papers (2023-09-03T07:05:32Z) - On the Adversarial Robustness of Camera-based 3D Object Detection [21.091078268929667]
We investigate the robustness of leading camera-based 3D object detection approaches under various adversarial conditions.
We find that bird's-eye-view-based representations exhibit stronger robustness against localization attacks, that depth-estimation-free approaches have the potential to show stronger robustness, and that incorporating multi-frame benign inputs can effectively mitigate adversarial attacks.
arXiv Detail & Related papers (2023-01-25T18:59:15Z) - A Comprehensive Study of the Robustness for LiDAR-based 3D Object Detectors against Adversarial Attacks [84.10546708708554]
3D object detectors are increasingly crucial for security-critical tasks.
It is imperative to understand their robustness against adversarial attacks.
This paper presents the first comprehensive evaluation and analysis of the robustness of LiDAR-based 3D detectors under adversarial attacks.
arXiv Detail & Related papers (2022-12-20T13:09:58Z) - Exploring Adversarial Robustness of Multi-Sensor Perception Systems in Self Driving [87.3492357041748]
In this paper, we showcase practical susceptibilities of multi-sensor detection by placing an adversarial object on top of a host vehicle.
Our experiments demonstrate that successful attacks are primarily caused by easily corrupted image features.
Towards more robust multi-modal perception systems, we show that adversarial training with feature denoising can boost robustness to such attacks significantly.
arXiv Detail & Related papers (2021-01-17T21:15:34Z)