See No Evil: Adversarial Attacks Against Linguistic-Visual Association in Referring Multi-Object Tracking Systems
- URL: http://arxiv.org/abs/2509.02028v2
- Date: Wed, 03 Sep 2025 02:28:19 GMT
- Title: See No Evil: Adversarial Attacks Against Linguistic-Visual Association in Referring Multi-Object Tracking Systems
- Authors: Halima Bouzidi, Haoyu Liu, Mohammad Abdullah Al Faruque,
- Abstract summary: We present VEIL, a novel adversarial framework designed to disrupt the unified referring-matching mechanisms of RMOT models.<n>We show that carefully crafted digital and physical perturbations can corrupt the tracking logic reliability, inducing track ID switches and terminations.
- Score: 21.34084466103555
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language-vision understanding has driven the development of advanced perception systems, most notably the emerging paradigm of Referring Multi-Object Tracking (RMOT). By leveraging natural-language queries, RMOT systems can selectively track objects that satisfy a given semantic description, guided through Transformer-based spatial-temporal reasoning modules. End-to-End (E2E) RMOT models further unify feature extraction, temporal memory, and spatial reasoning within a Transformer backbone, enabling long-range spatial-temporal modeling over fused textual-visual representations. Despite these advances, the reliability and robustness of RMOT remain underexplored. In this paper, we examine the security implications of RMOT systems from a design-logic perspective, identifying adversarial vulnerabilities that compromise both the linguistic-visual referring and track-object matching components. Additionally, we uncover a novel vulnerability in advanced RMOT models employing FIFO-based memory, whereby targeted and consistent attacks on their spatial-temporal reasoning introduce errors that persist within the history buffer over multiple subsequent frames. We present VEIL, a novel adversarial framework designed to disrupt the unified referring-matching mechanisms of RMOT models. We show that carefully crafted digital and physical perturbations can corrupt the tracking logic reliability, inducing track ID switches and terminations. We conduct comprehensive evaluations using the Refer-KITTI dataset to validate the effectiveness of VEIL and demonstrate the urgent need for security-aware RMOT designs for critical large-scale applications.
Related papers
- RAGTrack: Language-aware RGBT Tracking with Retrieval-Augmented Generation [71.2136732268131]
RGB-Thermal (RGBT) tracking aims to achieve robust object localization across diverse environmental conditions.<n>Existing RGBT trackers rely solely on initial-frame visual information for target modeling.<n>We propose RAGTrack, a novel Retrieval-Augmented Generation framework for robust RGBT tracking.
arXiv Detail & Related papers (2026-03-04T01:02:04Z) - AgentLongBench: A Controllable Long Benchmark For Long-Contexts Agents via Environment Rollouts [78.33143446024485]
We introduce textbfAgentLongBench, which evaluates agents through simulated environment rollouts based on Lateral Thinking Puzzles.<n>This framework generates rigorous interaction trajectories across knowledge-intensive and knowledge-free scenarios.
arXiv Detail & Related papers (2026-01-28T16:05:44Z) - Stable Language Guidance for Vision-Language-Action Models [62.80963701282789]
Residual Semantic Steering is a probabilistic framework that disentangles physical affordance from semantic execution.<n> RSS achieves state-of-the-art robustness, maintaining performance even under adversarial linguistic perturbations.
arXiv Detail & Related papers (2026-01-07T16:16:10Z) - Online Segment Any 3D Thing as Instance Tracking [60.20416622842975]
We reconceptualize online 3D segmentation as an instance tracking problem (AutoSeg3D)<n>We introduce spatial consistency learning to mitigate the fragmentation problem inherent in Vision Foundation Models.<n>Our method establishes a new state-of-the-art, surpassing ESAM by 2.8 AP on ScanNet200.
arXiv Detail & Related papers (2025-12-08T14:48:51Z) - CITADEL: Continual Anomaly Detection for Enhanced Learning in IoT Intrusion Detection [9.92596575679496]
Internet of Things (IoT) is vulnerable to a wide range of cyber threats.<n>Intrusion detection systems (IDS) have been extensively studied to enhance IoT security.<n>We propose CITADEL, a self-supervised continual learning framework to extract robust representations from benign data.
arXiv Detail & Related papers (2025-08-26T21:55:26Z) - Proactive Disentangled Modeling of Trigger-Object Pairings for Backdoor Defense [0.0]
Deep neural networks (DNNs) and generative AI (GenAI) are increasingly vulnerable to backdoor attacks.<n>In this paper, we introduce DBOM, a proactive framework that leverages structured disentanglement to identify and neutralize both seen and unseen backdoor threats.<n>We show that DBOM robustly detects poisoned images prior to downstream training, significantly enhancing the security of training pipelines.
arXiv Detail & Related papers (2025-08-03T21:58:15Z) - Disrupting Vision-Language Model-Driven Navigation Services via Adversarial Object Fusion [56.566914768257035]
We present Adversarial Object Fusion (AdvOF), a novel attack framework targeting vision-and-language navigation (VLN) agents in service-oriented environments.<n>We show AdvOF can effectively degrade agent performance under adversarial conditions while maintaining minimal interference with normal navigation tasks.<n>This work advances the understanding of service security in VLM-powered navigation systems, providing computational foundations for robust service composition in physical-world deployments.
arXiv Detail & Related papers (2025-05-29T09:14:50Z) - Temporal Context Awareness: A Defense Framework Against Multi-turn Manipulation Attacks on Large Language Models [0.0]
Large Language Models (LLMs) are increasingly vulnerable to sophisticated multi-turn manipulation attacks.<n>This paper introduces the Temporal Context Awareness framework, a novel defense mechanism designed to address this challenge.<n>Preliminary evaluations on simulated adversarial scenarios demonstrate the framework's potential to identify subtle manipulation patterns.
arXiv Detail & Related papers (2025-03-18T22:30:17Z) - From Objects to Events: Unlocking Complex Visual Understanding in Object Detectors via LLM-guided Symbolic Reasoning [71.41062111470414]
Current object detectors excel at entity localization and classification, yet exhibit inherent limitations in event recognition capabilities.<n>We present a novel framework that expands the capability of standard object detectors beyond mere object recognition to complex event understanding.<n>Our key innovation lies in bridging the semantic gap between object detection and event understanding without requiring expensive task-specific training.
arXiv Detail & Related papers (2025-02-09T10:30:54Z) - Real-Time Anomaly Detection and Reactive Planning with Large Language Models [18.57162998677491]
Foundation models, e.g., large language models (LLMs), trained on internet-scale data possess zero-shot capabilities.
We present a two-stage reasoning framework that incorporates the judgement regarding potential anomalies into a safe control framework.
This enables our monitor to improve the trustworthiness of dynamic robotic systems, such as quadrotors or autonomous vehicles.
arXiv Detail & Related papers (2024-07-11T17:59:22Z) - Transformer Network for Multi-Person Tracking and Re-Identification in
Unconstrained Environment [0.6798775532273751]
Multi-object tracking (MOT) has profound applications in a variety of fields, including surveillance, sports analytics, self-driving, and cooperative robotics.
We put forward an integrated MOT method that marries object detection and identity linkage within a singular, end-to-end trainable framework.
Our system leverages a robust memory-temporal memory module that retains extensive historical observations and effectively encodes them using an attention-based aggregator.
arXiv Detail & Related papers (2023-12-19T08:15:22Z) - Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection [54.041049052843604]
We present STEMD, a novel end-to-end framework that enhances the DETR-like paradigm for multi-frame 3D object detection.
First, to model the inter-object spatial interaction and complex temporal dependencies, we introduce the spatial-temporal graph attention network.
Finally, it poses a challenge for the network to distinguish between the positive query and other highly similar queries that are not the best match.
arXiv Detail & Related papers (2023-07-01T13:53:14Z) - Exploiting Multi-Object Relationships for Detecting Adversarial Attacks
in Complex Scenes [51.65308857232767]
Vision systems that deploy Deep Neural Networks (DNNs) are known to be vulnerable to adversarial examples.
Recent research has shown that checking the intrinsic consistencies in the input data is a promising way to detect adversarial attacks.
We develop a novel approach to perform context consistency checks using language models.
arXiv Detail & Related papers (2021-08-19T00:52:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.