DUAP: Dual-task Universal Adversarial Perturbations Against Voice Control Systems
- URL: http://arxiv.org/abs/2601.12786v1
- Date: Mon, 19 Jan 2026 07:39:22 GMT
- Title: DUAP: Dual-task Universal Adversarial Perturbations Against Voice Control Systems
- Authors: Suyang Sun, Weifei Jin, Yuxin Cao, Wei Song, Jie Hao
- Abstract summary: We propose Dual-task Universal Adversarial Perturbation (DUAP). DUAP employs a targeted surrogate objective to effectively disrupt ASR transcription and introduces a Dynamic Normalized Ensemble (DNE) strategy to enhance transferability across diverse SR models. Extensive evaluations across five ASR and six SR models demonstrate that DUAP achieves high simultaneous attack success rates and superior imperceptibility.
- Score: 10.342045511863288
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern Voice Control Systems (VCS) rely on the collaboration of Automatic Speech Recognition (ASR) and Speaker Recognition (SR) for secure interaction. However, prior adversarial attacks typically target these tasks in isolation, overlooking the coupled decision pipeline in real-world scenarios. Consequently, single-task attacks often fail to pose a practical threat. To fill this gap, we first utilize gradient analysis to reveal that ASR and SR exhibit no inherent conflicts. Building on this, we propose Dual-task Universal Adversarial Perturbation (DUAP). Specifically, DUAP employs a targeted surrogate objective to effectively disrupt ASR transcription and introduces a Dynamic Normalized Ensemble (DNE) strategy to enhance transferability across diverse SR models. Furthermore, we incorporate psychoacoustic masking to ensure perturbation imperceptibility. Extensive evaluations across five ASR and six SR models demonstrate that DUAP achieves high simultaneous attack success rates and superior imperceptibility, significantly outperforming existing single-task baselines.
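The abstract describes two optimization ingredients: a dual-task objective combining an ASR surrogate loss with SR losses, and a Dynamic Normalized Ensemble that balances gradients across multiple SR surrogate models. The paper does not publish pseudocode here, so the following is a minimal, hypothetical sketch of one plausible reading: each SR model's gradient is rescaled to unit L2 norm before averaging, and the result is mixed with the ASR gradient to update the universal perturbation. All function names and the mixing parameter `alpha` are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of a dual-task update with a normalized gradient
# ensemble; names and weighting scheme are assumptions for illustration.

def normalized_ensemble(gradients):
    """Scale each surrogate SR model's gradient to unit L2 norm before
    averaging, so no single model dominates the ensemble direction."""
    normed = []
    for g in gradients:
        norm = sum(x * x for x in g) ** 0.5 or 1.0  # guard against zero grad
        normed.append([x / norm for x in g])
    dim = len(gradients[0])
    return [sum(g[i] for g in normed) / len(normed) for i in range(dim)]

def dual_task_step(delta, asr_grad, sr_grads, alpha=0.7, lr=0.01):
    """One gradient-descent step on the universal perturbation `delta`,
    mixing the ASR surrogate gradient with the SR ensemble gradient."""
    sr_grad = normalized_ensemble(sr_grads)
    return [d - lr * (alpha * ga + (1 - alpha) * gs)
            for d, ga, gs in zip(delta, asr_grad, sr_grad)]
```

In a full attack, a psychoacoustic masking constraint (as the abstract mentions) would additionally bound `delta` per frequency bin so the perturbation stays below audibility thresholds.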
Related papers
- ICON: Indirect Prompt Injection Defense for Agents based on Inference-Time Correction [24.416258744287166]
ICON is a probing-to-mitigation framework that neutralizes attacks while preserving task continuity. ICON achieves a competitive 0.4% ASR, matching commercial-grade detectors, while yielding an over-50% task utility gain.
arXiv Detail & Related papers (2026-02-24T09:13:05Z)
- MORE: Multi-Objective Adversarial Attacks on Speech Recognition [39.77140497042348]
Large-scale automatic speech recognition (ASR) models such as Whisper have expanded their adoption across diverse real-world applications. Robustness against even minor input perturbations is therefore critical for maintaining reliable performance in real-time environments. We introduce MORE, a multi-objective repetitive doubling encouragement attack that jointly degrades recognition accuracy and inference efficiency.
arXiv Detail & Related papers (2026-01-05T07:27:57Z) - Debiased Dual-Invariant Defense for Adversarially Robust Person Re-Identification [52.63017280231648]
Person re-identification (ReID) is a fundamental task in many real-world applications such as pedestrian trajectory tracking. Person ReID models are highly susceptible to adversarial attacks, where imperceptible perturbations to pedestrian images can cause entirely incorrect predictions. We propose a dual-invariant defense framework composed of two main phases.
arXiv Detail & Related papers (2025-11-13T03:56:40Z) - Two Heads Are Better Than One: Audio-Visual Speech Error Correction with Dual Hypotheses [71.34350093068473]
This paper introduces a new paradigm for the generative error correction (GER) framework in audio-visual speech recognition (AVSR). Our framework, DualHyp, empowers a large language model (LLM) to compose independent N-best hypotheses from separate automatic speech recognition (ASR) and visual speech recognition (VSR) models. Our framework attains up to a 57.7% error rate gain on the LRS2 benchmark over the standard ASR baseline, in contrast to single-stream GER approaches that achieve only a 10% gain.
arXiv Detail & Related papers (2025-10-15T08:27:16Z) - The Silent Saboteur: Imperceptible Adversarial Attacks against Black-Box Retrieval-Augmented Generation Systems [101.68501850486179]
We explore adversarial attacks against retrieval-augmented generation (RAG) systems to identify their vulnerabilities. This task aims to find imperceptible perturbations that retrieve a target document, originally excluded from the initial top-$k$ candidate set. We propose ReGENT, a reinforcement learning-based framework that tracks interactions between the attacker and the target RAG.
arXiv Detail & Related papers (2025-05-24T08:19:25Z) - Joint vs Sequential Speaker-Role Detection and Automatic Speech Recognition for Air-traffic Control [60.35553925189286]
We propose a transformer-based joint ASR-SRD system that solves both tasks jointly while relying on a standard ASR architecture.
We compare this joint system against two cascaded approaches for ASR and SRD on multiple ATC datasets.
arXiv Detail & Related papers (2024-06-19T21:11:01Z) - Watch What You Pretrain For: Targeted, Transferable Adversarial Examples
on Self-Supervised Speech Recognition models [27.414693266500603]
A targeted adversarial attack produces audio samples that can force an Automatic Speech Recognition system to output attacker-chosen text.
Recent work has shown that transferability against large ASR models is very difficult.
We show that modern ASR architectures, specifically ones based on Self-Supervised Learning, are in fact vulnerable to transfer-based attacks.
arXiv Detail & Related papers (2022-09-17T15:01:26Z) - Sequential Randomized Smoothing for Adversarially Robust Speech
Recognition [26.96883887938093]
Our paper overcomes some of these challenges by leveraging speech-specific tools like enhancement and ROVER voting to design an ASR model that is robust to perturbations. We show that our strongest defense is robust to all attacks that use inaudible noise, and can only be broken with very high distortion.
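The randomized-smoothing idea summarized above can be sketched in a few lines: transcribe several independently noised copies of the input and vote over the resulting hypotheses. This is a deliberately simplified illustration, with a plain majority vote standing in for ROVER-style word-level voting; the `asr` callable and all names are hypothetical placeholders, not the paper's implementation.

```python
# Simplified sketch of randomized smoothing for ASR: add Gaussian noise
# to the waveform several times and take a majority vote over the
# transcriptions. Majority voting here is a stand-in for ROVER voting.
import random
from collections import Counter

def smoothed_transcribe(asr, audio, sigma=0.01, n_votes=5, seed=0):
    """Transcribe `n_votes` independently noised copies of `audio` and
    return the most common transcription."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_votes):
        noised = [x + rng.gauss(0.0, sigma) for x in audio]
        votes.append(asr(noised))
    return Counter(votes).most_common(1)[0][0]
```

An adversarial perturbation crafted against the undefended model must now survive the added noise on a majority of draws, which is what makes low-distortion (inaudible) attacks fail against this defense.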
arXiv Detail & Related papers (2021-11-05T21:51:40Z) - Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech
Recognition [58.69803243323346]
Attention-based end-to-end automatic speech recognition (ASR) systems have recently demonstrated state-of-the-art results for numerous tasks.
However, the application of self-attention and attention-based encoder-decoder models remains challenging for streaming ASR.
We present the dual causal/non-causal self-attention architecture, which, in contrast to restricted self-attention, prevents the overall context from growing beyond the look-ahead of a single layer.
arXiv Detail & Related papers (2021-07-02T20:56:13Z) - Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.