Fugu-MT Paper Translation (Abstract): Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection

Paper abstract: Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection

arxiv url: http://arxiv.org/abs/2604.14604v1
Date: Thu, 16 Apr 2026 04:22:11 GMT
Status: Translation complete
Last updated in system: 2026-04-17 21:29:31.718855
Title: Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection
Title (reference translation): Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection
Authors: Meng Chen, Kun Wang, Li Lu, Jiaheng Zhang, Tianwei Zhang
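Abstract summary: Large audio-language models (LALMs) power intelligent voice interactions by tightly integrating audio and text. AudioHijack is a framework that generates context-agnostic and imperceptible adversarial audio to hijack LALMs. Experiments on 13 LALMs show consistent hijacking across 6 misbehavior categories.
Reference score (independently computed attention metric): 22.306688903148046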
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Modern large audio-language models (LALMs) power intelligent voice interactions by tightly integrating audio and text. This integration, however, expands the attack surface beyond text and introduces vulnerabilities in the continuous, high-dimensional audio channel. While prior work studied audio jailbreaks, the security risks of malicious audio injection and downstream behavior manipulation remain underexamined. In this work, we reveal a previously overlooked threat, auditory prompt injection, under realistic constraints of audio data-only access and strong perceptual stealth. To systematically analyze this threat, we propose AudioHijack, a general framework that generates context-agnostic and imperceptible adversarial audio to hijack LALMs. AudioHijack employs sampling-based gradient estimation for end-to-end optimization across diverse models, bypassing non-differentiable audio tokenization. Through attention supervision and multi-context training, it steers model attention toward adversarial audio and generalizes to unseen user contexts. We also design a convolutional blending method that modulates perturbations into natural reverberation, making them highly imperceptible to users. Extensive experiments on 13 state-of-the-art LALMs show consistent hijacking across 6 misbehavior categories, achieving average success rates of 79%-96% on unseen user contexts with high acoustic fidelity. Real-world studies demonstrate that commercial voice agents from Mistral AI and Microsoft Azure can be induced to execute unauthorized actions on behalf of users. These findings expose critical vulnerabilities in LALMs and highlight the urgent need for dedicated defense.
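Abstract (reference translation): Modern large audio-language models (LALMs) enable intelligent voice interaction by tightly integrating audio and text. This integration, however, extends the attack surface beyond text and introduces vulnerabilities in the continuous, high-dimensional audio channel. Earlier research examined audio jailbreaks, but the security risks of malicious audio injection and downstream behavior manipulation remain underexamined. This work reveals a previously overlooked threat, auditory prompt injection, under the realistic constraints of audio data-only access and strong perceptual stealth. To analyze this threat systematically, the authors propose AudioHijack, a general framework that generates context-agnostic and imperceptible adversarial audio to hijack LALMs. AudioHijack adopts sampling-based gradient estimation for end-to-end optimization across diverse models, bypassing non-differentiable audio tokenization. Through attention supervision and multi-context training, it steers model attention toward the adversarial audio and generalizes to unseen user contexts. The authors also design a convolutional blending method that modulates perturbations into natural reverberation, making them highly imperceptible to users. Extensive experiments on 13 state-of-the-art LALMs show consistent hijacking across 6 misbehavior categories, with average success rates of 79%-96% on unseen user contexts and high acoustic fidelity. Real-world studies show that commercial voice agents from Mistral AI and Microsoft Azure can be induced to execute unauthorized actions on behalf of users. These findings expose critical vulnerabilities in LALMs and highlight the urgent need for dedicated defenses.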
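
Note on "natural reverberation": the abstract gives no details of the convolutional blending method, so the following is only a minimal sketch of the generic signal-processing idea behind reverberation, namely convolving a dry signal with an impulse response. It is not the AudioHijack method and performs no optimization of any kind. The function names `synthetic_impulse_response` and `apply_reverb`, the exponentially decaying noise tail, and all parameter values are assumptions made for illustration.

```python
import numpy as np


def synthetic_impulse_response(sample_rate, duration_s=0.3, decay_s=0.05, seed=0):
    """Build a toy room impulse response (hypothetical helper, not from the paper).

    The response is a unit direct-path tap followed by a quiet noise tail
    whose envelope decays exponentially, a common rough model of reverb.
    """
    rng = np.random.default_rng(seed)
    n = int(round(sample_rate * duration_s))
    t = np.arange(n) / sample_rate
    ir = rng.standard_normal(n) * np.exp(-t / decay_s)
    ir[0] = 0.0
    ir *= 0.1 / np.max(np.abs(ir))  # keep the tail well below the direct path
    ir[0] = 1.0                     # direct (unreflected) sound
    return ir


def apply_reverb(audio, ir):
    """Convolve audio with an impulse response and keep the original length."""
    wet = np.convolve(audio, ir, mode="full")[: len(audio)]
    peak = np.max(np.abs(wet))
    # Rescale only if the result would clip outside [-1, 1].
    return wet / peak if peak > 1.0 else wet


if __name__ == "__main__":
    sr = 16000
    # One second of a 440 Hz tone stands in for a speech recording.
    dry = 0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
    wet = apply_reverb(dry, synthetic_impulse_response(sr))
    print(dry.shape, wet.shape)
```

Because convolution with an impulse response is how real rooms color sound, a signal shaped this way tends to be heard as ordinary reverberation rather than as added noise, which is presumably why the abstract describes the resulting perturbations as highly imperceptible.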


Related papers