Fugu-MT Paper Translation (Abstract): Shape and Substance: Dual-Layer Side-Channel Attacks on Local Vision-Language Models

Paper overview: Shape and Substance: Dual-Layer Side-Channel Attacks on Local Vision-Language Models

arxiv url: http://arxiv.org/abs/2603.25403v2
Date: Fri, 27 Mar 2026 15:01:28 GMT
Status: Translation complete
Last updated in system: 2026-03-30 18:25:50.567389
Title: Shape and Substance: Dual-Layer Side-Channel Attacks on Local Vision-Language Models
Authors: Eyal Hadad, Mordechai Guri
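Abstract summary: On-device Vision-Language Models (VLMs) promise data privacy via local execution. The architectural shift toward Dynamic High-Resolution preprocessing introduces an algorithmic side-channel. The authors demonstrate a dual-layer attack framework against local VLMs.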
Reference score (attention level, computed independently by this site): 2.1198879079315573
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: On-device Vision-Language Models (VLMs) promise data privacy via local execution. However, we show that the architectural shift toward Dynamic High-Resolution preprocessing (e.g., AnyRes) introduces an inherent algorithmic side-channel. Unlike static models, dynamic preprocessing decomposes images into a variable number of patches based on their aspect ratio, creating workload-dependent inputs. We demonstrate a dual-layer attack framework against local VLMs. In Tier 1, an unprivileged attacker can exploit significant execution-time variations using standard unprivileged OS metrics to reliably fingerprint the input's geometry. In Tier 2, by profiling Last-Level Cache (LLC) contention, the attacker can resolve semantic ambiguity within identical geometries, distinguishing between visually dense (e.g., medical X-rays) and sparse (e.g., text documents) content. By evaluating state-of-the-art models such as LLaVA-NeXT and Qwen2-VL, we show that combining these signals enables reliable inference of privacy-sensitive contexts. Finally, we analyze the security engineering trade-offs of mitigating this vulnerability, reveal substantial performance overhead with constant-work padding, and propose practical design recommendations for secure Edge AI deployments.
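
To make the abstract's point about workload-dependent inputs concrete, the following is a minimal sketch of AnyRes-style tiling, not the paper's code or any model's actual preprocessing. The tile size of 336 pixels, the candidate grid list, and the function names `select_grid` and `patch_count` are all illustrative assumptions. The sketch only shows that the number of patches handed to the vision encoder can depend on the input's aspect ratio.

```python
# Illustrative assumptions, not values taken from the paper:
TILE = 336  # assumed edge length of one square tile, in pixels
GRIDS = [(1, 2), (2, 1), (2, 2), (1, 3), (3, 1)]  # assumed (columns, rows) layouts


def select_grid(width, height, grids=GRIDS, tile=TILE):
    """Pick the grid that keeps the most image pixels after an
    aspect-preserving resize, breaking ties by least unused canvas area."""
    best_grid, best_key = None, None
    for cols, rows in grids:
        canvas_w, canvas_h = cols * tile, rows * tile
        # Integer arithmetic avoids floating-point rounding surprises.
        if canvas_w * height <= canvas_h * width:
            # Width is the limiting dimension.
            scaled_w = canvas_w
            scaled_h = height * canvas_w // width
        else:
            # Height is the limiting dimension.
            scaled_h = canvas_h
            scaled_w = width * canvas_h // height
        # Upscaling adds no information, so cap at the original pixel count.
        effective = min(scaled_w * scaled_h, width * height)
        wasted = canvas_w * canvas_h - effective
        key = (effective, -wasted)
        if best_key is None or key > best_key:
            best_grid, best_key = (cols, rows), key
    return best_grid


def patch_count(width, height):
    """Tiles in the chosen grid plus one downscaled global view (assumed)."""
    cols, rows = select_grid(width, height)
    return cols * rows + 1


if __name__ == "__main__":
    for size in [(1000, 1000), (3000, 1000), (1000, 3000)]:
        print(size, select_grid(*size), patch_count(*size))
```

Under these assumptions a square image selects the 2x2 grid while a 3:1 image selects a 3x1 grid, so the encoder processes a different number of patches for each shape. That variation in work is the kind of geometry-dependent signal the abstract describes for Tier 1.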

Related papers