Fugu-MT Paper Translation (Abstract): How Far Are VLMs from Privacy Awareness in the Physical World? An Empirical Study

Paper overview: How Far Are VLMs from Privacy Awareness in the Physical World? An Empirical Study

arxiv url: http://arxiv.org/abs/2605.05340v2
Date: Fri, 08 May 2026 01:54:08 GMT
Status: Translation complete
System update date: 2026-05-11 16:31:22.928316
Title: How Far Are VLMs from Privacy Awareness in the Physical World? An Empirical Study
Title (reference translation): How Far Does VLM Privacy Awareness Extend in the Physical World? An Empirical Study
Authors: Junran Wang, Xinjie Shen, Zehao Jin, Pan Li
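Abstract summary: Vision-Language Models (VLMs) are increasingly deployed as autonomous cognitive cores for embodied assistants. ImmersedPrivacy is an interactive audio-visual evaluation framework that simulates realistic physical environments. An evaluation of 12 state-of-the-art models reveals consistent deficits.
Reference score (independently calculated attention score): 7.537653216205245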
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As Vision-Language Models (VLMs) are increasingly deployed as autonomous cognitive cores for embodied assistants, evaluating their privacy awareness in physical environments becomes critical. Unlike digital chatbots, these agents operate in intimate spaces, such as homes and hospitals, where they possess the physical agency to observe and manipulate privacy-sensitive information and artifacts. However, current benchmarks remain limited to unimodal, text-based representations that cannot capture the demands of real-world settings. To bridge this gap, we present ImmersedPrivacy, an interactive audio-visual evaluation framework that simulates realistic physical environments using a Unity-based simulator. ImmersedPrivacy evaluates physically grounded privacy awareness across three progressive tiers that test a model's ability to identify sensitive items in cluttered scenes, adapt to shifting social contexts, and resolve conflicts between explicit commands and inferred privacy constraints. Our evaluation of 12 state-of-the-art models reveals consistent deficits. In cluttered scenes, all models exhibit monotonic performance decay as scene complexity grows, owing to perceptual deficits. When social context shifts, no model exceeds 65% selection accuracy. Under conflicting commands, the best model, gemini-3.1-pro, perfectly balances task completion and privacy preservation in only 51% of cases. These findings reveal that current VLMs in the physical world suffer from perceptual fragility and fail to let their knowledge of privacy cues govern their situated behavior. Our code and data are available at https://github.com/immersed-privacy/immersed-privacy.
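Abstract (reference translation): Because Vision-Language Models (VLMs) are increasingly deployed as autonomous cognitive cores for embodied assistants, it is becoming important to evaluate their privacy awareness in physical environments. Unlike digital chatbots, these agents work in intimate spaces such as homes and hospitals, where they have the physical agency to observe and manipulate privacy-sensitive information and artifacts. Current benchmarks, however, are limited to unimodal, text-based representations that cannot capture the demands of real-world settings. To fill this gap, the authors propose ImmersedPrivacy, an interactive audio-visual evaluation framework that simulates realistic physical environments with a Unity-based simulator. ImmersedPrivacy evaluates physically grounded privacy awareness across three progressive tiers, which test whether a model can identify sensitive items in cluttered scenes, adapt to changes in social context, and resolve conflicts between explicit commands and inferred privacy constraints. Evaluating 12 state-of-the-art models reveals consistent deficits. In cluttered scenes, every model shows a monotonic decline in performance as scene complexity increases, owing to perceptual deficits. When the social context changes, no model exceeds 65% selection accuracy. Under conflicting commands, the best model, gemini-3.1-pro, perfectly balances task completion and privacy preservation in only 51% of cases. These results show that current VLMs in the physical world suffer from perceptual fragility and fail to let their knowledge of privacy cues govern their situated behavior. The code and data are available at https://github.com/immersed-privacy/immersed-privacy.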

Related papers