Fugu-MT Paper Translation (Abstract): VoxSafeBench: Not Just What Is Said, but Who, How, and Where

Paper overview: VoxSafeBench: Not Just What Is Said, but Who, How, and Where

arxiv url: http://arxiv.org/abs/2604.14548v1
Date: Thu, 16 Apr 2026 02:24:59 GMT
Status: Translation complete
Last updated in system: 2026-04-17 21:29:31.68479
Title: VoxSafeBench: Not Just What Is Said, but Who, How, and Where
Title (reference translation): VoxSafeBench: Not Just What Is Said, but Who, How, and Where
Authors: Yuxiang Wang, Hongyu Liu, Yijiang Xu, Qinke Ni, Li Wang, Wan Lin, Kunyu Feng, Dekun Chen, Xu Tan, Lei Wang, Jie Shi, Zhizheng Wu
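Abstract summary: This paper introduces VoxSafeBench, among the first benchmarks to jointly evaluate social alignment in speech language models. Tier 1 evaluates content-centric risks using matched text and audio inputs. Tier 2 targets audio-conditioned risks in which the transcript is benign but the appropriate response hinges on the speaker. The authors find that safeguards that appear robust on text often degrade in speech.
Reference score (independently computed attention measure): 30.964672315074413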
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As speech language models (SLMs) transition from personal devices into shared, multi-user environments, their responses must account for far more than the words alone. Who is speaking, how they sound, and where the conversation takes place can each turn an otherwise benign request into one that is unsafe, unfair, or privacy-violating. Existing benchmarks, however, largely focus on basic audio comprehension, study individual risks in isolation, or conflate content that is inherently harmful with content that only becomes problematic due to its acoustic context. We introduce VoxSafeBench, among the first benchmarks to jointly evaluate social alignment in SLMs across three dimensions: safety, fairness, and privacy. VoxSafeBench adopts a Two-Tier design: Tier1 evaluates content-centric risks using matched text and audio inputs, while Tier2 targets audio-conditioned risks in which the transcript is benign but the appropriate response hinges on the speaker, paralinguistic cues, or the surrounding environment. To validate Tier2, we include intermediate perception probes and confirm that frontier SLMs can successfully detect these acoustic cues yet still fail to act on them appropriately. Across 22 tasks with bilingual coverage, we find that safeguards appearing robust on text often degrade in speech: safety awareness drops for speaker- and scene-conditioned risks, fairness erodes when demographic differences are conveyed vocally, and privacy protections falter when contextual cues arrive acoustically. Together, these results expose a pervasive speech grounding gap: current SLMs frequently recognize the relevant social norm in text but fail to apply it when the decisive cue must be grounded in speech. Code and data are publicly available at: https://amphionteam.github.io/VoxSafeBench_demopage/
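Abstract (reference translation): As speech language models (SLMs) move from personal devices into shared, multi-user environments, their responses must take into account far more than the words alone. Who is speaking, how they sound, and where the conversation takes place can each turn an otherwise benign request into one that is unsafe, unfair, or privacy-violating. However, existing benchmarks mainly focus on basic audio comprehension, study individual risks in isolation, or conflate inherently harmful content with content that becomes problematic only because of its acoustic context. We introduce VoxSafeBench, among the first benchmarks to jointly evaluate social alignment in SLMs across three dimensions: safety, fairness, and privacy. Tier1 evaluates content-centric risks using matched text and audio inputs, while Tier2 targets audio-conditioned risks in which the transcript is benign but the appropriate response depends on the speaker, paralinguistic cues, or the surrounding environment. To validate Tier2, we include intermediate perception probes and confirm that frontier SLMs can detect these acoustic cues yet still fail to act on them appropriately. Across 22 tasks with bilingual coverage, safety awareness drops for speaker- and scene-conditioned risks, fairness erodes when demographic differences are conveyed vocally, and privacy protections falter when contextual cues arrive acoustically. Current SLMs frequently recognize the relevant social norm in text but fail to apply it when the decisive cue must be grounded in speech. Code and data are publicly available at: https://amphionteam.github.io/VoxSafeBench_demopage/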

Related papers