Fugu-MT Paper Translation (Abstract): StreamVoiceAnon+: Emotion-Preserving Streaming Speaker Anonymization via Frame-Level Acoustic Distillation

Paper summary: StreamVoiceAnon+: Emotion-Preserving Streaming Speaker Anonymization via Frame-Level Acoustic Distillation

arxiv url: http://arxiv.org/abs/2603.06079v1
Date: Fri, 06 Mar 2026 09:30:20 GMT
Status: Translation complete
Last updated in system: 2026-03-09 13:17:45.492272
Title: StreamVoiceAnon+: Emotion-Preserving Streaming Speaker Anonymization via Frame-Level Acoustic Distillation
Title (reference translation): StreamVoiceAnon+: Emotion-Preserving Streaming Speaker Anonymization via Frame-Level Acoustic Distillation
Authors: Nikita Kuzmin, Kong Aik Lee, Eng Siong Chng
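Abstract summary: We address the challenge of preserving emotional content in streaming speaker anonymization (SA). We propose supervised finetuning with neutral-emotion utterance pairs from the same speaker, combined with frame-level emotion distillation on acoustic token hidden states. On the VoicePrivacy 2024 protocol, the approach achieves 49.2% UAR (emotion preservation) and 5.77% WER (intelligibility).
Reference score (independently computed attention score): 56.49717639074325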
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We address the challenge of preserving emotional content in streaming speaker anonymization (SA). Neural audio codec language models trained for audio continuation tend to degrade source emotion: content tokens discard emotional information, and the model defaults to dominant acoustic patterns rather than preserving paralinguistic attributes. We propose supervised finetuning with neutral-emotion utterance pairs from the same speaker, combined with frame-level emotion distillation on acoustic token hidden states. All modifications are confined to finetuning, which takes less than 2 hours on 4 GPUs and adds zero inference latency overhead, while maintaining a competitive 180ms streaming latency. On the VoicePrivacy 2024 protocol, our approach achieves a 49.2% UAR (emotion preservation) with 5.77% WER (intelligibility), a +24% relative UAR improvement over the baseline (39.7%->49.2%) and +10% over the emotion-prompt variant (44.6% UAR), while maintaining strong privacy (EER 49.0%). Demo and code are available: https://anonymous3842031239.github.io/
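The relative gains quoted in the abstract can be checked from the three UAR figures it reports. The following is a minimal sketch of that arithmetic only; the helper name `relative_improvement` is an illustrative assumption and does not come from the paper or its code.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Return the change of `new` over `baseline` as a percentage of `baseline`."""
    return (new - baseline) / baseline * 100.0


# UAR values (in percent) as reported in the abstract.
uar_baseline = 39.7        # baseline system
uar_emotion_prompt = 44.6  # emotion-prompt variant
uar_proposed = 49.2        # proposed approach

# Relative improvements, rounded to whole percent as in the abstract.
gain_over_baseline = round(relative_improvement(uar_baseline, uar_proposed))
gain_over_prompt = round(relative_improvement(uar_emotion_prompt, uar_proposed))

print(f"vs. baseline:       +{gain_over_baseline}% relative")
print(f"vs. emotion prompt: +{gain_over_prompt}% relative")
```

In absolute terms the same comparisons are gains of 9.5 and 4.6 UAR points, so the "+24%" and "+10%" figures are relative to each reference system's own UAR rather than percentage-point differences.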
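Abstract (reference translation): We address the challenge of preserving emotional content in streaming speaker anonymization (SA). Neural audio codec language models trained for audio continuation tend to degrade source emotion. We propose supervised finetuning with neutral-emotion utterance pairs from the same speaker, combined with frame-level emotion distillation on acoustic token hidden states. All modifications are confined to finetuning, which takes less than 2 hours on 4 GPUs and adds zero inference latency overhead while maintaining a competitive 180 ms streaming latency. On the VoicePrivacy 2024 protocol, the approach achieves 49.2% UAR (emotion preservation) with 5.77% WER (intelligibility), a +24% relative UAR improvement over the baseline (39.7% -> 49.2%) and +10% over the emotion-prompt variant (44.6% UAR), while maintaining strong privacy (EER 49.0%). Demo and code are available: https://anonymous3842031239.github.io/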


Related papers