Fugu-MT Paper Translation (Abstract): SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning

Paper overview: SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning

arxiv url: http://arxiv.org/abs/2605.15044v1
Date: Thu, 14 May 2026 16:36:57 GMT
Status: Translation complete
Last updated in system: 2026-05-15 21:45:34.952788
Title: SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning
Title (reference translation): SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification
Authors: KiHyun Nam, Jungwoo Heo, Siu Bae, Ha-Jin Yu, Joon Son Chung
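Abstract summary: SpeakerLLM is a speaker-specialized audio-LLM framework that unifies single-utterance speaker profiling, recording-condition understanding, utterance-pair speaker comparison, and evidence-organized verification reasoning within a natural-language interface. SpeakerLLM uses a hierarchical speaker tokenizer designed to capture multiple granularities of speaker evidence.
Reference score (attention score computed independently by this site): 39.96444209787657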
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As audio-first agents become increasingly common in physical AI, conversational robots, and screenless wearables, audio large language models (audio-LLMs) must integrate speaker-specific understanding to support user authorization, personalization, and context-aware interaction. This requires modeling who is speaking, how the voice sounds, and how recording conditions affect speaker cues. Conventional speaker verification systems provide strong scalar scores but little linguistic evidence, while current audio-LLMs and speaker-aware language models have limited ability to organize speaker information beyond binary labels or descriptive profiles. We present SpeakerLLM, a speaker-specialized audio-LLM framework that unifies single-utterance speaker profiling, recording-condition understanding, utterance-pair speaker comparison, and evidence-organized verification reasoning within a natural-language interface. We construct verification-reasoning targets and a decision-composition policy that separate profile-level evidence from the final same-or-different decision and organize recording condition, profile evidence, and the decision into a structured trace. At its core, SpeakerLLM uses a hierarchical speaker tokenizer designed to capture multiple granularities of speaker evidence. Utterance-level speaker embeddings summarize identity and profile-level cues, whereas frame-level speaker features preserve fine-grained acoustic descriptors. Experiments show that SpeakerLLM-Base improves speaker-profile and recording-condition understanding over general audio-LLMs, while SpeakerLLM-VR preserves strong generated-verdict accuracy and produces decision traces grounded in the supervised verification reasoning schema. We will release the metadata-enriched supervision dataset and target-construction code for reproducibility.
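Abstract (reference translation): As audio-first agents become increasingly common in physical AI, conversational robots, and screenless wearables, audio large language models (audio-LLMs) need to integrate speaker-specific understanding in order to support user authorization, personalization, and context-aware interaction. This requires modeling who is speaking, how the voice sounds, and how recording conditions affect speaker cues. Conventional speaker verification systems provide strong scalar scores but little linguistic evidence, while current audio-LLMs and speaker-aware language models are limited in their ability to organize speaker information beyond binary labels or descriptive profiles. This paper proposes SpeakerLLM, a speaker-specialized audio-LLM framework that unifies single-utterance speaker profiling, recording-condition understanding, utterance-pair speaker comparison, and evidence-organized verification reasoning within a natural-language interface. The authors construct verification-reasoning targets and a decision-composition policy that separate profile-level evidence from the final same-or-different decision and organize the recording condition, the profile evidence, and the decision into a structured trace. At the core of SpeakerLLM is a hierarchical speaker tokenizer designed to capture speaker evidence at multiple granularities. Utterance-level speaker embeddings summarize identity and profile-level cues, whereas frame-level speaker features preserve fine-grained acoustic descriptors. Experiments show that SpeakerLLM-Base improves speaker-profile and recording-condition understanding over general audio-LLMs, while SpeakerLLM-VR preserves strong generated-verdict accuracy and produces decision traces grounded in the supervised verification reasoning schema. The authors will release the metadata-enriched supervision dataset and the target-construction code for reproducibility.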
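Illustrative sketch (not from the paper): the abstract contrasts a conventional scalar verification score with a structured trace that keeps the recording condition, the profile evidence, and the final decision as separate fields. The Python below is a minimal sketch of that contrast under stated assumptions. The names `VerificationTrace`, `cosine_score`, and `compose_trace`, the example evidence strings, and the 0.5 threshold are all hypothetical. The cosine-threshold decision is a stand-in for a conventional verifier and is not the decision-composition policy described by the authors.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class VerificationTrace:
    """Hypothetical container: evidence fields kept apart from the verdict."""
    recording_condition: str     # e.g. a short description of noise or channel
    profile_evidence: List[str]  # descriptive, profile-level observations
    decision: str                # "same" or "different"


def cosine_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Conventional scalar score: cosine similarity of two speaker embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def compose_trace(emb_a: Sequence[float], emb_b: Sequence[float],
                  recording_condition: str, profile_evidence: List[str],
                  threshold: float = 0.5) -> VerificationTrace:
    """Assumed stand-in policy: threshold the scalar score, then attach the
    evidence fields unchanged so the verdict stays separate from them."""
    score = cosine_score(emb_a, emb_b)
    decision = "same" if score >= threshold else "different"
    return VerificationTrace(recording_condition, list(profile_evidence), decision)


# Toy 2-dimensional embeddings, for illustration only.
same_trace = compose_trace([1.0, 0.0], [1.0, 0.0],
                           "quiet room", ["similar pitch range"])
diff_trace = compose_trace([1.0, 0.0], [0.0, 1.0],
                           "street noise", ["different pitch range"])
```

Keeping the evidence in its own fields means a reader can inspect why a verdict was reached without the evidence being collapsed into the single score.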

Related papers