Fugu-MT Paper Translation (Abstract): VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

Paper overview: VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

arxiv url: http://arxiv.org/abs/2605.06765v1
Date: Thu, 07 May 2026 17:59:56 GMT
Status: Translation complete
Last updated in system: 2026-05-11 19:43:38.523095
Title: VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
Title (reference translation): VITA-QinYu: An Expressive Spoken Language Model for Role-Playing and Singing
Authors: Jiacheng Xu, Heting Gao, Liufei Xie, Zhenchuan Yang, Lijiang Li, Yiting Chen, Bin Zhang, Meng Chen, Chaoyu Fu, Weifeng Zhao, Wenjiang Zhou
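Abstract summary: We present VITA-QinYu, the first expressive end-to-end (E2E) spoken language model that supports both role-playing and singing generation. We synthesize a total of 15.8K hours of natural conversation, role-playing, and singing data for training. VITA-QinYu surpasses peer models by 0.13 points on a 5-point MOS scale for singing and outperforms peer SLMs by 7 percentage points on objective role-playing benchmarks.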
Reference score (independently computed attention score): 17.32511504880848
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Human speech conveys expressiveness beyond linguistic content, including personality, mood, or performance elements, such as a comforting tone or humming a song, which we formalize as role-playing and singing. We present VITA-QinYu, the first expressive end-to-end (E2E) spoken language model (SLM) that goes beyond natural conversation to support both role-playing and singing generation. VITA-QinYu adopts a hybrid speech-text paradigm that extends interleaved text-audio modeling with multi-codebook audio tokens, a design enabling richer paralinguistic representation while preserving a clear separation between modalities to avoid interference. We further develop a comprehensive data generation pipeline to synthesize a total of 15.8K hours of natural conversation, role-playing, and singing data for training. VITA-QinYu demonstrates superior expressiveness, outperforming peer SLMs by 7 percentage points on objective role-playing benchmarks, and surpassing peer models by 0.13 points on a 5-point MOS scale for singing. Simultaneously, it achieves state-of-the-art conversational accuracy and fluency, exceeding prior SLMs by 1.38 and 4.98 percentage points on the C3 and URO benchmarks, respectively. We open-source our code and models and provide an easy-to-use demo with full-stack support for streaming and full-duplex interaction.
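Abstract (reference translation): Human speech carries expressiveness beyond its linguistic content, such as personality, mood, and performance elements like a comforting tone or a hummed song; we formalize these as role-playing and singing. We propose VITA-QinYu, the first expressive end-to-end (E2E) spoken language model (SLM) that goes beyond natural conversation to support both role-playing and singing generation. VITA-QinYu adopts a hybrid speech-text paradigm that extends interleaved text-audio modeling with multi-codebook audio tokens, which enables richer paralinguistic representation while keeping the modalities clearly separated to avoid interference. We also develop a comprehensive data generation pipeline that synthesizes a total of 15.8K hours of natural conversation, role-playing, and singing data for training. VITA-QinYu shows superior expressiveness, outperforming peer SLMs by 7 percentage points on objective role-playing benchmarks and surpassing peer models by 0.13 points on a 5-point MOS scale for singing. At the same time, it achieves state-of-the-art conversational accuracy and fluency, exceeding prior SLMs by 1.38 and 4.98 percentage points on the C3 and URO benchmarks, respectively. We open-source our code and models and provide an easy-to-use demo with full-stack support for streaming and full-duplex interaction.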
