Fugu-MT Paper Translation (Abstract): UniSS: Unified Expressive Speech-to-Speech Translation with Your Voice

Paper abstract: UniSS: Unified Expressive Speech-to-Speech Translation with Your Voice

arxiv url: http://arxiv.org/abs/2509.21144v1
Date: Thu, 25 Sep 2025 13:30:46 GMT
Status: Translation complete
Last updated in system: 2025-09-26 20:58:12.936478
Title: UniSS: Unified Expressive Speech-to-Speech Translation with Your Voice
Title (reference translation): UniSS: Unified Expressive Speech Translation That Synthesizes Speech by Voice
Authors: Sitong Cheng, Weizhen Bian, Xinsheng Wang, Ruibin Yuan, Jianyi Chen, Shunshun Yin, Yike Guo, Wei Xue
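Abstract summary: We introduce UniSS, a novel single-stage framework for expressive S2ST. The approach features carefully designed speech semantic and style modeling. We release UniST, a large-scale, high-quality expressive S2ST dataset comprising 44.8k hours of data.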
Reference score (independently computed attention score): 33.43869151508715
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The ultimate goal of expressive speech-to-speech translation (S2ST) is to accurately translate spoken content while preserving the speaker identity and emotional style. However, progress in this field is largely hindered by three key challenges: the scarcity of paired speech data that retains expressive styles, the complexity of multi-stage processing pipelines, and the limited transfer of translation capabilities from large language models (LLMs). In this work, we address these challenges by introducing UniSS, a novel single-stage framework for expressive S2ST. Our approach features carefully designed speech semantic and style modeling, enabling seamless integration with existing text-based LLM frameworks to develop a unified text-speech language model. To transfer translation capabilities from text to speech, we propose a cross-modal chain-of-thought prompting process that progressively aligns audio semantics with text and ensures style preservation in the decoded results. Furthermore, we construct and release a large-scale, high-quality expressive S2ST dataset, UniST, comprising 44.8k hours of data. Experimental results show that UniSS significantly outperforms previous methods in translation fidelity and speech quality while preserving voice, emotion, and duration consistency. Our work establishes a simpler and more effective paradigm for building the next generation of expressive S2ST systems. Audio samples are available at https://cmots.github.io/uniss-demo.
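Abstract (reference translation): The ultimate goal of expressive speech-to-speech translation (S2ST) is to translate spoken content accurately while preserving the speaker's identity and emotional style. Progress in this field, however, is largely hindered by three major challenges: the scarcity of paired speech data that retains expressive style, the complexity of multi-stage processing pipelines, and the limited transfer of translation capabilities from large language models (LLMs). This paper addresses these challenges by introducing UniSS, a novel single-stage framework for expressive S2ST. The proposed method features carefully designed speech semantic and style modeling and integrates seamlessly with existing text-based LLM frameworks to build a unified text-speech language model. To transfer translation capabilities from text to speech, the paper proposes a cross-modal chain-of-thought prompting process that progressively aligns audio semantics with text and ensures style preservation in the decoded results. In addition, the authors construct and release UniST, a large-scale, high-quality expressive S2ST dataset comprising 44.8k hours of data. Experimental results show that UniSS significantly outperforms previous methods in translation fidelity and speech quality while preserving voice, emotion, and duration consistency. The work establishes a simpler and more effective paradigm for building the next generation of expressive S2ST systems. Audio samples are available at https://cmots.github.io/uniss-demo.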


Related papers