Fugu-MT Paper Translation (Abstract): MinMo: A Multimodal Large Language Model for Seamless Voice Interaction

Paper abstract: MinMo: A Multimodal Large Language Model for Seamless Voice Interaction

arxiv url: http://arxiv.org/abs/2501.06282v1
Date: Fri, 10 Jan 2025 15:55:27 GMT
Status: Translation complete
Last updated in system: 2025-01-14 19:20:12.585911
Title: MinMo: A Multimodal Large Language Model for Seamless Voice Interaction
Title (reference translation): MinMo: A Multimodal Large Language Model for Seamless Voice Interaction
Authors: Qian Chen, Yafeng Chen, Yanni Chen, Mengzhe Chen, Yingda Chen, Chong Deng, Zhihao Du, Ruize Gao, Changfeng Gao, Zhifu Gao, Yabin Li, Xiang Lv, Jiaqing Liu, Haoneng Luo, Bin Ma, Chongjia Ni, Xian Shi, Jialong Tang, Hui Wang, Hao Wang, Wen Wang, Yuxuan Wang, Yunlan Xu, Fan Yu, Zhijie Yan, Yexin Yang, Baosong Yang, Xian Yang, Guanrou Yang, Tianyu Zhao, Qinglin Zhang, Shiliang Zhang, Nan Zhao, Pei Zhang, Chong Zhang, Jinren Zhou
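Abstract summary: We introduce MinMo, a multimodal large language model for seamless voice interaction. We train MinMo through speech-to-text alignment, text-to-speech alignment, speech-to-speech alignment, and duplex interaction alignment. After the multi-stage training, MinMo achieves state-of-the-art performance across various benchmarks for voice comprehension and generation.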
Reference score (independently calculated attention score): 73.39573341265027
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advancements in large language models (LLMs) and multimodal speech-text models have laid the groundwork for seamless voice interactions, enabling real-time, natural, and human-like conversations. Previous models for voice interactions are categorized as native and aligned. Native models integrate speech and text processing in one framework but struggle with issues like differing sequence lengths and insufficient pre-training. Aligned models maintain text LLM capabilities but are often limited by small datasets and a narrow focus on speech tasks. In this work, we introduce MinMo, a Multimodal Large Language Model with approximately 8B parameters for seamless voice interaction. We address the main limitations of prior aligned multimodal models. We train MinMo through multiple stages of speech-to-text alignment, text-to-speech alignment, speech-to-speech alignment, and duplex interaction alignment, on 1.4 million hours of diverse speech data and a broad range of speech tasks. After the multi-stage training, MinMo achieves state-of-the-art performance across various benchmarks for voice comprehension and generation while maintaining the capabilities of text LLMs, and also facilitates full-duplex conversation, that is, simultaneous two-way communication between the user and the system. Moreover, we propose a novel and simple voice decoder that outperforms prior models in voice generation. The enhanced instruction-following capabilities of MinMo support controlling speech generation based on user instructions, with various nuances including emotions, dialects, and speaking rates, and mimicking specific voices. For MinMo, the speech-to-text latency is approximately 100ms, and the full-duplex latency is approximately 600ms in theory and 800ms in practice. The MinMo project web page is https://funaudiollm.github.io/minmo, and the code and models will be released soon.
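Abstract (reference translation): Recent advances in large language models (LLMs) and multimodal speech-text models have laid the foundation for seamless voice interaction, enabling real-time, natural, human-like conversation. Previous voice interaction models are categorized as native or aligned. Native models integrate speech and text processing in a single framework but struggle with issues such as differing sequence lengths and insufficient pre-training. Aligned models retain text LLM capabilities but are often limited by small datasets and a narrow focus on speech tasks. In this work, we introduce MinMo, a multimodal large language model with approximately 8B parameters for seamless voice interaction. We address the main limitations of prior aligned multimodal models. We train MinMo through multiple stages of speech-to-text alignment, text-to-speech alignment, speech-to-speech alignment, and duplex interaction alignment, on 1.4 million hours of diverse speech data and a broad range of speech tasks. After the multi-stage training, MinMo achieves state-of-the-art performance on various benchmarks for voice comprehension and generation while maintaining the capabilities of text LLMs, and also enables simultaneous two-way communication between the user and the system. In addition, we propose a novel and simple voice decoder that outperforms prior voice generation models. MinMo's enhanced instruction-following capabilities support controlling speech generation based on user instructions, with various nuances including emotions, dialects, and speaking rates, as well as mimicking specific voices. For MinMo, the speech-to-text latency is approximately 100ms, and the full-duplex latency is approximately 600ms in theory and 800ms in practice. The MinMo project web page is https://funaudiollm.github.io/minmo.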

Related papers