Fugu-MT Paper Translation (Abstract): Efficient Training for Cross-lingual Speech Language Models

Paper overview: Efficient Training for Cross-lingual Speech Language Models

arxiv url: http://arxiv.org/abs/2604.11096v1
Date: Mon, 13 Apr 2026 07:12:40 GMT
Status: Translation complete
Last updated in system: 2026-04-14 20:13:16.388762
Title: Efficient Training for Cross-lingual Speech Language Models
Title (reference translation): Efficient Training for Cross-lingual Speech Language Models
Authors: Yan Zhou, Qingkai Fang, Yun Hong, Yang Feng
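Abstract summary: Cross-lingual Speech Language Model (CSLM) is an efficient training method for cross-lingual speech LLMs based on discrete speech tokens. The paper proposes a novel alignment strategy that achieves cross-modal and cross-lingual alignment through continual pre-training.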
Reference score (independently computed attention measure): 35.512064681474065
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Currently, large language models (LLMs) predominantly focus on the text modality. To enable more natural human-AI interaction, speech LLMs are emerging, but building effective end-to-end speech LLMs remains challenging due to limited data and the difficulty in expanding to more languages. In this paper, we introduce Cross-lingual Speech Language Model (CSLM), an efficient training method for cross-lingual speech LLMs based on discrete speech tokens. We propose a novel alignment strategy that achieves cross-modal and cross-lingual alignment through continual pre-training. By conducting instruction fine-tuning following a speech-text interleaved chain-of-modality generation process, we enhance modal alignment at a finer granularity, thereby improving generation quality and reducing latency. CSLM aligns different modalities and languages simultaneously without the need for massive speech data, thus exhibiting good language scalability. Evaluations on cross-modal tasks, mono-lingual conversational tasks, and cross-lingual conversational tasks demonstrate CSLM's strong cross-modal alignment capabilities and general task abilities. (Code is available at: https://github.com/ictnlp/CSLM)
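Abstract (reference translation): At present, large language models (LLMs) focus mainly on the text modality. Speech LLMs are emerging to enable more natural human-AI interaction, but building an effective end-to-end speech LLM remains difficult because data is limited and extending to more languages is hard. This paper proposes CSLM, an efficient training method for cross-lingual speech LLMs based on discrete speech tokens. It proposes a novel alignment strategy that achieves cross-modal and cross-lingual alignment through continual pre-training. By performing instruction fine-tuning that follows a speech-text interleaved chain-of-modality generation process, it strengthens modal alignment at a finer granularity, improving generation quality and reducing latency. CSLM aligns different modalities and languages simultaneously without requiring massive speech data, and shows good language scalability. Evaluations on cross-modal tasks, monolingual conversational tasks, and cross-lingual conversational tasks demonstrate CSLM's strong cross-modal alignment capabilities and general task abilities. (https://github.com/ictnlp/CSLM)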
