Fugu-MT Paper Translation (Abstract): MOSS-Audio Technical Report

Paper overview: MOSS-Audio Technical Report

arxiv url: http://arxiv.org/abs/2606.01802v2
Date: Tue, 02 Jun 2026 08:35:57 GMT
Status: Translation complete
Last updated in system: 2026-06-03 18:57:50.471161
Title: MOSS-Audio Technical Report
Title (reference translation): MOSS-Audio Technical Report
Authors: Chen Yang, Chufan Yu, Hanfu Chen, Jie Zhu, Jingqi Chen, Ke Chen, Wenxuan Wang, Yang Wang, Yaozhou Jiang, Yi Jiang, Zhengyuan Lin, Ziqi Chen, Zhaoye Fei, Chenghao Liu, Jun Zhan, Kang Yu, Kexin Huang, Mingshu Chen, Qinyuan Cheng, Ruixiao Li, Shimin Li, Songlin Wang, Yitian Gong, Yang Gao, Yiyang Zhang, Xipeng Qiu
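Abstract summary: MOSS-Audio is a unified audio-language model for speech, environmental sound, and music understanding. It supports audio captioning, time-aware question answering, timestamped transcription, and audio-grounded reasoning.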
Reference score (independently computed attention score): 79.99038866101354
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: MOSS-Audio is a unified audio-language model for speech, environmental sound, and music understanding, supporting audio captioning, time-aware question answering, timestamped transcription, and audio-grounded reasoning. MOSS-Audio couples a dedicated audio encoder with a modality adapter and a large language model: the encoder produces 12.5 Hz temporal representations, the adapter projects them into the decoder space, and the decoder generates autoregressive text outputs. Two design choices are central to the system: DeepStack cross-layer feature injection, which exposes the decoder to acoustic information from multiple encoder depths, and time markers, which provide explicit temporal cues by inserting timestamp markers into the audio-token stream. At the data level, we design an event-preserving audio annotation pipeline that segments raw audio at coherent event boundaries, applies branch-specific annotation to speech, music, and general audio, and merges the results into unified captions for pretraining. The intermediate branch-specific captions are further retained to support the construction of task-oriented SFT data. The model is pretrained on large-scale audio-language data, with time-aware objectives incorporated to support temporal grounding, and then undergoes multi-stage post-training to enhance instruction following and audio-grounded reasoning. We release 4B and 8B variants in both Instruct and Thinking configurations. MOSS-Audio achieves strong performance across general audio understanding, speech captioning, ASR, and timestamped ASR, positioning it as a promising understanding foundation for future voice agents.
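Abstract (reference translation): MOSS-Audio is a unified audio-language model for speech, environmental sound, and music understanding that supports audio captioning, time-aware question answering, timestamped transcription, and audio-grounded reasoning. In MOSS-Audio, the encoder produces 12.5 Hz temporal representations, the adapter projects them into the decoder space, and the decoder generates autoregressive text output. Two design choices are central: DeepStack cross-layer feature injection, which exposes the decoder to acoustic information from multiple encoder depths, and time markers, which provide explicit temporal cues by inserting timestamp markers into the audio-token stream. At the data level, the authors design an event-preserving audio annotation pipeline that segments raw audio at coherent event boundaries, applies branch-specific annotation to speech, music, and general audio, and merges the results into unified captions for pretraining. The intermediate branch-specific captions are also retained to support the construction of task-oriented SFT data. The model is pretrained on large-scale audio-language data with time-aware objectives that support temporal grounding, and then undergoes multi-stage post-training to strengthen instruction following and audio-grounded reasoning. 4B and 8B variants are released in both Instruct and Thinking configurations. MOSS-Audio achieves strong performance across general audio understanding, speech captioning, ASR, and timestamped ASR, which positions it as a promising understanding foundation for future voice agents.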
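
The abstract's two concrete numbers and mechanisms (a 12.5 Hz encoder output and timestamp markers inserted into the audio-token stream) can be illustrated with a small sketch. Only the 12.5 Hz rate comes from the abstract; the marker spacing (`marker_every_s`), the marker string format (`<|2.0s|>`), and the function names are hypothetical choices made for illustration, not the authors' implementation.

```python
import math

FRAME_RATE_HZ = 12.5  # encoder output rate stated in the abstract (one frame per 80 ms)


def num_frames(duration_s: float) -> int:
    """Number of encoder frames needed to cover `duration_s` seconds of audio."""
    return math.ceil(duration_s * FRAME_RATE_HZ)


def insert_time_markers(frames, marker_every_s: float = 2.0):
    """Interleave timestamp markers with audio frames.

    ASSUMPTION: the marker interval and the "<|t s|>" format are hypothetical;
    the abstract only says that timestamp markers are inserted into the stream.
    """
    frames_per_marker = round(marker_every_s * FRAME_RATE_HZ)
    if frames_per_marker < 1:
        raise ValueError("marker_every_s is shorter than one encoder frame")
    out = []
    for i, frame in enumerate(frames):
        if i % frames_per_marker == 0:
            # Time at which this frame starts, derived from the frame index.
            out.append(f"<|{i / FRAME_RATE_HZ:.1f}s|>")
        out.append(frame)
    return out


# Four seconds of placeholder audio tokens: 4 s * 12.5 Hz = 50 frames.
stream = insert_time_markers([f"a{i}" for i in range(num_frames(4.0))])
```

With the assumed 2-second spacing, a marker precedes every 25th frame, so the decoder sees an explicit time cue alongside the audio tokens rather than having to infer elapsed time from position alone.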
