Fugu-MT Paper Translation (Abstract): Bagpiper: Solving Open-Ended Audio Tasks via Rich Captions

Paper overview: Bagpiper: Solving Open-Ended Audio Tasks via Rich Captions

arxiv url: http://arxiv.org/abs/2602.05220v1
Date: Thu, 05 Feb 2026 02:20:07 GMT
Status: Translation complete
Last updated in system: 2026-02-06 18:49:08.720144
Title: Bagpiper: Solving Open-Ended Audio Tasks via Rich Captions
Title (reference translation): Bagpiper: Solving Open-Ended Audio Tasks with Rich Captions
Authors: Jinchuan Tian, Haoran Wang, Bo-Hao Su, Chien-yu Huang, Qingzheng Wang, Jiatong Shi, William Chen, Xun Gong, Siddhant Arora, Chin-Jou Li, Masao Someki, Takashi Maekaku, Yusuke Shinohara, Jin Sakuma, Chao-Han Huck Yang, Shinji Watanabe
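Abstract summary: Bagpiper is an 8B audio foundation model that interprets physical audio through rich captions. During fine-tuning, Bagpiper adopts a caption-then-process workflow to solve diverse tasks without task-specific priors. To the best of our knowledge, Bagpiper is among the first works to achieve unified understanding and generation for general audio.
Reference score (independently computed attention score): 84.73122243726775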
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Current audio foundation models typically rely on rigid, task-specific supervision, addressing isolated factors of audio rather than the whole. In contrast, human intelligence processes audio holistically, seamlessly bridging physical signals with abstract cognitive concepts to execute complex tasks. Grounded in this philosophy, we introduce Bagpiper, an 8B audio foundation model that interprets physical audio via rich captions, i.e., comprehensive natural language descriptions that encapsulate the critical cognitive concepts inherent in the signal (e.g., transcription, audio events). By pre-training on a massive corpus of 600B tokens, the model establishes a robust bidirectional mapping between raw audio and this high-level conceptual space. During fine-tuning, Bagpiper adopts a caption-then-process workflow, simulating an intermediate cognitive reasoning step to solve diverse tasks without task-specific priors. Experimentally, Bagpiper outperforms Qwen-2.5-Omni on MMAU and AIRBench for audio understanding and surpasses CosyVoice3 and TangoFlux in generation quality, capable of synthesizing arbitrary compositions of speech, music, and sound effects. To the best of our knowledge, Bagpiper is among the first works that achieve unified understanding generation for general audio. Model, data, and code are available at Bagpiper Home Page.
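Abstract (reference translation): Current audio foundation models generally rely on rigid, task-specific supervision that addresses isolated factors of audio rather than the whole. By contrast, human intelligence processes audio holistically, seamlessly bridging physical signals and abstract cognitive concepts to carry out complex tasks. Building on this philosophy, we introduce Bagpiper, an 8B audio foundation model that interprets physical audio through rich captions, that is, comprehensive natural language descriptions that encapsulate the critical cognitive concepts inherent in the signal (e.g., transcription, audio events). By pre-training on a massive corpus of 600B tokens, the model establishes a robust bidirectional mapping between raw audio and this high-level conceptual space. During fine-tuning, Bagpiper adopts a caption-then-process workflow, simulating an intermediate cognitive reasoning step to solve diverse tasks without task-specific priors. Experimentally, Bagpiper outperforms Qwen-2.5-Omni on MMAU and AIRBench for audio understanding and surpasses CosyVoice3 and TangoFlux in generation quality, and it can synthesize arbitrary compositions of speech, music, and sound effects. To the best of our knowledge, Bagpiper is among the first works to achieve unified understanding and generation for general audio. Model, data, and code are available at the Bagpiper Home Page.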
