Fugu-MT Paper Translation (Abstract): Music Flamingo: Scaling Music Understanding in Audio Language Models

Paper overview: Music Flamingo: Scaling Music Understanding in Audio Language Models

arxiv url: http://arxiv.org/abs/2511.10289v1
Date: Fri, 14 Nov 2025 01:43:47 GMT
Status: Translation complete
Last updated in system: 2025-11-14 22:53:22.79909
Title: Music Flamingo: Scaling Music Understanding in Audio Language Models
Title (reference translation): Music Flamingo: Scaling Up Music Understanding in Audio Language Models
Authors: Sreyan Ghosh, Arushi Goel, Lasha Koroshinadze, Sang-gil Lee, Zhifeng Kong, Joao Felipe Santos, Ramani Duraiswami, Dinesh Manocha, Wei Ping, Mohammad Shoeybi, Bryan Catanzaro
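Abstract summary: Music Flamingo is a novel large audio-language model designed to advance music understanding in foundational audio models. MF-Skills is a dataset labeled through a multi-stage pipeline that yields rich captions and question-answer pairs covering harmony, structure, timbre, lyrics, and cultural context. Post-training cold-starts with MF-Think, a novel chain-of-thought dataset grounded in music theory, followed by GRPO-based reinforcement learning with custom rewards.
Reference score (attention score computed independently by the site): 98.94537017112704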
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce Music Flamingo, a novel large audio-language model designed to advance music (including song) understanding in foundational audio models. While audio-language research has progressed rapidly, music remains challenging due to its dynamic, layered, and information-dense nature. Progress has been further limited by the difficulty of scaling open audio understanding models, primarily because of the scarcity of high-quality music data and annotations. As a result, prior models are restricted to producing short, high-level captions, answering only surface-level questions, and showing limited generalization across diverse musical cultures. To address these challenges, we curate MF-Skills, a large-scale dataset labeled through a multi-stage pipeline that yields rich captions and question-answer pairs covering harmony, structure, timbre, lyrics, and cultural context. We fine-tune an enhanced Audio Flamingo 3 backbone on MF-Skills and further strengthen multiple skills relevant to music understanding. To improve the model's reasoning abilities, we introduce a post-training recipe: we first cold-start with MF-Think, a novel chain-of-thought dataset grounded in music theory, followed by GRPO-based reinforcement learning with custom rewards. Music Flamingo achieves state-of-the-art results across 10+ benchmarks for music understanding and reasoning, establishing itself as a generalist and musically intelligent audio-language model. Beyond strong empirical results, Music Flamingo sets a new standard for advanced music understanding by demonstrating how models can move from surface-level recognition toward layered, human-like perception of songs. We believe this work provides both a benchmark and a foundation for the community to build the next generation of models that engage with music as meaningfully as humans do.
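Abstract (reference translation): We introduce Music Flamingo, a novel large audio-language model designed to advance the understanding of music (including song) in foundational audio models. Although audio-language research is progressing rapidly, music remains challenging because of its dynamic, layered, and information-dense nature. Progress has been further limited by the difficulty of scaling open audio understanding models and by the scarcity of high-quality music data and annotations. As a result, prior models are restricted to producing short, high-level captions, answering only surface-level questions, and showing limited generalization across diverse musical cultures. To address these challenges, we curate MF-Skills, a large-scale dataset labeled through a multi-stage pipeline, which yields rich captions and question-answer pairs covering harmony, structure, timbre, lyrics, and cultural context. We fine-tune an enhanced Audio Flamingo 3 backbone on MF-Skills and strengthen multiple skills relevant to music understanding. To improve reasoning, we first cold-start with MF-Think, a novel chain-of-thought dataset grounded in music theory, and then apply GRPO-based reinforcement learning with custom rewards. Music Flamingo achieves state-of-the-art results on more than 10 benchmarks for music understanding and reasoning, establishing itself as a generalist and musically intelligent audio-language model. Beyond its strong empirical results, Music Flamingo sets a new standard for advanced music understanding by showing how models can move from surface-level recognition toward layered, human-like perception of songs. We believe this work provides both a benchmark and a foundation for the community to build the next generation of models that engage with music as meaningfully as humans do.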
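
The abstract mentions GRPO-based reinforcement learning with custom rewards but gives no details. As a rough illustration only, the group-relative advantage step that gives GRPO its name can be sketched as below. This is a minimal sketch, not the paper's implementation: the function name, the reward values, the use of population standard deviation, and the `eps` stabilizer are all assumptions made for the example.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Population std and eps are illustrative choices, not from the paper.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for one prompt.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses scoring above the group mean receive positive advantages and those below receive negative ones, so no separate value model is needed to form the baseline.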


Related papers