Fugu-MT Paper Translation (Abstract): GaMMA: Towards Joint Global-Temporal Music Understanding in Large Multimodal Models

Paper overview: GaMMA: Towards Joint Global-Temporal Music Understanding in Large Multimodal Models

arxiv url: http://arxiv.org/abs/2605.00371v1
Date: Fri, 01 May 2026 03:21:57 GMT
Status: Translation complete
System update date: 2026-05-04 17:43:28.835343
Title: GaMMA: Towards Joint Global-Temporal Music Understanding in Large Multimodal Models
Title (reference translation): GaMMA: Toward Joint Understanding of Global and Temporal Music in Large Multimodal Models
Authors: Zuyao You, Zhesong Yu, Mingyu Liu, Bilei Zhu, Yuan Wan, Zuxuan Wu
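Abstract summary: GaMMA is a large multimodal model (LMM) designed to achieve comprehensive musical content understanding. By incorporating audio encoders in a mixture-of-experts manner, GaMMA unifies both time-series and non-time-series music understanding tasks. The approach combines carefully curated datasets at scale with a progressive training pipeline to push the boundaries of music understanding.
Reference score (independently computed attention level): 55.49773230684554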
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: In this paper, we propose GaMMA, a state-of-the-art (SoTA) large multimodal model (LMM) designed to achieve comprehensive musical content understanding. GaMMA inherits the streamlined encoder-decoder design of LLaVA, enabling effective cross-modal learning between music and language. By incorporating audio encoders in a mixture-of-experts manner, GaMMA effectively unifies both time-series and non-time-series music understanding tasks within one set of parameters. Our approach combines carefully curated datasets at scale with a progressive training pipeline, effectively pushing the boundaries of music understanding via pretraining, supervised fine-tuning (SFT), and reinforcement learning (RL). To comprehensively assess both temporal and non-temporal capability of music LMMs, we introduce MusicBench, the largest music-oriented benchmark, comprising 3,739 human-curated multiple-choice questions covering diverse aspects of musical understanding. Extensive experiments demonstrate that GaMMA establishes new SoTA in the music domain, achieving 79.1% accuracy on MuchoMusic, 79.3% on MusicBench-Temporal, and 81.3% on MusicBench-Global, consistently outperforming previous methods.
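Abstract (reference translation): This paper proposes GaMMA, a state-of-the-art (SoTA) large multimodal model (LMM) for comprehensive musical content understanding. GaMMA inherits the streamlined encoder-decoder design of LLaVA, which enables effective cross-modal learning between music and language. By incorporating audio encoders in a mixture-of-experts manner, GaMMA unifies both time-series and non-time-series music understanding tasks within a single set of parameters. The proposed method combines carefully curated datasets at scale with a progressive training pipeline, pushing the boundaries of music understanding through pretraining, supervised fine-tuning (SFT), and reinforcement learning (RL). To comprehensively evaluate both the temporal and non-temporal capabilities of music LMMs, the authors introduce MusicBench, the largest music-oriented benchmark, comprising 3,739 human-curated multiple-choice questions that cover diverse aspects of musical understanding. Extensive experiments show that GaMMA establishes a new SoTA in the music domain, achieving 79.1% accuracy on MuchoMusic, 79.3% on MusicBench-Temporal, and 81.3% on MusicBench-Global, consistently outperforming previous methods.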

Related papers