Fugu-MT Paper Translation (Abstract): SketchSong: Hierarchical Song Generation with Sketch Planning and Fine-Grained Multi-Track Modeling

Paper overview: SketchSong: Hierarchical Song Generation with Sketch Planning and Fine-Grained Multi-Track Modeling

arxiv url: http://arxiv.org/abs/2606.03169v1
Date: Tue, 02 Jun 2026 05:27:56 GMT
Status: Translation complete
Last updated in system: 2026-06-04 10:57:21.724145
Title: SketchSong: Hierarchical Song Generation with Sketch Planning and Fine-Grained Multi-Track Modeling
Authors: Xiaoyue Duan, Nanxing Hu, Yutang Feng, Xudong Yan, Jiatao Chen, Jinchao Zhang, Jie Zhou
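Abstract summary: SketchSong is a hierarchical song generation framework that addresses these issues through song-level sketch planning and fine-grained multi-track modeling. Along the track dimension, SketchSong explicitly models four tracks: vocals, bass, drums, and other instruments. Experiments on song generation benchmarks show that SketchSong consistently outperforms the baseline on both objective metrics and human listening tests.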
Reference score (attention level, computed independently by this site): 21.874594911334285
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent song generation systems can synthesize realistic audio, yet generating complete songs remains challenging for two reasons. First, explicit song-level arrangement planning remains limited in existing methods, so models often need to organize overall arrangement development while generating low-level audio details. This often leads to incoherence in arrangements, such as weak section transitions and limited dynamic progression. Second, coarse modeling of different musical parts obscures their distinct roles and interactions, limiting arrangement richness of generated songs. In this paper, we present SketchSong, a hierarchical song generation framework that addresses these issues through song-level sketch planning and fine-grained multi-track modeling. Along the temporal dimension, SketchSong first predicts a compact sequence of high-level sketch tokens derived from compressed audio representations, and then generates audio tokens conditioned on these sketches. This coarse-to-fine process gives the model an explicit arrangement plan before detailed audio generation. Along the track dimension, SketchSong explicitly models four tracks, i.e., vocals, bass, drums and other instruments. This enables the model to capture the roles and interactions of different musical parts more precisely. Experiments on song generation benchmarks show that SketchSong consistently outperforms our baseline on both objective metrics and human listening tests. Despite not employing additional post-training for preference optimization such as lyrics and text-prompt alignments, SketchSong achieves competitive results against strong, post-trained open-source systems, demonstrating the effectiveness of our overall design.
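The coarse-to-fine process described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, vocabulary sizes, and the compression ratio below are hypothetical, and seeded random draws stand in for the learned models. It only shows the data flow, in which a short sketch sequence is planned first and per-track audio tokens are then generated conditioned on it.

```python
import random
from typing import Dict, List

# The four tracks named in the abstract.
TRACKS = ("vocals", "bass", "drums", "other")

# Hypothetical sizes, chosen only for illustration.
SKETCH_VOCAB = 16       # sketch-token vocabulary size
AUDIO_VOCAB = 1024      # audio-token vocabulary size
FRAMES_PER_SKETCH = 4   # audio frames covered by one sketch token


def plan_sketch(prompt: str, num_sketch_tokens: int) -> List[int]:
    """Stage 1 stand-in: produce a compact song-level sketch sequence."""
    rng = random.Random(prompt)  # seeded by the prompt, so the toy is reproducible
    return [rng.randrange(SKETCH_VOCAB) for _ in range(num_sketch_tokens)]


def generate_tracks(sketch: List[int]) -> Dict[str, List[int]]:
    """Stage 2 stand-in: emit audio tokens per track, conditioned on the sketch."""
    audio: Dict[str, List[int]] = {track: [] for track in TRACKS}
    for position, sketch_token in enumerate(sketch):
        for track_index, track in enumerate(TRACKS):
            # Conditioning is mimicked by deriving the seed from the sketch token,
            # its position, and the track; a real model would use learned networks.
            rng = random.Random(sketch_token * 10007 + position * 101 + track_index)
            audio[track].extend(
                rng.randrange(AUDIO_VOCAB) for _ in range(FRAMES_PER_SKETCH)
            )
    return audio


def generate_song(prompt: str, num_sketch_tokens: int = 8) -> dict:
    """Run both stages: plan the sketch first, then fill in audio tokens."""
    sketch = plan_sketch(prompt, num_sketch_tokens)
    return {"sketch": sketch, "tracks": generate_tracks(sketch)}


if __name__ == "__main__":
    song = generate_song("upbeat pop with a strong chorus")
    print(len(song["sketch"]), {t: len(v) for t, v in song["tracks"].items()})
```

In this toy, each track depends only on the sketch, so it does not model the interactions between tracks that the abstract emphasizes; it is meant only to make the two-level token hierarchy concrete.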