Fugu-MT Paper Translation (Summary): CAST: Collapse-Aware multi-Scale Topology Fusion for Multimodal Coreset Selection

Paper summary: CAST: Collapse-Aware multi-Scale Topology Fusion for Multimodal Coreset Selection

arxiv url: http://arxiv.org/abs/2605.11705v1
Date: Tue, 12 May 2026 07:59:08 GMT
Status: Translation complete
System update date: 2026-05-13 21:48:56.684668
Title: CAST: Collapse-Aware multi-Scale Topology Fusion for Multimodal Coreset Selection
Title (reference translation): CAST: Collapse-Aware Multi-Scale Topology Fusion for Multimodal Coreset Selection
Authors: Boran Zhao, Hetian Liu, Zhenxian Hu, Yuqing Yuan, Yu Yan, Pengju Ren
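Abstract summary: We propose a Collapse-Aware Multi-Scale Topology fusion framework for multimodal coreset selection. We first construct image- and text-modality topologies and derive a unified topology via local-collapse-aware refinement and cross-modal fusion. We then introduce a multi-scale distribution matching criterion in the diffusion wavelet domain, encouraging the coreset to approximate the original dataset at multiple scales.
Reference score (independently computed attention score): 8.275673045109079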
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The training of large multimodal models fundamentally relies on massive image-text datasets, which inevitably incur prohibitive computational overhead. Dataset selection offers a promising paradigm by identifying a highly informative coreset. However, existing approaches suffer from two critical limitations: (i) single-modality-dominated sampling methods, which ignore the fine-grained cross-modal information imbalance inherent in multimodal datasets and thus lead to semantic loss in the other modality; and (ii) coarse-grained sample-scoring-based sampling methods, where the selected coreset tends to be biased toward the scoring model, making it difficult to guarantee distributional equivalence between the coreset and the original dataset. Meanwhile, existing distribution matching and discrete sampling strategies often fail to jointly account for global semantic structure, local fine-grained details, and redundancy-aware coverage in dense regions. To this end, we propose CAST, a Collapse-Aware multi-Scale Topology fusion framework for multimodal coreset selection. We first construct image- and text-modality topologies, and derive a unified topology via local-collapse-aware refinement and cross-modal fusion. We then introduce a multi-scale distribution matching criterion in the diffusion wavelet domain, encouraging the coreset to approximate the original dataset at multiple scales. Finally, we introduce a local soft relational coverage mechanism that extends pure geometric coverage to relation-aware indirect coverage, penalizing redundant selections in dense clusters. Extensive experiments on Flickr30K and MS-COCO show that CAST outperforms existing dataset selection baselines, showcasing great superiority in cross-architecture generalization and energy efficiency over state-of-the-art multimodal synthesis methods.
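Abstract (reference translation): Training large multimodal models fundamentally relies on massive image-text datasets, which inevitably incurs prohibitive computational overhead. Dataset selection offers a promising paradigm by identifying a highly informative coreset. However, existing approaches have two limitations: (i) single-modality-dominated sampling methods, which ignore the fine-grained cross-modal information imbalance inherent in multimodal datasets and thereby cause semantic loss in the other modality; and (ii) coarse-grained sample-scoring-based sampling methods, in which the selected coreset tends to be biased toward the scoring model, making it difficult to guarantee distributional equivalence between the coreset and the original dataset. Meanwhile, existing distribution matching and discrete sampling strategies often fail to jointly account for global semantic structure, local fine-grained details, and redundancy-aware coverage in dense regions. We therefore propose CAST, a Collapse-Aware Multi-Scale Topology fusion framework for multimodal coreset selection. We first construct image- and text-modality topologies and derive a unified topology via local-collapse-aware refinement and cross-modal fusion. We then introduce a multi-scale distribution matching criterion in the diffusion wavelet domain, encouraging the coreset to approximate the original dataset at multiple scales. Finally, we introduce a local soft relational coverage mechanism that extends pure geometric coverage to relation-aware indirect coverage, penalizing redundant selections in dense clusters. Extensive experiments on Flickr30K and MS-COCO show that CAST outperforms existing dataset selection baselines and offers better cross-architecture generalization and energy efficiency than state-of-the-art multimodal synthesis methods.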
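Illustrative sketch (not the authors' implementation): the abstract outlines a pipeline of per-modality topologies, cross-modal fusion, and coverage-based selection that penalizes redundancy. The Python below shows one simple way such a pipeline could look, under loud assumptions: the topology is a plain cosine k-nearest-neighbour graph, fusion is a weighted average of the two graphs, and selection is a greedy facility-location-style coverage rule. The collapse-aware refinement and the diffusion-wavelet matching criterion are not reproduced here, and the names `knn_graph`, `fuse`, and `greedy_cover` are hypothetical.

```python
import numpy as np


def knn_graph(X, k):
    """Symmetric k-NN graph weighted by cosine similarity (assumes k < len(X))."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T
    np.fill_diagonal(S, -np.inf)            # exclude self from the neighbour search
    idx = np.argsort(-S, axis=1)[:, :k]     # k most similar points per row
    W = np.zeros(S.shape)
    rows = np.arange(len(X))[:, None]
    W[rows, idx] = np.clip(S[rows, idx], 0.0, None)  # keep non-negative weights
    return np.maximum(W, W.T)               # symmetrize


def fuse(W_img, W_txt, alpha=0.5):
    """Assumed stand-in for cross-modal fusion: a convex combination of graphs."""
    return alpha * W_img + (1.0 - alpha) * W_txt


def greedy_cover(W, m):
    """Greedily pick m points that maximize soft coverage of all points.

    A point j is covered by a selected point i to degree W[i, j]; a point
    covers itself fully. Points in an already covered dense cluster add
    little gain, so redundant selections are discouraged.
    """
    A = W.copy()
    np.fill_diagonal(A, 1.0)
    covered = np.zeros(A.shape[0])
    chosen = []
    for _ in range(m):
        gain = np.maximum(A, covered[None, :]).sum(axis=1) - covered.sum()
        if chosen:
            gain[chosen] = -np.inf           # never pick the same point twice
        best = int(np.argmax(gain))
        chosen.append(best)
        covered = np.maximum(covered, A[best])
    return chosen


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.normal(size=(50, 16))          # placeholder image embeddings
    txt = rng.normal(size=(50, 16))          # placeholder text embeddings
    W = fuse(knn_graph(img, k=5), knn_graph(txt, k=5))
    print(greedy_cover(W, m=10))
```

In this sketch the embeddings are random placeholders; in practice they would come from whatever image and text encoders are available, and `alpha` and `k` are free parameters chosen for illustration only.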


Related papers