Fugu-MT Paper Translation (Abstract): FLUID: Flow-Latent Unified Integration via Token Distillation for Expert Specialization in Multimodal Learning

Paper overview: FLUID: Flow-Latent Unified Integration via Token Distillation for Expert Specialization in Multimodal Learning

arxiv url: http://arxiv.org/abs/2508.07264v1
Date: Sun, 10 Aug 2025 09:34:17 GMT
Status: Translation complete
Last updated in system: 2025-08-18 14:51:23.563097
Title: FLUID: Flow-Latent Unified Integration via Token Distillation for Expert Specialization in Multimodal Learning
Title (reference translation): FLUID: Flow-Latent Integration via Token Distillation for Expert Specialization in Multimodal Learning
Authors: Van Duc Cuong, Ta Dinh Tam, Tran Duc Chinh, Nguyen Thi Hanh,
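Abstract summary: We propose FLUID (Flow-Latent Unified Integration via Token Distillation for Expert Specialization). FLUID contributes three elements: (1) Q-transforms, learnable query tokens; (2) a two-stage fusion scheme that enforces cross-modal consistency via contrastive alignment; and (3) a lightweight, load-balanced Mixture-of-Experts at prediction time.
Reference score (independently computed attention score): 1.912429179274357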
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal classification requires robust integration of visual and textual signals, yet common fusion strategies are brittle and vulnerable to modality-specific noise. In this paper, we present FLUID (Flow-Latent Unified Integration via Token Distillation for Expert Specialization), a principled token-level pipeline that improves cross-modal robustness and scalability. FLUID contributes three core elements: (1) Q-transforms, learnable query tokens that distill and retain salient token-level features from modality-specific backbones; (2) a two-stage fusion scheme that enforces cross-modal consistency via contrastive alignment and then performs adaptive, task-aware fusion through a gating mechanism and a Q-bottleneck that selectively compresses information for downstream reasoning; and (3) a lightweight, load-balanced Mixture-of-Experts at prediction time that enables efficient specialization to diverse semantic patterns. Extensive experiments demonstrate that FLUID attains 91% accuracy on the GLAMI-1M benchmark, significantly outperforming prior baselines and exhibiting strong resilience to label noise, long-tail class imbalance, and semantic heterogeneity. Targeted ablation studies corroborate both the individual and synergistic benefits of the proposed components, positioning FLUID as a scalable, noise-resilient solution for multimodal product classification.
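Abstract (reference translation): Multimodal classification requires robust integration of visual and textual signals, but common fusion strategies are brittle and vulnerable to modality-specific noise. This paper proposes FLUID (Flow-Latent Unified Integration via Token Distillation for Expert Specialization), a token-level pipeline that improves cross-modal robustness and scalability. It has three components: (1) Q-transforms, learnable query tokens that extract and retain salient token-level features from modality-specific backbones; (2) a two-stage fusion scheme that enforces cross-modal consistency through contrastive alignment, then performs adaptive, task-aware fusion through a gating mechanism and a Q-bottleneck that selectively compresses information for downstream reasoning; and (3) a lightweight, load-balanced Mixture-of-Experts at prediction time that enables efficient specialization to diverse semantic patterns. Extensive experiments show that FLUID achieves 91% accuracy on the GLAMI-1M benchmark, significantly outperforming prior baselines and showing strong resilience to label noise, long-tail class imbalance, and semantic heterogeneity. Targeted ablation studies corroborate both the individual and the synergistic benefits of the proposed components, positioning FLUID as a scalable, noise-resilient solution for multimodal product classification.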
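The abstract names learnable query tokens and gated fusion but gives no equations, so the following is only a minimal sketch of what those two ideas could look like, not the authors' implementation. It assumes that query tokens distill backbone features by softmax cross-attention pooling and that fusion uses a single scalar sigmoid gate; the function names (`distill_tokens`, `gated_fusion`, `mean_pool`), the dimensions, and the random stand-in weights are all hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def distill_tokens(queries, tokens):
    """Assumed form of query-token distillation: each query attends over all
    backbone tokens and returns a softmax-weighted average of them, so N input
    tokens are compressed to len(queries) output tokens."""
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    distilled = []
    for q in queries:
        weights = softmax([dot(q, t) / scale for t in tokens])
        distilled.append(
            [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]
        )
    return distilled


def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def gated_fusion(image_vec, text_vec, gate_weights, gate_bias=0.0):
    """Assumed form of gated fusion: a scalar gate g in (0, 1), computed from
    the concatenated inputs, mixes the modalities as g*image + (1-g)*text."""
    logit = dot(gate_weights, image_vec + text_vec) + gate_bias
    g = 1.0 / (1.0 + math.exp(-logit))
    fused = [g * i + (1.0 - g) * t for i, t in zip(image_vec, text_vec)]
    return fused, g


# Toy usage with random stand-ins for backbone outputs and learned weights.
rng = random.Random(0)
DIM = 4
image_tokens = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(9)]
text_tokens = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]
image_queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(2)]
text_queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(2)]

image_summary = mean_pool(distill_tokens(image_queries, image_tokens))
text_summary = mean_pool(distill_tokens(text_queries, text_tokens))
gate_w = [rng.gauss(0, 0.1) for _ in range(2 * DIM)]
fused_vec, gate_value = gated_fusion(image_summary, text_summary, gate_w)
```

In a trained model the queries and gate weights would be learned parameters; here they are random, so `fused_vec` only illustrates the data flow from variable-length token sequences to one fixed-size fused vector.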

Related papers