FLUID: Flow-Latent Unified Integration via Token Distillation for Expert Specialization in Multimodal Learning
- URL: http://arxiv.org/abs/2508.07264v1
- Date: Sun, 10 Aug 2025 09:34:17 GMT
- Title: FLUID: Flow-Latent Unified Integration via Token Distillation for Expert Specialization in Multimodal Learning
- Authors: Van Duc Cuong, Ta Dinh Tam, Tran Duc Chinh, Nguyen Thi Hanh
- Abstract summary: We present FLUID (Flow-Latent Unified Integration via Token Distillation for Expert Specialization). FLUID contributes three core elements: (1) Q-transforms, learnable query tokens that distill and retain salient token-level features from modality-specific backbones; (2) a two-stage fusion scheme that enforces cross-modal consistency via contrastive alignment; and (3) a lightweight, load-balanced Mixture-of-Experts at prediction time.
- Score: 1.912429179274357
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal classification requires robust integration of visual and textual signals, yet common fusion strategies are brittle and vulnerable to modality-specific noise. In this paper, we present FLUID (Flow-Latent Unified Integration via Token Distillation for Expert Specialization), a principled token-level pipeline that improves cross-modal robustness and scalability. FLUID contributes three core elements: (1) Q-transforms, learnable query tokens that distill and retain salient token-level features from modality-specific backbones; (2) a two-stage fusion scheme that enforces cross-modal consistency via contrastive alignment and then performs adaptive, task-aware fusion through a gating mechanism and a Q-bottleneck that selectively compresses information for downstream reasoning; and (3) a lightweight, load-balanced Mixture-of-Experts at prediction time that enables efficient specialization to diverse semantic patterns. Extensive experiments demonstrate that FLUID attains 91% accuracy on the GLAMI-1M benchmark, significantly outperforming prior baselines and exhibiting strong resilience to label noise, long-tail class imbalance, and semantic heterogeneity. Targeted ablation studies corroborate both the individual and synergistic benefits of the proposed components, positioning FLUID as a scalable, noise-resilient solution for multimodal product classification.
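The Q-transform and gated-fusion ideas in the abstract map onto a short sketch. Below is a minimal, hypothetical PyTorch rendering: learnable query tokens cross-attend to each backbone's token sequence, and a sigmoid gate mixes the pooled latents. All module names, shapes, and hyperparameters are assumptions for illustration, not the authors' released code; the Mixture-of-Experts head is omitted here (frame-wise routing of a similar flavor is sketched under the Sparse MERIT entry below).

```python
# Illustrative sketch of a Q-transform: learnable query tokens distill a
# backbone's token sequence into a fixed-size set of latents. Hypothetical
# names and sizes; not the paper's implementation.
import torch
import torch.nn as nn

class QTransform(nn.Module):
    def __init__(self, dim: int = 256, num_queries: int = 8, num_heads: int = 4):
        super().__init__()
        # Learnable queries that absorb salient token-level features.
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim) from a modality-specific backbone.
        q = self.queries.expand(tokens.size(0), -1, -1)
        distilled, _ = self.attn(q, tokens, tokens)
        return self.norm(distilled)          # (batch, num_queries, dim)

class GatedFusion(nn.Module):
    """Adaptive fusion: a sigmoid gate mixes pooled image and text latents."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, img: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        img, txt = img.mean(dim=1), txt.mean(dim=1)   # pool query tokens
        g = self.gate(torch.cat([img, txt], dim=-1))
        return g * img + (1 - g) * txt

# Usage on dummy backbone outputs:
img_tokens = torch.randn(2, 197, 256)   # e.g. ViT patch tokens
txt_tokens = torch.randn(2, 64, 256)    # e.g. BERT token embeddings
qt_img, qt_txt = QTransform(), QTransform()
fused = GatedFusion()(qt_img(img_tokens), qt_txt(txt_tokens))
print(fused.shape)                       # torch.Size([2, 256])
```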
Related papers
- Multimodal Mixture-of-Experts with Retrieval Augmentation for Protein Active Site Identification [35.29329758342847]
We introduce Multimodal Mixture-of-Experts with Retrieval Augmentation (MERA), the first retrieval-augmented framework for protein active site identification. We show that MERA achieves state-of-the-art performance, with 90% AUPRC on active site prediction and significant gains on peptide-binding site identification.
arXiv Detail & Related papers (2026-03-02T06:40:04Z) - SKANet: A Cognitive Dual-Stream Framework with Adaptive Modality Fusion for Robust Compound GNSS Interference Classification [47.20483076887704]
Global Navigation Satellite Systems (GNSS) face growing threats from sophisticated jamming interference. We propose a cognitive deep learning framework built upon a dual-stream architecture that integrates Time-Frequency Images (TFIs) and Power Spectral Density (PSD). We show that SKANet achieves an overall accuracy of 96.99%, exhibiting superior robustness for compound jamming classification.
arXiv Detail & Related papers (2026-01-19T07:42:45Z) - Dual-granularity Sinkhorn Distillation for Enhanced Learning from Long-tailed Noisy Data [67.25796812343454]
Real-world datasets for deep learning frequently suffer from the co-occurring challenges of class imbalance and label noise. We propose Dual-granularity Sinkhorn Distillation (D-SINK), a novel framework that enhances dual robustness by distilling and integrating complementary insights. Experiments on benchmark datasets demonstrate that D-SINK significantly improves robustness and achieves strong empirical performance in learning from long-tailed noisy data.
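The summary names Sinkhorn distillation without spelling out the objective; losses of this family are typically built on the entropic-optimal-transport iteration below. A generic sketch, not D-SINK itself, and the cost-matrix choice is an assumption.

```python
# Sinkhorn-Knopp iteration for an entropy-regularized transport plan between
# uniform marginals. Generic sketch; D-SINK's actual loss may differ.
import torch

def sinkhorn(cost: torch.Tensor, eps: float = 0.05, iters: int = 50) -> torch.Tensor:
    n, m = cost.shape
    K = torch.exp(-cost / eps)          # Gibbs kernel
    r = torch.full((n,), 1.0 / n)       # row marginal
    c = torch.full((m,), 1.0 / m)       # column marginal
    v = torch.ones(m)
    for _ in range(iters):              # alternating scaling updates
        u = r / (K @ v)
        v = c / (K.T @ u)
    return u[:, None] * K * v[None, :]  # coupling P with marginals (r, c)

# e.g. cost = 1 - cosine similarity between student and teacher embeddings
plan = sinkhorn(torch.rand(8, 8))
print(plan.sum())                       # ~1.0, rows/columns match marginals
```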
arXiv Detail & Related papers (2025-10-09T13:05:27Z) - Beyond Simple Fusion: Adaptive Gated Fusion for Robust Multimodal Sentiment Analysis [27.11612547025828]
We introduce the Adaptive Gated Fusion Network (AGFN), which adaptively adjusts feature weights based on information entropy and modality importance. Experiments on CMU-MOSI and CMU-MOSEI show that AGFN significantly outperforms strong baselines in accuracy, effectively discerning subtle emotions with robust performance.
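Entropy-based gating of the kind this summary describes can be sketched directly: compute each modality's unimodal prediction entropy and down-weight uncertain (high-entropy) modalities before fusion. The function name and weighting rule are illustrative assumptions, not AGFN's exact formulation.

```python
# Shannon-entropy-aware modality weighting: low-entropy (confident) modalities
# receive larger fusion weights. Illustrative sketch only.
import torch
import torch.nn.functional as F

def entropy_weights(logits_per_modality: list[torch.Tensor]) -> torch.Tensor:
    ents = []
    for logits in logits_per_modality:
        p = F.softmax(logits, dim=-1)
        ents.append(-(p * p.clamp_min(1e-8).log()).sum(-1))  # (batch,)
    H = torch.stack(ents, dim=-1)        # (batch, n_modalities)
    return F.softmax(-H, dim=-1)         # low entropy -> high weight

text_logits, audio_logits = torch.randn(4, 3), torch.randn(4, 3)
w = entropy_weights([text_logits, audio_logits])  # (4, 2), rows sum to 1
```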
arXiv Detail & Related papers (2025-10-02T05:05:41Z) - Joint Learning using Mixture-of-Expert-Based Representation for Enhanced Speech Generation and Robust Emotion Recognition [54.44798086835314]
Speech emotion recognition (SER) plays a critical role in building emotion-aware speech systems, but its performance degrades significantly under noisy conditions. We propose the Sparse Mixture-of-Experts Representation Integration Technique (Sparse MERIT), a flexible MTL framework that applies frame-wise expert routing over self-supervised speech representations. Experiments on the MSP-Podcast corpus show that Sparse MERIT consistently outperforms baseline models on both SER and SE tasks.
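Frame-wise expert routing as described can be sketched as a top-1 softmax router applied per time step over self-supervised speech features. Layer sizes, the routing rule, and the module name are assumptions; Sparse MERIT's actual design may differ.

```python
# Per-frame top-1 mixture-of-experts routing over speech features.
# Hypothetical sketch, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FramewiseMoE(nn.Module):
    def __init__(self, dim: int = 768, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, dim) self-supervised speech features.
        gates = F.softmax(self.router(frames), dim=-1)   # (B, T, E)
        top_w, top_idx = gates.max(dim=-1)               # top-1 per frame
        out = torch.zeros_like(frames)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e                          # frames routed to e
            if mask.any():
                out[mask] = top_w[mask].unsqueeze(-1) * expert(frames[mask])
        return out

y = FramewiseMoE()(torch.randn(2, 50, 768))              # (2, 50, 768)
```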
arXiv Detail & Related papers (2025-09-10T10:18:56Z) - FindRec: Stein-Guided Entropic Flow for Multi-Modal Sequential Recommendation [50.438552588818]
We propose FindRec (Flexible unified information disentanglement for multi-modal sequential Recommendation). A Stein kernel-based Integrated Information Coordination Module (IICM) theoretically guarantees distribution consistency between multimodal features and ID streams, and a cross-modal expert routing mechanism adaptively filters and combines multimodal features based on their contextual relevance.
arXiv Detail & Related papers (2025-07-07T04:09:45Z) - Reducing Unimodal Bias in Multi-Modal Semantic Segmentation with Multi-Scale Functional Entropy Regularization [66.10528870853324]
Fusing and balancing multi-modal inputs from novel sensors for dense prediction tasks is critically important. One major limitation is the tendency of multi-modal frameworks to over-rely on easily learnable modalities. We propose a plug-and-play regularization term based on functional entropy, which introduces no additional parameters.
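For a non-negative function f of a random perturbation, functional entropy is Ent(f) = E[f log f] - E[f] log E[f]. The Monte Carlo sketch below estimates it per modality and penalizes insensitivity to either one; the sampling scheme and the min-based penalty are illustrative assumptions, not the paper's multi-scale term.

```python
# Monte Carlo estimate of functional entropy of a model output f under random
# input perturbations. Assumed setup for illustration only.
import torch

def functional_entropy(f_samples: torch.Tensor) -> torch.Tensor:
    # f_samples: (num_samples,) positive values of f under perturbations,
    # e.g. the predicted class probability with one modality's input noised.
    f = f_samples.clamp_min(1e-8)
    return (f * f.log()).mean() - f.mean() * f.mean().log()

# A modality whose perturbations barely move the prediction has near-zero
# functional entropy; rewarding the smallest per-modality entropy pushes the
# model to stay sensitive to every modality.
ent_img = functional_entropy(torch.rand(64))
ent_txt = functional_entropy(torch.rand(64))
reg = -torch.minimum(ent_img, ent_txt)   # add to the task loss
```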
arXiv Detail & Related papers (2025-05-10T12:58:15Z) - Spectrum-based Modality Representation Fusion Graph Convolutional Network for Multimodal Recommendation [7.627299398469962]
We propose a new Spectrum-based Modality Representation graph recommender. It aims to capture both uni-modal and fusion preferences while simultaneously suppressing modality noise. Experiments on three real-world datasets show the efficacy of our proposed model.
arXiv Detail & Related papers (2024-12-19T15:53:21Z) - Asynchronous Multimodal Video Sequence Fusion via Learning Modality-Exclusive and -Agnostic Representations [19.731611716111566]
We propose a Multimodal fusion approach for learning modality-Exclusive and modality-Agnostic representations.
We introduce a predictive self-attention module to capture reliable context dynamics within modalities.
A hierarchical cross-modal attention module is designed to explore valuable element correlations among modalities (see the sketch after this entry).
A double-discriminator strategy is presented to ensure the production of distinct representations in an adversarial manner.
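The hierarchical cross-modal attention mentioned above can be sketched as two stacked attention passes: element-level attention lets one modality query another, and a sequence-level pass pools the result. Shapes and module names are assumed for illustration, not taken from the paper.

```python
# Two-level cross-modal attention: element-level correlation, then a
# CLS-style sequence-level summary. Structural sketch under assumed shapes.
import torch
import torch.nn as nn

class HierarchicalXAttn(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.element = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.summary = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # Element level: each token of modality a attends over modality b.
        x, _ = self.element(a, b, b)              # (B, len_a, dim)
        # Sequence level: a learnable token pools the correlated sequence.
        cls = self.cls.expand(a.size(0), -1, -1)
        out, _ = self.summary(cls, x, x)          # (B, 1, dim)
        return out.squeeze(1)

z = HierarchicalXAttn()(torch.randn(2, 20, 128), torch.randn(2, 30, 128))
```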
arXiv Detail & Related papers (2024-07-06T04:36:48Z) - Deep Common Feature Mining for Efficient Video Semantic Segmentation [25.851900402539467]
We present Deep Common Feature Mining (DCFM) for video semantic segmentation. DCFM explicitly decomposes features into two complementary components. We incorporate a self-supervised loss function to reinforce intra-class feature similarity and enhance temporal consistency.
arXiv Detail & Related papers (2024-03-05T06:17:59Z) - Object Segmentation by Mining Cross-Modal Semantics [68.88086621181628]
We propose a novel approach by mining the Cross-Modal Semantics to guide the fusion and decoding of multimodal features.
Specifically, we propose a novel network, termed XMSNet, consisting of (1) all-round attentive fusion (AF), (2) coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision.
arXiv Detail & Related papers (2023-05-17T14:30:11Z) - Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)