MCFNet: A Multimodal Collaborative Fusion Network for Fine-Grained Semantic Classification
- URL: http://arxiv.org/abs/2505.23365v1
- Date: Thu, 29 May 2025 11:42:57 GMT
- Title: MCFNet: A Multimodal Collaborative Fusion Network for Fine-Grained Semantic Classification
- Authors: Yang Qiao, Xiaoyu Zhong, Xiaofeng Gu, Zhiguo Yu
- Abstract summary: MCFNet is a Multimodal Collaborative Fusion Network designed for fine-grained classification. The MCFNet architecture incorporates a regularized integrated fusion module that improves intra-modal feature representation. A multimodal decision classification module exploits inter-modal correlations and unimodal discriminative features.
- Score: 2.7936465461948945
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal information processing has become increasingly important for enhancing image classification performance. However, the intricate and implicit dependencies across different modalities often hinder conventional methods from effectively capturing fine-grained semantic interactions, thereby limiting their applicability in high-precision classification tasks. To address this issue, we propose a novel Multimodal Collaborative Fusion Network (MCFNet) designed for fine-grained classification. The proposed MCFNet architecture incorporates a regularized integrated fusion module that improves intra-modal feature representation through modality-specific regularization strategies, while facilitating precise semantic alignment via a hybrid attention mechanism. Additionally, we introduce a multimodal decision classification module, which jointly exploits inter-modal correlations and unimodal discriminative features by integrating multiple loss functions within a weighted voting paradigm. Extensive experiments and ablation studies on benchmark datasets demonstrate that the proposed MCFNet framework achieves consistent improvements in classification accuracy, confirming its effectiveness in modeling subtle cross-modal semantics.
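The abstract does not spell out how the weighted voting step combines the branches, so the following is only a minimal sketch under stated assumptions: each branch (hypothetically named `image`, `text`, and `fused`) produces class logits, the logits are turned into probabilities with a softmax, and the probabilities are averaged with fixed, normalized weights. The function names, branch names, and weight values are illustrative and are not taken from the paper.

```python
import math
from typing import Dict, List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_vote(branch_logits: Dict[str, List[float]],
                  weights: Dict[str, float]) -> List[float]:
    """Average per-branch class probabilities using normalized weights.

    Assumption: the vote is a convex combination of softmax outputs.
    The paper may use a different rule.
    """
    total_w = sum(weights[name] for name in branch_logits)
    num_classes = len(next(iter(branch_logits.values())))
    fused = [0.0] * num_classes
    for name, logits in branch_logits.items():
        w = weights[name] / total_w
        for c, p in enumerate(softmax(logits)):
            fused[c] += w * p
    return fused


def predict(branch_logits: Dict[str, List[float]],
            weights: Dict[str, float]) -> int:
    """Return the index of the class with the highest voted probability."""
    probs = weighted_vote(branch_logits, weights)
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical logits for three classes from two unimodal heads and one fused head.
example_logits = {
    "image": [2.0, 1.0, 0.1],
    "text": [0.5, 2.5, 0.3],
    "fused": [1.0, 1.2, 0.2],
}
example_weights = {"image": 0.3, "text": 0.3, "fused": 0.4}
voted = weighted_vote(example_logits, example_weights)
label = predict(example_logits, example_weights)
```

Because the weights are normalized inside `weighted_vote`, the voted vector is itself a probability distribution, and putting all the weight on one branch reduces the vote to that branch's own prediction. In a trained system the weights would presumably be tuned or learned alongside the per-branch losses; here they are fixed by hand.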
Related papers
- Reconstructing Content via Collaborative Attention to Improve Multimodal Embedding Quality [59.651410243721045]
CoCoA is a Content reconstruction pre-training paradigm based on Collaborative Attention for multimodal embedding optimization. We introduce an EOS-based reconstruction task, encouraging the model to reconstruct input from the corresponding <EOS> embeddings. Experiments on MMEB-V1 demonstrate that CoCoA built upon Qwen2-VL and Qwen2.5-VL significantly improves embedding quality.
arXiv Detail & Related papers (2026-03-02T05:34:45Z)
- From Sparse Decisions to Dense Reasoning: A Multi-attribute Trajectory Paradigm for Multimodal Moderation [59.27094165576015]
We propose a novel learning paradigm (UniMod) that transitions from sparse decision-making to dense reasoning traces. By constructing structured trajectories encompassing evidence grounding, modality assessment, risk mapping, policy decision, and response generation, we reformulate monolithic decision tasks into a multi-dimensional boundary learning process. We introduce specialized optimization strategies to decouple task-specific parameters and rebalance training dynamics, effectively resolving interference between diverse objectives in multi-task learning.
arXiv Detail & Related papers (2026-01-28T09:29:40Z)
- Modality-Specific Enhancement and Complementary Fusion for Semi-Supervised Multi-Modal Brain Tumor Segmentation [6.302779966909783]
We propose a novel semi-supervised multi-modal framework for medical image segmentation. We introduce a Modality-specific Enhancing Module (MEM) to strengthen the semantic cues unique to each modality. We also introduce a learnable Complementary Information Fusion (CIF) module to adaptively exchange complementary knowledge between modalities.
arXiv Detail & Related papers (2025-12-10T16:15:17Z)
- Decoupled Multimodal Fusion for User Interest Modeling in Click-Through Rate Prediction [6.663141182602147]
We propose Decoupled Multimodal Fusion (DMF) to enable fine-grained interactions between ID-based collaborative representations and multimodal representations for user interest modeling. We construct target-aware features to bridge the semantic gap across different embedding spaces and leverage them as side information to enhance the effectiveness of user interest modeling. DMF has been deployed on the product recommendation system of the international e-commerce platform, achieving relative improvements of 5.30% in CTCVR and 7.43% in GMV with negligible computational overhead.
arXiv Detail & Related papers (2025-10-13T07:06:26Z)
- Modality Alignment with Multi-scale Bilateral Attention for Multimodal Recommendation [9.91438130100011]
MambaRec is a novel framework that integrates local feature alignment and global distribution regularization. The DREAM module captures hierarchical relationships and context-aware associations, improving cross-modal semantic modeling. Experiments on real-world e-commerce datasets show that MambaRec outperforms existing methods in fusion quality, generalization, and efficiency.
arXiv Detail & Related papers (2025-09-11T02:52:26Z)
- Class Similarity-Based Multimodal Classification under Heterogeneous Category Sets [22.03742325512164]
We propose the practical setting termed Multi-Modal Heterogeneous Category-set Learning (MMHCL). Our method significantly outperforms existing state-of-the-art approaches on multiple benchmark datasets.
arXiv Detail & Related papers (2025-06-11T13:49:22Z)
- BiXFormer: A Robust Framework for Maximizing Modality Effectiveness in Multi-Modal Semantic Segmentation [55.486872677160015]
We reformulate multi-modal semantic segmentation as a mask-level classification task. We propose BiXFormer, which integrates Unified Modality Matching (UMM) and Cross Modality Alignment (CMA). Experiments on both synthetic and real-world multi-modal benchmarks demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2025-06-04T08:04:58Z)
- Representation Learning with Mutual Influence of Modalities for Node Classification in Multi-Modal Heterogeneous Networks [16.669479456576322]
We propose a novel model for node classification in MMHNs, named Heterogeneous Graph Neural Network with Inter-Modal Attention (HGNN-IMA).
arXiv Detail & Related papers (2025-05-12T02:59:46Z)
- DecAlign: Hierarchical Cross-Modal Alignment for Decoupled Multimodal Representation Learning [7.947217265041953]
Multimodal representation learning aims to capture both shared and complementary semantic information across multiple modalities. We introduce DecAlign, a novel hierarchical cross-modal alignment framework designed to decouple multimodal representations into modality-unique (heterogeneous) and modality-common (homogeneous) features. Our experiments on four widely used multimodal benchmarks demonstrate that DecAlign consistently outperforms existing state-of-the-art methods.
arXiv Detail & Related papers (2025-03-14T21:47:48Z)
- M$^3$amba: CLIP-driven Mamba Model for Multi-modal Remote Sensing Classification [23.322598623627222]
M$^3$amba is a novel end-to-end CLIP-driven Mamba model for multi-modal fusion. We introduce CLIP-driven modality-specific adapters to achieve a comprehensive semantic understanding of different modalities. Experiments have shown that M$^3$amba has an average performance improvement of at least 5.98% compared with the state-of-the-art methods.
arXiv Detail & Related papers (2025-03-09T05:06:47Z)
- MTPareto: A MultiModal Targeted Pareto Framework for Fake News Detection [34.09249215878179]
Multimodal fake news detection is essential for maintaining the authenticity of Internet multimedia information. To address this problem, we propose the MTPareto framework to optimize multimodal fusion. Experimental results on the FakeSV and FVC datasets show that the proposed framework outperforms baselines.
arXiv Detail & Related papers (2025-01-12T10:14:29Z)
- Category-Adaptive Cross-Modal Semantic Refinement and Transfer for Open-Vocabulary Multi-Label Recognition [59.203152078315235]
We propose a novel category-adaptive cross-modal semantic refinement and transfer (C$^2$SRT) framework to explore the semantic correlation. The proposed framework consists of two complementary modules: an intra-category semantic refinement (ISR) module and an inter-category semantic transfer (IST) module. Experiments on OV-MLR benchmarks clearly demonstrate that the proposed C$^2$SRT framework outperforms current state-of-the-art algorithms.
arXiv Detail & Related papers (2024-12-09T04:00:18Z)
- Unleashing Network Potentials for Semantic Scene Completion [50.95486458217653]
This paper proposes a novel SSC framework, the Adversarial Modality Modulation Network (AMMNet).
AMMNet introduces two core modules: a cross-modal modulation enabling the interdependence of gradient flows between modalities, and a customized adversarial training scheme leveraging dynamic gradient competition.
Extensive experimental results demonstrate that AMMNet outperforms state-of-the-art SSC methods by a large margin.
arXiv Detail & Related papers (2024-03-12T11:48:49Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- Exploiting modality-invariant feature for robust multimodal emotion recognition with missing modalities [76.08541852988536]
We propose to use invariant features for a missing modality imagination network (IF-MMIN).
We show that the proposed model outperforms all baselines and invariantly improves the overall emotion recognition performance under uncertain missing-modality conditions.
arXiv Detail & Related papers (2022-10-27T12:16:25Z)
- Deep Multimodal Fusion by Channel Exchanging [87.40768169300898]
This paper proposes a parameter-free multimodal fusion framework that dynamically exchanges channels between sub-networks of different modalities.
The validity of such exchanging process is also guaranteed by sharing convolutional filters yet keeping separate BN layers across modalities, which, as an add-on benefit, allows our multimodal architecture to be almost as compact as a unimodal network.
arXiv Detail & Related papers (2020-11-10T09:53:20Z)