Related papers: CROSSAN: Towards Efficient and Effective Adaptation of Multiple Multimodal Foundation Models for Sequential Recommendation

CROSSAN: Towards Efficient and Effective Adaptation of Multiple Multimodal Foundation Models for Sequential Recommendation

URL: http://arxiv.org/abs/2504.10307v1
Date: Mon, 14 Apr 2025 15:14:59 GMT
Title: CROSSAN: Towards Efficient and Effective Adaptation of Multiple Multimodal Foundation Models for Sequential Recommendation
Authors: Junchen Fu, Yongxin Ni, Joemon M. Jose, Ioannis Arapakis, Kaiwen Zheng, Youhua Li, Xuri Ge,
Abstract summary: Multimodal Foundation Models (MFMs) excel at representing diverse raw modalities.<n>MFMs' application in sequential recommendation remains largely unexplored.<n>It remains unclear whether we can efficiently adapt multiple (>2) MFMs for the sequential recommendation task.<n>We propose a plug-and-play Cross-modal Side Adapter Network (CROSSAN)
Score: 6.013740443562439
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal Foundation Models (MFMs) excel at representing diverse raw modalities (e.g., text, images, audio, videos, etc.). As recommender systems increasingly incorporate these modalities, leveraging MFMs to generate better representations has great potential. However, their application in sequential recommendation remains largely unexplored. This is primarily because mainstream adaptation methods, such as Fine-Tuning and even Parameter-Efficient Fine-Tuning (PEFT) techniques (e.g., Adapter and LoRA), incur high computational costs, especially when integrating multiple modality encoders, thus hindering research progress. As a result, it remains unclear whether we can efficiently and effectively adapt multiple (>2) MFMs for the sequential recommendation task. To address this, we propose a plug-and-play Cross-modal Side Adapter Network (CROSSAN). Leveraging the fully decoupled side adapter-based paradigm, CROSSAN achieves high efficiency while enabling cross-modal learning across diverse modalities. To optimize the final stage of multimodal fusion across diverse modalities, we adopt the Mixture of Modality Expert Fusion (MOMEF) mechanism. CROSSAN achieves superior performance on the public datasets for adapting four foundation models with raw modalities. Performance consistently improves as more MFMs are adapted. We will release our code and datasets to facilitate future research.

Related papers

MokA: Multimodal Low-Rank Adaptation for MLLMs [11.440424554587674]
Multimodal low-rank Adaptation (MokA) is a multimodal-aware efficient fine-tuning strategy.<n>MokA compresses unimodal information by modality-specific parameters while explicitly enhancing cross-modal interaction.
arXiv Detail & Related papers (2025-06-05T16:04:08Z)
Enhancing Multi-modal Models with Heterogeneous MoE Adapters for Fine-tuning [3.8984478257737734]
Multi-modal models excel in cross-modal tasks but are computationally expensive due to their billions of parameters.<n>Existing methods primarily focus on uni-modal processing, overlooking the critical modal fusion needed for multi-modal tasks.<n>We propose a mixture of experts that extend the traditional PEFT framework to support multi-modal expert combinations and improve information interaction.
arXiv Detail & Related papers (2025-03-26T15:26:18Z)
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization [65.64108848398696]
We introduce a preference optimization (PO) process to enhance the multimodal reasoning capabilities of MLLMs.<n>Specifically, we design an automated preference data construction pipeline to create MMPR, a high-quality, large-scale multimodal reasoning preference dataset.<n>We explore integrating PO with MLLMs, developing a simple yet effective method, termed Mixed Preference Optimization (MPO), which boosts multimodal CoT performance.
arXiv Detail & Related papers (2024-11-15T18:59:27Z)
LLMs Can Evolve Continually on Modality for X-Modal Reasoning [62.2874638875554]
Existing methods rely heavily on modal-specific pretraining and joint-modal tuning, leading to significant computational burdens when expanding to new modalities. We propose PathWeave, a flexible and scalable framework with modal-Path sWitching and ExpAnsion abilities. PathWeave performs comparably to state-of-the-art MLLMs while concurrently reducing parameter training burdens by 98.73%.
arXiv Detail & Related papers (2024-10-26T13:19:57Z)
Unleashing the Potential of Multi-Channel Fusion in Retrieval for Personalized Recommendations [33.79863762538225]
A key challenge in Recommender systems (RS) is efficiently processing vast item pools to deliver highly personalized recommendations under strict latency constraints. In this paper, we explore advanced channel fusion strategies by assigning systematically optimized weights to each channel. Our methods enhance both personalization and flexibility, achieving significant performance improvements across multiple datasets and yielding substantial gains in real-world deployments.
arXiv Detail & Related papers (2024-10-21T14:58:38Z)
Train Once, Deploy Anywhere: Matryoshka Representation Learning for Multimodal Recommendation [27.243116376164906]
We introduce a lightweight framework called full-scale Matryoshka representation learning for multimodal recommendation (fMRLRec) Our fMRLRec captures item features at different granularities, learning informative representations for efficient recommendation across multiple dimensions. We demonstrate the effectiveness and efficiency of fMRLRec on multiple benchmark datasets.
arXiv Detail & Related papers (2024-09-25T05:12:07Z)
U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: An Unbiased Multiscale Modal Fusion Model for Multimodal Semantics. We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features. Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z)
Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts [54.529880848937104]
We develop a unified MLLM with the MoE architecture, named Uni-MoE, that can handle a wide array of modalities. Specifically, it features modality-specific encoders with connectors for a unified multimodal representation. We evaluate the instruction-tuned Uni-MoE on a comprehensive set of multimodal datasets.
arXiv Detail & Related papers (2024-05-18T12:16:01Z)
MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild [81.32127423981426]
Multimodal emotion recognition based on audio and video data is important for real-world applications. Recent methods have focused on exploiting advances of self-supervised learning (SSL) for pre-training of strong multimodal encoders. We propose a different perspective on the problem and investigate the advancement of multimodal DFER performance by adapting SSL-pre-trained disjoint unimodal encoders.
arXiv Detail & Related papers (2024-04-13T13:39:26Z)
CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion [58.15403987979496]
CREMA is a generalizable, highly efficient, and modular modality-fusion framework for video reasoning.<n>We propose a novel progressive multimodal fusion design supported by a lightweight fusion module and modality-sequential training strategy.<n>We validate our method on 7 video-language reasoning tasks assisted by diverse modalities, including VideoQA and Video-Audio/3D/Touch/Thermal QA.
arXiv Detail & Related papers (2024-02-08T18:27:22Z)
Dynamic Multimodal Fusion [8.530680502975095]
Dynamic multimodal fusion (DynMM) is a new approach that adaptively fuses multimodal data and generates data-dependent forward paths during inference. Results on various multimodal tasks demonstrate the efficiency and wide applicability of our approach.
arXiv Detail & Related papers (2022-03-31T21:35:13Z)
Improving Multimodal Fusion with Hierarchical Mutual Information Maximization for Multimodal Sentiment Analysis [16.32509144501822]
We propose a framework named MultiModal InfoMax (MMIM), which hierarchically maximizes the Mutual Information (MI) in unimodal input pairs. The framework is jointly trained with the main task (MSA) to improve the performance of the downstream MSA task.
arXiv Detail & Related papers (2021-09-01T14:45:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.