A Remarkably Efficient Paradigm to Multimodal Large Language Models for Sequential Recommendation
- URL: http://arxiv.org/abs/2511.05885v2
- Date: Wed, 12 Nov 2025 01:32:21 GMT
- Title: A Remarkably Efficient Paradigm to Multimodal Large Language Models for Sequential Recommendation
- Authors: Qiyong Zhong, Jiajie Su, Ming Yang, Yunshan Ma, Xiaolin Zheng, Chaochao Chen
- Abstract summary: Sequential recommendations (SR) predict users' future interactions based on their historical behavior. We propose Speeder, an efficient MLLM-based paradigm for SR featuring three key innovations. Speeder increases training speed to 250% of the original while reducing inference time to 25% on the Amazon dataset.
- Score: 33.469423146286296
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sequential recommendations (SR) predict users' future interactions based on their historical behavior. The rise of Large Language Models (LLMs) has brought powerful generative and reasoning capabilities, significantly enhancing SR performance, while Multimodal LLMs (MLLMs) further extend this by introducing data like images and interactive relationships. However, critical issues remain, i.e., (a) Suboptimal item representations caused by lengthy and redundant descriptions, leading to inefficiencies in both training and inference; (b) Modality-related cognitive bias, as LLMs are predominantly pretrained on textual data, limiting their ability to effectively integrate and utilize non-textual modalities; (c) Weakening sequential perception in long interaction sequences, where attention mechanisms struggle to capture earlier interactions, hindering the modeling of long-range dependencies. To address these issues, we propose Speeder, an efficient MLLM-based paradigm for SR featuring three key innovations: 1) Multimodal Representation Compression (MRC), which condenses item attributes into concise yet informative tokens, reducing redundancy and computational cost; 2) Modality-aware Progressive Optimization (MPO), enabling gradual learning of multimodal representations; 3) Sequential Position Awareness Enhancement (SPAE), improving the LLM's capability to capture both relative and absolute sequential dependencies in long interaction sequences. Extensive experiments on real-world datasets demonstrate the effectiveness and efficiency of Speeder. Speeder increases training speed to 250% of the original while reducing inference time to 25% on the Amazon dataset.
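The two efficiency figures in the abstract use different bases: one is a speed relative to the original, the other a time relative to the original. A minimal sketch of converting both into speedup factors follows. The function names are hypothetical and chosen only for illustration, and the calculation assumes "250%" and "25%" are both measured against the same unmodified baseline, which the abstract implies but does not spell out.

```python
def speedup_from_speed_percent(speed_percent: float) -> float:
    """Convert a speed given as a percentage of the baseline into a speedup factor."""
    return speed_percent / 100.0


def speedup_from_time_percent(time_percent: float) -> float:
    """Convert a run time given as a percentage of the baseline into a speedup factor."""
    return 100.0 / time_percent


# Figures quoted in the abstract (Amazon dataset).
train_speedup = speedup_from_speed_percent(250.0)  # training speed is 250% of the original
train_time_fraction = 1.0 / train_speedup          # fraction of the original training time
infer_speedup = speedup_from_time_percent(25.0)    # inference time is 25% of the original

print(f"training: {train_speedup:.1f}x faster, {train_time_fraction:.0%} of original time")
print(f"inference: {infer_speedup:.1f}x faster")
```

Under that reading, training takes roughly 40% of the original time and inference is about four times faster.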
Related papers
- Reconstructing Content via Collaborative Attention to Improve Multimodal Embedding Quality [59.651410243721045]
CoCoA is a Content reconstruction pre-training paradigm based on Collaborative Attention for multimodal embedding optimization. We introduce an EOS-based reconstruction task, encouraging the model to reconstruct the input from the corresponding <EOS> embeddings. Experiments on MMEB-V1 demonstrate that CoCoA built upon Qwen2-VL and Qwen2.5-VL significantly improves embedding quality.
arXiv Detail & Related papers (2026-03-02T05:34:45Z)
- CREM: Compression-Driven Representation Enhancement for Multimodal Retrieval and Comprehension [49.6969505536365]
We propose CREM, a unified framework that enhances multimodal representations for retrieval while preserving generative ability. CREM achieves state-of-the-art retrieval performance on MMEB while maintaining strong generative performance on multiple comprehension benchmarks.
arXiv Detail & Related papers (2026-02-22T08:09:51Z)
- DMESR: Dual-view MLLM-based Enhancing Framework for Multimodal Sequential Recommendation [13.114773060703891]
We propose a Dual-view MLLM-based Enhancing framework for multimodal Sequential Recommendation (DMESR). For the misalignment issue, we employ a contrastive learning mechanism to align the cross-modal semantic representations generated by MLLMs. For the loss of fine-grained semantics, we introduce a cross-attention fusion module that integrates the coarse-grained semantic knowledge obtained from MLLMs with the fine-grained original textual semantics.
arXiv Detail & Related papers (2026-02-14T10:42:56Z)
- Cross-Modal Attention Network with Dual Graph Learning in Multimodal Recommendation [12.802844514133255]
We propose the Cross-modal Recursive Attention Network with dual graph Embedding (CRANE). We design a core Recursive Cross-Modal Attention (RCA) mechanism that iteratively refines modality features based on cross-correlations in a joint latent space. For symmetric multimodal learning, we explicitly construct users' multimodal profiles by aggregating features of their interacted items.
arXiv Detail & Related papers (2026-01-16T10:09:39Z)
- A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models [85.30893355216486]
We study how visual token redundancy evolves with different dMLLM architectures and tasks. Our study reveals that visual redundancy emerges only in from-scratch dMLLMs while handling long-answer tasks. Layer-skipping is promising for accelerating AR-to-diffusion dMLLMs, whereas progressive or late-step pruning is more effective for from-scratch dMLLMs.
arXiv Detail & Related papers (2025-11-19T04:13:36Z)
- Empowering Large Language Model for Sequential Recommendation via Multimodal Embeddings and Semantic IDs [28.752042722391934]
Sequential recommendation (SR) aims to capture users' dynamic interests and sequential patterns based on their historical interactions. MME-SID integrates multimodal embeddings and quantized embeddings to mitigate embedding collapse. Extensive experiments on three public datasets validate the superior performance of MME-SID.
arXiv Detail & Related papers (2025-09-02T07:02:29Z)
- MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models. MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z)
- Transferable Sequential Recommendation with Vanilla Cross-Entropy Loss [2.0048375809706274]
Sequential Recommendation (SR) systems model user preferences by analyzing interaction histories. Current methods incur substantial fine-tuning costs when adapting to new domains. We propose MMM4Rec, a novel multi-modal SR framework that incorporates a dedicated algebraic constraint mechanism for efficient transfer learning.
arXiv Detail & Related papers (2025-06-03T14:18:19Z)
- PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection [68.8373788348678]
Visual instruction tuning adapts pre-trained Multimodal Large Language Models to follow human instructions. PRISM is the first training-free framework for efficient visual instruction selection. It reduces the end-to-end time for data selection and model tuning to just 30% of conventional pipelines.
arXiv Detail & Related papers (2025-02-17T18:43:41Z)
- Hierarchical Time-Aware Mixture of Experts for Multi-Modal Sequential Recommendation [19.47124940518026]
We propose a Hierarchical time-aware Mixture of experts for multi-modal Sequential Recommendation (HM4SR). The first MoE, named Interactive MoE, extracts essential user interest-related information from the multi-modal data of each item. The second MoE, termed Temporal MoE, captures users' dynamic interests by introducing explicit temporal embeddings from timestamps in modality encoding.
arXiv Detail & Related papers (2025-01-24T06:26:50Z)
- Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization [65.64108848398696]
We introduce a preference optimization (PO) process to enhance the multimodal reasoning capabilities of MLLMs. Specifically, we design an automated preference data construction pipeline to create MMPR, a high-quality, large-scale multimodal reasoning preference dataset. We explore integrating PO with MLLMs, developing a simple yet effective method, termed Mixed Preference Optimization (MPO), which boosts multimodal CoT performance.
arXiv Detail & Related papers (2024-11-15T18:59:27Z)
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, showing emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z)
- M$^2$PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning [90.75075886543404]
Multimodal Large Language Models (MLLMs) demonstrate remarkable performance across a wide range of domains.
In this work, we introduce a novel Multimodal Prompt Tuning (M$^2$PT) approach for efficient instruction tuning of MLLMs.
arXiv Detail & Related papers (2024-09-24T01:40:24Z)
- STLLM-DF: A Spatial-Temporal Large Language Model with Diffusion for Enhanced Multi-Mode Traffic System Forecasting [32.943673568195315]
We propose the Spatial-Temporal Large Language Model (STLLM-DF) to improve multi-task transportation prediction.
The DDPM's robust denoising capabilities enable it to recover underlying data patterns from noisy inputs.
We show that STLLM-DF consistently outperforms existing models, achieving an average reduction of 2.40% in MAE, 4.50% in RMSE, and 1.51% in MAPE.
arXiv Detail & Related papers (2024-09-08T15:29:27Z)
- Not All Attention is Needed: Parameter and Computation Efficient Transfer Learning for Multi-modal Large Language Models [73.48675708831328]
We propose a novel parameter- and computation-efficient tuning method for Multi-modal Large Language Models (MLLMs).
The Efficient Attention Skipping (EAS) method evaluates the attention redundancy and skips the less important MHAs to speed up inference.
The experiments show that EAS not only retains high performance and parameter efficiency, but also greatly speeds up inference.
arXiv Detail & Related papers (2024-03-22T14:20:34Z)
- When Parameter-efficient Tuning Meets General-purpose Vision-language Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z)
- SmartTrim: Adaptive Tokens and Attention Pruning for Efficient Vision-Language Models [35.5601603013045]
We propose SmartTrim, an adaptive acceleration framework for Vision-Language Models (VLMs).
We integrate lightweight modules into the original backbone to identify and prune redundant token representations and attention heads within each layer.
We devise a self-distillation strategy to enhance the consistency between the predictions of the pruned model and its full-capacity counterpart.
arXiv Detail & Related papers (2023-05-24T11:18:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.