Multimodal Fusion And Sparse Attention-based Alignment Model for Long Sequential Recommendation
- URL: http://arxiv.org/abs/2508.09664v1
- Date: Wed, 13 Aug 2025 09:50:44 GMT
- Title: Multimodal Fusion And Sparse Attention-based Alignment Model for Long Sequential Recommendation
- Authors: Yongrui Fu, Jian Liu, Tao Li, Zonggang Wu, Shouke Qin, Hanmeng Liu
- Abstract summary: Exploiting multimodal item sequences and mining multi-grained user interests can bridge the gap between content comprehension and recommendation. We propose MUFASA, a MUltimodal Fusion And Sparse Attention-based Alignment model for long sequential recommendation. Experiments on real-world benchmarks show that MUFASA consistently surpasses state-of-the-art baselines.
- Score: 9.086257183699418
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in multimodal recommendation enable richer item understanding, while modeling users' multi-scale interests across temporal horizons has attracted growing attention. However, effectively exploiting multimodal item sequences and mining multi-grained user interests to substantially bridge the gap between content comprehension and recommendation remain challenging. To address these issues, we propose MUFASA, a MUltimodal Fusion And Sparse Attention-based Alignment model for long sequential recommendation. Our model comprises two core components. First, the Multimodal Fusion Layer (MFL) leverages item titles as a cross-genre semantic anchor and is trained with a joint objective of four tailored losses that promote: (i) cross-genre semantic alignment, (ii) alignment to the collaborative space for recommendation, (iii) preserving the similarity structure defined by titles and preventing modality representation collapse, and (iv) distributional regularization of the fusion space. This yields high-quality fused item representations for further preference alignment. Second, the Sparse Attention-guided Alignment Layer (SAL) scales to long user-behavior sequences via a multi-granularity sparse attention mechanism, which incorporates windowed attention, block-level attention, and selective attention, to capture user interests hierarchically and across temporal horizons. SAL explicitly models both the evolution of coherent interest blocks and fine-grained intra-block variations, producing robust user and item representations. Extensive experiments on real-world benchmarks show that MUFASA consistently surpasses state-of-the-art baselines. Moreover, online A/B tests demonstrate significant gains in production, confirming MUFASA's effectiveness in leveraging multimodal cues and accurately capturing diverse user preferences.
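The paper ships no reference code, so as a rough illustration only: the sketch below composes a multi-granularity sparse attention mask in PyTorch from the three ingredients the abstract names (windowed, block-level, and selective attention). The window size, block size, top-k rule, causal masking, and the use of block-leading positions as coarse anchors are our assumptions, not the authors' specification.

```python
# Minimal sketch of a multi-granularity sparse attention mask (assumed
# hyperparameters; not MUFASA's actual configuration).
import torch

def sparse_attention_mask(scores: torch.Tensor, window: int = 8,
                          block: int = 32, top_k: int = 4) -> torch.Tensor:
    """scores: (L, L) raw attention logits; returns a boolean keep-mask."""
    L = scores.size(0)
    idx = torch.arange(L)
    causal = idx[None, :] <= idx[:, None]  # queries never attend ahead

    # (i) windowed attention: each step keeps its recent neighbours
    windowed = (idx[:, None] - idx[None, :]).abs() < window

    # (ii) block-level attention: coarse links across interest blocks,
    # approximated here by letting every query reach block-leading items
    block_anchors = (idx % block == 0)[None, :].expand(L, L)

    # (iii) selective attention: keep the top-k highest-scoring keys per query
    masked = scores.masked_fill(~causal, float("-inf"))
    kth = masked.topk(min(top_k, L), dim=-1).values[:, -1:]
    selected = masked >= kth

    return (windowed | block_anchors | selected) & causal

scores = torch.randn(64, 64)
mask = sparse_attention_mask(scores)
attn = torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1)
```

Each query thus keeps a local window (fine-grained intra-block variation), a handful of coarse block anchors (coherent interest blocks), and its highest-scoring keys, which sparsifies the full L-by-L attention pattern while still spanning multiple temporal horizons.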
Related papers
- DeepInterestGR: Mining Deep Multi-Interest Using Multi-Modal LLMs for Generative Recommendation [0.0]
DeepInterestGR introduces three key innovations in a generative recommendation framework.
We leverage multi-LLM Interest Mining, Reward-Labeled Deep Interest, and Interest-Enhanced Item Discretization.
Experiments on three Amazon Review benchmarks demonstrate that DeepInterestGR consistently outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2026-02-21T17:03:06Z) - GEMs: Breaking the Long-Sequence Barrier in Generative Recommendation with a Multi-Stream Decoder [54.64137490632567]
We propose a novel and unified framework designed to capture user sequences from long-term history.
Generative Multi-streamers (GEMs) break user sequences into three streams.
Extensive experiments on large-scale industrial datasets demonstrate that GEMs significantly outperforms state-of-the-art methods in recommendation accuracy.
arXiv Detail & Related papers (2026-02-14T06:42:56Z) - Multimodal Generative Recommendation for Fusing Semantic and Collaborative Signals [17.608491612845306]
Sequential recommender systems rank relevant items by modeling a user's interaction history and computing the inner product between the resulting user representation and stored item embeddings.
To avoid the significant memory overhead of storing large item sets, the generative recommendation paradigm instead models each item as a series of discrete semantic codes.
These methods have yet to surpass traditional sequential recommenders on large item sets, limiting their adoption in the very scenarios they were designed to address.
We propose MSCGRec, a Multimodal Semantic and Collaborative Generative Recommender.
arXiv Detail & Related papers (2026-02-03T16:39:35Z) - Cross-Modal Attention Network with Dual Graph Learning in Multimodal Recommendation [12.802844514133255]
We present the Cross-modal Recursive Attention Network with dual graph Embedding (CRANE).
We design a core Recursive Cross-Modal Attention (RCA) mechanism that iteratively refines modality features based on cross-correlations in a joint latent space.
For symmetric multimodal learning, we explicitly construct users' multimodal profiles by aggregating the features of their interacted items.
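The summary gives only the high-level idea, but as a hedged sketch (our own simplification, not CRANE's architecture): recursive cross-modal attention can be read as two modality streams repeatedly querying each other in a shared latent space, with the step count, dimensions, and residual LayerNorm updates below all being assumptions.

```python
# Illustrative recursive cross-modal attention: two modality streams
# alternately attend to each other for a fixed number of refinement steps.
import torch
import torch.nn as nn

class RecursiveCrossAttention(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 4, steps: int = 3):
        super().__init__()
        self.steps = steps
        self.v_from_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.t_from_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v, self.norm_t = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, text: torch.Tensor):
        for _ in range(self.steps):
            v_upd, _ = self.v_from_t(visual, text, text)    # visual queries text
            t_upd, _ = self.t_from_v(text, visual, visual)  # text queries visual
            visual = self.norm_v(visual + v_upd)            # residual refinement
            text = self.norm_t(text + t_upd)
        return visual, text

vis, txt = torch.randn(2, 10, 64), torch.randn(2, 12, 64)
vis_out, txt_out = RecursiveCrossAttention()(vis, txt)
```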
arXiv Detail & Related papers (2026-01-16T10:09:39Z) - Amplifying Prominent Representations in Multimodal Learning via Variational Dirichlet Process [55.91649771370862]
The Dirichlet process (DP) mixture model is a powerful non-parametric method that can amplify the most prominent features.
We propose a new DP-driven multimodal learning framework that automatically achieves an optimal balance between prominent intra-modal representation learning and cross-modal alignment.
arXiv Detail & Related papers (2025-10-23T16:53:24Z) - Progressive Semantic Residual Quantization for Multimodal-Joint Interest Modeling in Music Recommendation [6.790539226766362]
We propose a novel multimodal recommendation framework with two stages.
In the first stage, our method generates modal-specific and modal-joint semantic IDs.
In the second stage, to model users' multimodal interests, a Multi-Codebook Cross-Attention network is designed.
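For readers unfamiliar with the underlying mechanism, the snippet below sketches plain residual quantization, the generic scheme behind semantic IDs: each stage snaps the remaining residual to its nearest codebook entry, so an item embedding becomes a short tuple of discrete codes. The codebook sizes and stage count are illustrative, and the paper's progressive, modal-joint variant adds structure on top of this.

```python
# Generic residual quantization: one discrete code per stage (illustrative
# sizes; not the paper's progressive semantic residual quantization).
import torch

def residual_quantize(x: torch.Tensor, codebooks: list) -> tuple:
    """x: (D,) item embedding; codebooks: list of (K, D) tensors."""
    ids, residual = [], x
    for cb in codebooks:
        dists = (residual[None, :] - cb).pow(2).sum(-1)  # L2 to every entry
        k = int(dists.argmin())
        ids.append(k)
        residual = residual - cb[k]  # next stage quantizes what is left
    return ids, residual

torch.manual_seed(0)
codebooks = [torch.randn(256, 32) for _ in range(3)]
ids, err = residual_quantize(torch.randn(32), codebooks)
print(ids)  # three codes, i.e. a 3-level semantic ID for the item
```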
arXiv Detail & Related papers (2025-08-28T02:16:57Z) - Distribution-Guided Auto-Encoder for User Multimodal Interest Cross Fusion [3.5015430462759936]
This paper proposes the Distribution-Guided Multimodal-Interest Auto-Encoder (DMAE), which achieves cross-fusion of users' multimodal interests at the behavioral level.
arXiv Detail & Related papers (2025-08-20T07:21:27Z) - FindRec: Stein-Guided Entropic Flow for Multi-Modal Sequential Recommendation [50.438552588818]
We propose FindRec (Flexible unified information disentanglement for multi-modal sequential Recommendation).
A Stein kernel-based Integrated Information Coordination Module (IICM) theoretically guarantees distribution consistency between multimodal features and ID streams.
A cross-modal expert routing mechanism adaptively filters and combines multimodal features based on their contextual relevance.
arXiv Detail & Related papers (2025-07-07T04:09:45Z) - LLM-Enhanced Multimodal Fusion for Cross-Domain Sequential Recommendation [19.654959889052638]
Cross-Domain Sequential Recommendation (CDSR) predicts user behavior by leveraging historical interactions across multiple domains.
We propose LLM-Enhanced Multimodal Fusion for Cross-Domain Sequential Recommendation (LLM-EMF), a novel approach that enhances textual information with Large Language Model (LLM) knowledge.
arXiv Detail & Related papers (2025-06-22T09:53:21Z) - Hierarchical Time-Aware Mixture of Experts for Multi-Modal Sequential Recommendation [19.47124940518026]
We propose a Hierarchical time-aware Mixture of experts for multi-modal Sequential Recommendation (HM4SR).
The first MoE, named Interactive MoE, extracts essential user-interest-related information from the multi-modal data of each item.
The second MoE, termed Temporal MoE, captures users' dynamic interests by introducing explicit temporal embeddings from timestamps into modality encoding.
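As a hedged sketch of the Temporal MoE idea only (expert count, dimensions, and soft gating are our assumptions, not the paper's configuration): routing can be conditioned on timestamp embeddings by concatenating them with item features before the gate, as below.

```python
# Illustrative time-aware mixture of experts: the gate sees item features
# concatenated with timestamp embeddings (assumed sizes, soft routing).
import torch
import torch.nn as nn

class TemporalMoE(nn.Module):
    def __init__(self, dim: int = 64, time_dim: int = 16, n_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(dim + time_dim, dim) for _ in range(n_experts)])
        self.gate = nn.Linear(dim + time_dim, n_experts)

    def forward(self, item_feats: torch.Tensor, time_emb: torch.Tensor):
        x = torch.cat([item_feats, time_emb], dim=-1)         # (B, L, D + T)
        weights = torch.softmax(self.gate(x), dim=-1)         # (B, L, E)
        outs = torch.stack([e(x) for e in self.experts], -2)  # (B, L, E, D)
        return (weights.unsqueeze(-1) * outs).sum(-2)         # (B, L, D)

moe = TemporalMoE()
y = moe(torch.randn(2, 10, 64), torch.randn(2, 10, 16))
```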
arXiv Detail & Related papers (2025-01-24T06:26:50Z) - Asynchronous Multimodal Video Sequence Fusion via Learning Modality-Exclusive and -Agnostic Representations [19.731611716111566]
We propose a Multimodal fusion approach for learning modality-Exclusive and modality-Agnostic representations.
We introduce a predictive self-attention module to capture reliable context dynamics within modalities.
A hierarchical cross-modal attention module is designed to explore valuable element correlations among modalities.
A double-discriminator strategy is presented to ensure the production of distinct representations in an adversarial manner.
arXiv Detail & Related papers (2024-07-06T04:36:48Z) - BiVRec: Bidirectional View-based Multimodal Sequential Recommendation [55.87443627659778]
We propose an innovative framework, BivRec, that jointly trains the recommendation tasks in both ID and multimodal views.
BivRec achieves state-of-the-art performance on five datasets and showcases various practical advantages.
arXiv Detail & Related papers (2024-02-27T09:10:41Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - M$^3$Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition [80.21796574234287]
M$^3$Net is a matching-based framework for few-shot fine-grained (FS-FG) action recognition.
It incorporates multi-view encoding, multi-view matching, and multi-view fusion to facilitate embedding encoding, similarity matching, and decision making.
Explainable visualizations and experimental results demonstrate the superiority of M$3$Net in capturing fine-grained action details.
arXiv Detail & Related papers (2023-08-06T09:15:14Z) - Multiple Interest and Fine Granularity Network for User Modeling [3.508126539399186]
User modeling plays a fundamental role in industrial recommender systems, in both the matching stage and the ranking stage, in terms of both customer experience and business revenue.
Most existing deep-learning-based approaches exploit item IDs and category IDs but neglect fine-grained features like color and material, which hinders modeling the fine granularity of users' interests.
We present the Multiple interest and Fine granularity Network (MFN), which tackles users' multiple and fine-grained interests and constructs the model from both the similarity relationship and the combination relationship among those interests.
arXiv Detail & Related papers (2021-12-05T15:12:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.