Related papers: MUSE: A Simple Yet Effective Multimodal Search-Based Framework for Lifelong User Interest Modeling

URL: http://arxiv.org/abs/2512.07216v1
Date: Mon, 08 Dec 2025 06:55:13 GMT
Title: MUSE: A Simple Yet Effective Multimodal Search-Based Framework for Lifelong User Interest Modeling
Authors: Bin Wu, Feifan Yang, Zhangming Chan, Yu-Ran Gu, Jiawei Feng, Chao Yi, Xiang-Rong Sheng, Han Zhu, Jian Xu, Mang Ye, Bo Zheng
Abstract summary: We present a systematic analysis of how to leverage multimodal signals across both stages of the lifelong modeling framework. We propose MUSE, a simple yet effective multimodal search-based framework. MUSE has been deployed in the Taobao display advertising system, enabling 100K-length user behavior sequence modeling.
Score: 48.18456242206804
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Lifelong user interest modeling is crucial for industrial recommender systems, yet existing approaches rely predominantly on ID-based features, suffering from poor generalization on long-tail items and limited semantic expressiveness. While recent work explores multimodal representations for behavior retrieval in the General Search Unit (GSU), it often neglects multimodal integration in the fine-grained modeling stage, the Exact Search Unit (ESU). In this work, we present a systematic analysis of how to effectively leverage multimodal signals across both stages of the two-stage lifelong modeling framework. Our key insight is that simplicity suffices in the GSU: lightweight cosine similarity with high-quality multimodal embeddings outperforms complex retrieval mechanisms. In contrast, the ESU demands richer multimodal sequence modeling and effective ID-multimodal fusion to unlock its full potential. Guided by these principles, we propose MUSE, a simple yet effective multimodal search-based framework. MUSE has been deployed in the Taobao display advertising system, enabling 100K-length user behavior sequence modeling and delivering significant gains in top-line metrics with negligible online latency overhead. To foster community research, we share industrial deployment practices and open-source the first large-scale dataset featuring ultra-long behavior sequences paired with high-quality multimodal embeddings. Our code and data are available at https://taobao-mm.github.io.
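Illustration: the abstract's GSU insight (plain cosine similarity over multimodal embeddings to pick the relevant behaviors) can be sketched roughly as below. This is a minimal sketch and not the authors' implementation: the function names `cosine_similarity` and `gsu_top_k`, the list-of-floats embedding format, and the top-k cutoff are all assumptions made for illustration, and the ESU stage is not modeled at all.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def gsu_top_k(
    target_emb: Sequence[float],
    behavior_embs: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Hypothetical GSU-style retrieval: score every past behavior against the
    target item by cosine similarity and keep the k highest-scoring ones.

    Returns (index into behavior_embs, similarity) pairs, best first.
    """
    scored = [
        (i, cosine_similarity(target_emb, emb))
        for i, emb in enumerate(behavior_embs)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy 2-d "multimodal embeddings" for four past behaviors (made-up values).
    history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
    target = [1.0, 0.0]
    for index, score in gsu_top_k(target, history, k=2):
        print(index, round(score, 3))
```

In a two-stage setup like the one the abstract describes, the retrieved subset would then be handed to a heavier sequence model (the ESU); a production system over 100K-length sequences would presumably use batched or approximate search rather than this linear scan.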

Related papers

VSearcher: Long-Horizon Multimodal Search Agent via Reinforcement Learning [22.27364585438247]
VSearcher is a multimodal search agent capable of long-horizon, multi-turn tool use in real-world web environments. We introduce the Iterative Injection Data Synthesis pipeline to generate large-scale, complex multimodal QA questions. We then adopt an SFT-then-RL training pipeline to turn base multimodal models into agents capable of multi-turn tool calling in real-world web environments.
arXiv Detail & Related papers (2026-03-03T09:33:22Z)
Magic-MM-Embedding: Towards Visual-Token-Efficient Universal Multimodal Embedding with MLLMs [10.443777669301983]
Multimodal Large Language Models (MLLMs) have shown immense promise in universal multimodal retrieval. But their practical application is often hindered by the substantial computational cost incurred from processing a large number of tokens from visual inputs. We propose Magic-MM-Embedding, a series of novel models that achieve both high efficiency and state-of-the-art performance in universal multimodal embedding.
arXiv Detail & Related papers (2026-02-05T04:01:01Z)
MISS: Multi-Modal Tree Indexing and Searching with Lifelong Sequential Behavior for Retrieval Recommendation [14.110932722143643]
Large-scale industrial recommendation systems typically employ a two-stage paradigm of retrieval and ranking. We propose Multi-modal Indexing and Searching with lifelong Sequence (MISS), which contains a multi-modal index tree and a multi-modal lifelong sequence modeling module.
arXiv Detail & Related papers (2025-08-20T08:22:02Z)
MIND: Modality-Informed Knowledge Distillation Framework for Multimodal Clinical Prediction Tasks [50.98856172702256]
We propose the Modality-INformed knowledge Distillation (MIND) framework, a multimodal model compression approach. MIND transfers knowledge from ensembles of pre-trained deep neural networks of varying sizes into a smaller multimodal student. We evaluate MIND on binary and multilabel clinical prediction tasks using time series data and chest X-ray images.
arXiv Detail & Related papers (2025-02-03T08:50:00Z)
M3-JEPA: Multimodal Alignment via Multi-gate MoE based on the Joint-Embedding Predictive Architecture [6.928469290518152]
We introduce the Joint-Embedding Predictive Architecture (JEPA) on the multimodal tasks. It converts the input embedding into the output embedding space by a predictor and then conducts the cross-modal alignment on the latent space. We show that M3-JEPA can obtain state-of-the-art performance on different modalities and tasks, generalize to unseen datasets and domains, and is computationally efficient in both training and inference.
arXiv Detail & Related papers (2024-09-09T10:40:50Z)
Adapting Segment Anything Model to Multi-modal Salient Object Detection with Semantic Feature Fusion Guidance [15.435695491233982]
We propose a novel framework to explore and exploit the powerful feature representation and zero-shot generalization ability of the Segment Anything Model (SAM) for multi-modal salient object detection (SOD). We develop SAM with semantic feature fusion guidance (Sammese). In the image encoder, a multi-modal adapter is proposed to adapt the single-modal SAM to multi-modal information. Specifically, in the mask decoder, a semantic-geometric
arXiv Detail & Related papers (2024-08-27T13:47:31Z)
U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: An Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation. We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features. Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z)
Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts [54.529880848937104]
We develop a unified MLLM with the MoE architecture, named Uni-MoE, that can handle a wide array of modalities. Specifically, it features modality-specific encoders with connectors for a unified multimodal representation. We evaluate the instruction-tuned Uni-MoE on a comprehensive set of multimodal datasets.
arXiv Detail & Related papers (2024-05-18T12:16:01Z)
Generative Multimodal Models are In-Context Learners [60.50927925426832]
We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences. Emu2 exhibits strong multimodal in-context learning abilities, even emerging to solve tasks that require on-the-fly reasoning.
arXiv Detail & Related papers (2023-12-20T18:59:58Z)
Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding. We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL. UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
Multimodal Channel-Mixing: Channel and Spatial Masked AutoEncoder on Facial Action Unit Detection [12.509298933267225]
This paper presents a novel multi-modal reconstruction network, named Multimodal Channel-Mixing (MCM) as a pre-trained model to learn robust representation for facilitating multi-modal fusion. The approach follows an early fusion setup, integrating a Channel-Mixing module, where two out of five channels are randomly dropped. This module not only reduces channel redundancy, but also facilitates multi-modal learning and reconstruction capabilities, resulting in robust feature learning.
arXiv Detail & Related papers (2022-09-25T15:18:56Z)

This list is automatically generated from the titles and abstracts of the papers on this site.