Related papers: UniECS: Unified Multimodal E-Commerce Search Framework with Gated Cross-modal Fusion

URL: http://arxiv.org/abs/2508.13843v1
Date: Tue, 19 Aug 2025 14:06:13 GMT
Title: UniECS: Unified Multimodal E-Commerce Search Framework with Gated Cross-modal Fusion
Authors: Zihan Liang, Yufei Ma, ZhiPeng Qian, Huangyu Dai, Zihan Wang, Ben Chen, Chenyi Lei, Yuqing Ding, Han Li
Abstract summary: Current e-commerce multimodal retrieval systems face two key limitations. They optimize for specific tasks with fixed modality pairings, and lack comprehensive benchmarks for evaluating unified retrieval approaches. We introduce UniECS, a unified multimodal e-commerce search framework that handles all retrieval scenarios across image, text, and their combinations.
Score: 20.13803245640432
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Current e-commerce multimodal retrieval systems face two key limitations: they optimize for specific tasks with fixed modality pairings, and lack comprehensive benchmarks for evaluating unified retrieval approaches. To address these challenges, we introduce UniECS, a unified multimodal e-commerce search framework that handles all retrieval scenarios across image, text, and their combinations. Our work makes three key contributions. First, we propose a flexible architecture with a novel gated multimodal encoder that uses adaptive fusion mechanisms. This encoder integrates different modality representations while handling missing modalities. Second, we develop a comprehensive training strategy to optimize learning. It combines cross-modal alignment loss (CMAL), cohesive local alignment loss (CLAL), intra-modal contrastive loss (IMCL), and adaptive loss weighting. Third, we create M-BEER, a carefully curated multimodal benchmark containing 50K product pairs for e-commerce search evaluation. Extensive experiments demonstrate that UniECS consistently outperforms existing methods across four e-commerce benchmarks with fine-tuning or zero-shot evaluation. On our M-BEER benchmark, UniECS achieves substantial improvements in cross-modal tasks (up to 28% gain in R@10 for text-to-image retrieval) while maintaining parameter efficiency (0.2B parameters) compared to larger models like GME-Qwen2VL (2B) and MM-Embed (8B). Furthermore, we deploy UniECS in the e-commerce search platform of Kuaishou Inc. across two search scenarios, achieving notable improvements in Click-Through Rate (+2.74%) and Revenue (+8.33%). The comprehensive evaluation demonstrates the effectiveness of our approach in both experimental and real-world settings. The corresponding code, models, and datasets will be made publicly available at https://github.com/qzp2018/UniECS.
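As a rough illustration of the ideas named in the abstract, the sketch below shows one generic way to fuse image and text embeddings with a sigmoid gate, fall back to the available modality when one is missing, and score retrieval with Recall@K. It is not the UniECS architecture: the gate form, the function names (`gated_fusion`, `recall_at_k`), and the parameters `w_gate` and `b_gate` are all assumptions made for this example.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def gated_fusion(img, txt, w_gate, b_gate):
    """Fuse two d-dimensional embeddings with an element-wise sigmoid gate.

    img, txt: arrays of shape (d,), or None when that modality is missing.
    w_gate (shape (2d, d)) and b_gate (shape (d,)) are hypothetical gate
    parameters; in a real model they would be learned.
    """
    if img is None and txt is None:
        raise ValueError("at least one modality is required")
    # Missing-modality fallback: use the modality that is present.
    if img is None:
        return l2_normalize(txt)
    if txt is None:
        return l2_normalize(img)
    # The gate looks at both modalities and decides, per dimension,
    # how much weight the image embedding gets versus the text embedding.
    logits = np.concatenate([img, txt]) @ w_gate + b_gate
    gate = 1.0 / (1.0 + np.exp(-logits))
    return l2_normalize(gate * img + (1.0 - gate) * txt)


def recall_at_k(query_embs, item_embs, relevant_idx, k=10):
    """Fraction of queries whose relevant item appears in the top-k results.

    Scores are dot products, which equal cosine similarity when both
    sets of embeddings are L2-normalized.
    """
    scores = query_embs @ item_embs.T
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = [relevant_idx[i] in topk[i] for i in range(len(relevant_idx))]
    return float(np.mean(hits))
```

With all-zero gate parameters the gate is 0.5 everywhere, so the fused vector is the normalized average of the two inputs; learned parameters would let the model shift weight toward whichever modality is more informative for a given product.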

Related papers

OneMall: One Architecture, More Scenarios -- End-to-End Generative Recommender Family at Kuaishou E-Commerce [68.7552227901176]
OneMall is an end-to-end generative recommendation framework tailored for e-commerce services at Kuaishou. It unifies multiple e-commerce item distribution scenarios, such as product-card, short-video, and live-streaming. OneMall has been deployed, serving over 400 million daily active users at Kuaishou.
arXiv Detail & Related papers (2026-01-29T14:22:39Z)
Token-Level LLM Collaboration via FusionRoute [60.72307345997823]
FusionRoute is a token-level multi-LLM collaboration framework. It selects the most suitable expert at each decoding step and contributes a complementary logit that refines or corrects the selected expert's next-token distribution. It outperforms both sequence- and token-level collaboration, model merging, and direct fine-tuning.
arXiv Detail & Related papers (2026-01-08T16:53:16Z)
Optimizing Product Deduplication in E-Commerce with Multimodal Embeddings [0.13999481573773068]
We introduce a scalable, multimodal product deduplication approach designed specifically for the e-commerce domain. Our approach employs a domain-specific text model based on the BERT architecture in conjunction with MaskedAutoEncoders for image representations. By integrating these feature extraction mechanisms with Milvus, an optimized vector database, our system can facilitate efficient and high-precision similarity searches.
arXiv Detail & Related papers (2025-09-19T10:49:39Z)
OneSearch: A Preliminary Exploration of the Unified End-to-End Generative Framework for E-commerce Search [43.94443394870866]
OneSearch is the first industrially deployed end-to-end generative framework for e-commerce search. OneSearch reduces operational expenditure by 75.40% and improves Model FLOPs Utilization from 3.26% to 27.32%. The system has been successfully deployed across multiple search scenarios in Kuaishou, serving millions of users.
arXiv Detail & Related papers (2025-09-03T11:50:04Z)
CROSSAN: Towards Efficient and Effective Adaptation of Multiple Multimodal Foundation Models for Sequential Recommendation [6.013740443562439]
Multimodal Foundation Models (MFMs) excel at representing diverse raw modalities. MFMs' application in sequential recommendation remains largely unexplored. It remains unclear whether we can efficiently adapt multiple (>2) MFMs for the sequential recommendation task. We propose a plug-and-play Cross-modal Side Adapter Network (CROSSAN).
arXiv Detail & Related papers (2025-04-14T15:14:59Z)
On the Role of Feedback in Test-Time Scaling of Agentic AI Workflows [71.92083784393418]
Agentic AI systems (those that autonomously plan and act) are becoming widespread, yet their success rate on complex tasks remains low. Inference-time alignment relies on three components: sampling, evaluation, and feedback. We introduce Iterative Agent Decoding (IAD), a procedure that repeatedly inserts feedback extracted from different forms of critiques.
arXiv Detail & Related papers (2025-04-02T17:40:47Z)
Reinforced Model Merging [53.84354455400038]
We present an innovative framework termed Reinforced Model Merging (RMM), which encompasses an environment and agent tailored for merging tasks. By utilizing data subsets during the evaluation process, we address the bottleneck in the reward feedback phase, thereby accelerating RMM by up to 100 times.
arXiv Detail & Related papers (2025-03-27T08:52:41Z)
MRSE: An Efficient Multi-modality Retrieval System for Large Scale E-commerce [42.3177388371158]
Current Embedding-based Retrieval Systems embed queries and items into a shared low-dimensional space. We propose MRSE, a Multi-modality Retrieval System that integrates text, item images, and user preferences. MRSE achieves an 18.9% improvement in offline relevance and a 3.7% gain in online core metrics compared to Shopee's state-of-the-art uni-modality system.
arXiv Detail & Related papers (2024-08-27T11:21:19Z)
An Empirical Study of Multimodal Model Merging [148.48412442848795]
Model merging is a technique that fuses multiple models trained on different tasks to generate a multi-task solution. We conduct our study for a novel goal where we can merge vision, language, and cross-modal transformers of a modality-specific architecture. We propose two metrics that assess the distance between weights to be merged and can serve as an indicator of the merging outcomes.
arXiv Detail & Related papers (2023-04-28T15:43:21Z)
VERITE: A Robust Benchmark for Multimodal Misinformation Detection Accounting for Unimodal Bias [17.107961913114778]
Multimodal misinformation is a growing problem on social media platforms. In this study, we investigate and identify the presence of unimodal bias in widely used multimodal misinformation detection (MMD) benchmarks. We introduce a new method, termed Crossmodal HArd Synthetic MisAlignment (CHASMA), for generating realistic synthetic training data.
arXiv Detail & Related papers (2023-04-27T12:28:29Z)
Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product Retrieval [152.3504607706575]
This research aims to conduct weakly-supervised multi-modal instance-level product retrieval for fine-grained product categories. We first contribute the Product1M datasets and define two practical instance-level retrieval tasks. We train a more effective cross-modal model that can adaptively incorporate key concept information from the multi-modal data.
arXiv Detail & Related papers (2022-06-17T15:40:45Z)
Learning Similarity Preserving Binary Codes for Recommender Systems [5.799838997511804]
We study an unexplored module combination for hashing-based recommender systems, namely Compact Cross-Similarity Recommender (CCSR). Inspired by cross-modal retrieval, CCSR utilizes a Posteriori similarity instead of matrix factorization and rating reconstruction to model interactions between users and items. On the MovieLens1M dataset, the absolute performance improvements are up to 15.69% in NDCG and 4.29% in Recall.
arXiv Detail & Related papers (2022-04-18T21:33:59Z)
Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for Improved Cross-Modal Retrieval [80.35589927511667]
Current state-of-the-art approaches to cross-modal retrieval process text and visual input jointly, relying on Transformer-based architectures with cross-attention mechanisms that attend over all words and objects in an image. We propose a novel fine-tuning framework which turns any pretrained text-image multi-modal model into an efficient retrieval model. Our experiments on a series of standard cross-modal retrieval benchmarks in monolingual, multilingual, and zero-shot setups demonstrate improved accuracy and huge efficiency benefits over the state-of-the-art cross-encoders.
arXiv Detail & Related papers (2021-03-22T15:08:06Z)

This list is automatically generated from the titles and abstracts of the papers on this site.