Factorized Transport Alignment for Multimodal and Multiview E-commerce Representation Learning
- URL: http://arxiv.org/abs/2512.18117v1
- Date: Fri, 19 Dec 2025 22:50:49 GMT
- Title: Factorized Transport Alignment for Multimodal and Multiview E-commerce Representation Learning
- Authors: Xiwen Chen, Yen-Chieh Lien, Susan Liu, María Castaños, Abolfazl Razi, Xiaoting Zhao, Congzhe Su
- Abstract summary: We propose a framework that unifies multimodal and multi-view learning through Factorized Transport embedding. During training, the method emphasizes primary views while sampling auxiliary ones, reducing training cost from quadratic in the number of views to constant per item.
- Score: 7.390207354371506
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid growth of e-commerce requires robust multimodal representations that capture diverse signals from user-generated listings. Existing vision-language models (VLMs) typically align titles with primary images, i.e., single-view, but overlook non-primary images and auxiliary textual views that provide critical semantics in open marketplaces such as Etsy or Poshmark. To this end, we propose a framework that unifies multimodal and multi-view learning through Factorized Transport, a lightweight approximation of optimal transport, designed for scalability and deployment efficiency. During training, the method emphasizes primary views while stochastically sampling auxiliary ones, reducing training cost from quadratic in the number of views to constant per item. At inference, all views are fused into a single cached embedding, preserving the efficiency of two-tower retrieval with no additional online overhead. On an industrial dataset of 1M product listings and 0.3M interactions, our approach delivers consistent improvements in cross-view and query-to-item retrieval, achieving up to +7.9% Recall@500 over strong multimodal baselines. Overall, our framework bridges scalability with optimal transport-based learning, making multi-view pretraining practical for large-scale e-commerce search.
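To make the training and inference recipe in the abstract concrete, here is a minimal sketch (not the authors' code): it samples a constant number K of auxiliary views per item, uses a temperature-scaled soft-assignment as a simple stand-in for the paper's Factorized Transport coupling, and mean-pools all views into one embedding that can be cached for two-tower retrieval. All function names, the value of K, and the softmax-based coupling are assumptions made for illustration only.
```python
# Hypothetical sketch of the abstract's recipe; not the paper's actual method.
import torch
import torch.nn.functional as F


def sample_views(aux_views: torch.Tensor, k: int) -> torch.Tensor:
    """Pick k auxiliary views per item so per-item training cost stays constant.

    aux_views: (B, V, D) embeddings of non-primary views (images or text).
    """
    B, V, _ = aux_views.shape
    idx = torch.stack([torch.randperm(V)[:k] for _ in range(B)])          # (B, k)
    return torch.gather(aux_views, 1,
                        idx.unsqueeze(-1).expand(-1, -1, aux_views.size(-1)))


def soft_transport_align(primary: torch.Tensor, sampled: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """Soft-assignment alignment between the primary view and sampled auxiliary
    views; a lightweight stand-in for the factorized transport coupling."""
    p = F.normalize(primary, dim=-1)                                       # (B, D)
    a = F.normalize(sampled, dim=-1)                                       # (B, k, D)
    sim = torch.einsum("bd,bkd->bk", p, a) / temperature                   # (B, k)
    coupling = sim.softmax(dim=-1)                                         # soft "transport plan"
    return -(coupling * sim).sum(dim=-1).mean()                            # reward mass on close views


def fuse_for_cache(primary: torch.Tensor, aux_views: torch.Tensor) -> torch.Tensor:
    """At inference, fuse all views into a single embedding that can be cached,
    keeping standard two-tower retrieval with no extra online overhead."""
    all_views = torch.cat([primary.unsqueeze(1), aux_views], dim=1)        # (B, 1+V, D)
    return F.normalize(all_views.mean(dim=1), dim=-1)                      # (B, D)


if __name__ == "__main__":
    B, V, D, K = 4, 6, 128, 2
    primary, aux = torch.randn(B, D), torch.randn(B, V, D)
    loss = soft_transport_align(primary, sample_views(aux, K))
    item_emb = fuse_for_cache(primary, aux)
    print(loss.item(), item_emb.shape)
```
Sampling only K auxiliary views per item is what keeps the per-item cost constant rather than quadratic in the number of views, and mean-pooling into a single vector is one plausible way to preserve the cached two-tower setup the abstract describes.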
Related papers
- RecGOAT: Graph Optimal Adaptive Transport for LLM-Enhanced Multimodal Recommendation with Dual Semantic Alignment [23.738860191046538]
We propose RecGOAT, a novel yet simple dual semantic alignment framework for multimodal recommendation. We show that RecGOAT achieves state-of-the-art performance, empirically validating our theoretical insights.
arXiv Detail & Related papers (2026-01-31T11:58:38Z)
- Turning Adversaries into Allies: Reversing Typographic Attacks for Multimodal E-Commerce Product Retrieval [2.0134842677651084]
Multimodal product retrieval systems in e-commerce platforms rely on effectively combining visual and textual signals to improve search relevance and user experience. We propose a novel method that reverses the logic of typographic attacks by rendering relevant textual content directly onto product images. We evaluate our method on three vertical-specific e-commerce datasets (sneakers, handbags, and trading cards) using six state-of-the-art vision foundation models.
arXiv Detail & Related papers (2025-11-07T15:24:18Z)
- OmniSegmentor: A Flexible Multi-Modal Learning Framework for Semantic Segmentation [74.55725909072903]
We propose a novel multi-modal learning framework, termed OmniSegmentor. Based on ImageNet, we assemble a large-scale dataset for multi-modal pretraining, called ImageNeXt. We introduce a universal multi-modal pretraining framework that consistently amplifies the model's perceptual capabilities across various scenarios.
arXiv Detail & Related papers (2025-09-18T15:52:44Z)
- TableDART: Dynamic Adaptive Multi-Modal Routing for Table Understanding [52.59372043981724]
TableDART is a training-efficient framework that integrates multimodal views by reusing pretrained single-modality models. In addition, we propose a novel agent for cross-modal knowledge integration that analyzes outputs from text- and image-based models.
arXiv Detail & Related papers (2025-09-18T07:00:13Z)
- VL-CLIP: Enhancing Multimodal Recommendations via Visual Grounding and LLM-Augmented CLIP Embeddings [11.209519424876762]
Multimodal learning plays a critical role in e-commerce recommendation platforms today. Existing vision-language models, such as CLIP, face key challenges in e-commerce recommendation systems. We propose a framework, VL-CLIP, that enhances CLIP embeddings by integrating Visual Grounding for fine-grained visual understanding.
arXiv Detail & Related papers (2025-07-22T23:45:43Z)
- NoteLLM-2: Multimodal Large Representation Models for Recommendation [71.87790090964734]
Large Language Models (LLMs) have demonstrated exceptional proficiency in text understanding and embedding tasks. Their potential in multimodal representation, particularly for item-to-item (I2I) recommendations, remains underexplored. We propose an end-to-end fine-tuning method that customizes the integration of any existing LLMs and vision encoders for efficient multimodal representation.
arXiv Detail & Related papers (2024-05-27T03:24:01Z)
- Federated Multi-View Synthesizing for Metaverse [52.59476179535153]
The metaverse is expected to provide immersive entertainment, education, and business applications.
Virtual reality (VR) transmission over wireless networks is data- and computation-intensive.
We have developed a novel multi-view synthesizing framework that can efficiently provide synthesis, storage, and communication resources for wireless content delivery in the metaverse.
arXiv Detail & Related papers (2023-12-18T13:51:56Z)
- ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training [29.240131406803794]
We show that a common space can be created without any training at all, using single-domain encoders and a much smaller number of image-text pairs.
Our model has unique properties; most notably, deploying a new version with updated training samples can be done in a matter of seconds.
arXiv Detail & Related papers (2022-10-04T16:56:22Z)
- Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product Retrieval [152.3504607706575]
This research aims to conduct weakly-supervised multi-modal instance-level product retrieval for fine-grained product categories.
We first contribute the Product1M dataset and define two real-world, practical instance-level retrieval tasks. We train a more effective cross-modal model that adaptively incorporates key concept information from the multi-modal data.
arXiv Detail & Related papers (2022-06-17T15:40:45Z)
- CommerceMM: Large-Scale Commerce MultiModal Representation Learning with Omni Retrieval [30.607369837039904]
CommerceMM is a multimodal model capable of providing a diverse and granular understanding of commerce topics associated with a piece of content.
We propose another 9 novel cross-modal and cross-pair retrieval tasks, called Omni-Retrieval pre-training.
Our model achieves state-of-the-art performance on 7 commerce-related downstream tasks after fine-tuning.
arXiv Detail & Related papers (2022-02-15T08:23:59Z)
- Product1M: Towards Weakly Supervised Instance-Level Product Retrieval via Cross-modal Pretraining [108.86502855439774]
We investigate a more realistic setting that aims to perform weakly-supervised multi-modal instance-level product retrieval.
We contribute Product1M, one of the largest multi-modal cosmetic datasets for real-world instance-level retrieval.
We propose a novel model named Cross-modal contrAstive Product Transformer for instance-level prodUct REtrieval (CAPTURE).
arXiv Detail & Related papers (2021-07-30T12:11:24Z)