Related papers: Adapting Vision-Language Models for E-commerce Understanding at Scale

URL: http://arxiv.org/abs/2602.11733v1
Date: Thu, 12 Feb 2026 08:59:22 GMT
Title: Adapting Vision-Language Models for E-commerce Understanding at Scale
Authors: Matteo Nulli, Vladimir Orshulevich, Tala Bazazo, Christian Herold, Michael Kozielski, Marcin Mazur, Szymon Tuzel, Cees G. M. Snoek, Seyyed Hadi Hashemi, Omar Javed, Yannick Versley, Shahram Khadivi
Abstract summary: General-purpose Vision-Language Models (VLMs) enable generalizable multimodal latent modelling. We show, through a large-scale experimental study, how targeted adaptation of general VLMs can substantially improve e-commerce performance. We propose a novel, extensive evaluation suite covering deep product understanding, strict instruction following, and dynamic attribute extraction.
Score: 36.93444961629752
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: E-commerce product understanding demands, by nature, strong multimodal comprehension from text, images, and structured attributes. General-purpose Vision-Language Models (VLMs) enable generalizable multimodal latent modelling, yet there is no documented, well-known strategy for adapting them to the attribute-centric, multi-image, and noisy nature of e-commerce data without sacrificing general performance. In this work, we show, through a large-scale experimental study, how targeted adaptation of general VLMs can substantially improve e-commerce performance while preserving broad multimodal capabilities. Furthermore, we propose a novel, extensive evaluation suite covering deep product understanding, strict instruction following, and dynamic attribute extraction.

Related papers

Beyond Language Modeling: An Exploration of Multimodal Pretraining [125.34714978184638]
We provide empirical clarity through controlled, from-scratch pretraining experiments. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision. We demonstrate that the MoE architecture harmonizes this scaling asymmetry by providing the high model capacity required by language.
arXiv Detail & Related papers (2026-03-03T18:58:00Z)
MOON2.0: Dynamic Modality-balanced Multimodal Representation Learning for E-commerce Product Understanding [11.989986738179427]
MOON2.0 is a dynamic modality-balanced representation learning framework for e-commerce product understanding. MoE module adaptively processes input samples by their modality composition, enabling Multimodal Joint Learning. MBE2.0 is a co-augmented multimodal representation benchmark for e-commerce representation learning and evaluation.
arXiv Detail & Related papers (2025-11-16T04:29:35Z)
Multi-Modal Interpretability for Enhanced Localization in Vision-Language Models [2.984679075401059]
This paper presents the Multi-Modal Explainable Learning framework, designed to enhance the interpretability of vision-language models. Our approach processes features at multiple semantic levels to capture relationships between image regions at different granularities. We show that by incorporating semantic relationship information into gradient-based attribution maps, MMEL produces more focused and contextually aware visualizations.
arXiv Detail & Related papers (2025-09-17T18:18:59Z)
EcomMMMU: Strategic Utilization of Visuals for Robust Multimodal E-Commerce Models [16.801877795951572]
E-commerce platforms are rich in multimodal data, featuring a variety of images that depict product details. This raises an important question: do these images always enhance product understanding, or can they sometimes introduce redundancy or degrade performance? We introduce EcomMMMU, an e-commerce multimodal multitask understanding dataset with 406,190 samples and 8,989,510 images.
arXiv Detail & Related papers (2025-08-21T17:01:12Z)
EMMA: Efficient Visual Alignment in Multi-Modal LLMs [56.03417732498859]
EMMA is a lightweight cross-modality module designed to efficiently fuse visual and textual encodings. EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations.
arXiv Detail & Related papers (2024-10-02T23:00:31Z)
Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual tokens, which maps the visual features to probability distributions over Large Multi-modal Models' vocabulary. We further explore the distribution of visual features in the semantic space within LMM and the possibility of using text embeddings to represent visual information.
arXiv Detail & Related papers (2024-03-12T14:58:52Z)
A Multimodal In-Context Tuning Approach for E-Commerce Product Description Generation [47.70824723223262]
We propose a new setting for generating product descriptions from images, augmented by marketing keywords. We present a simple and effective Multimodal In-Context Tuning approach, named ModICT, which introduces a similar product sample as the reference. Experiments demonstrate that ModICT significantly improves the accuracy (by up to 3.3% on Rouge-L) and diversity (by up to 9.4% on D-5) of generated results compared to conventional methods.
arXiv Detail & Related papers (2024-02-21T07:38:29Z)
MMAPS: End-to-End Multi-Grained Multi-Modal Attribute-Aware Product Summarization [93.5217515566437]
Multi-modal Product Summarization (MPS) aims to increase customers' desire to purchase by highlighting product characteristics. Existing MPS methods can produce promising results, but they still lack end-to-end product summarization. We propose an end-to-end multi-modal attribute-aware product summarization method (MMAPS) for generating high-quality product summaries in e-commerce.
arXiv Detail & Related papers (2023-08-22T11:00:09Z)
PUMGPT: A Large Vision-Language Model for Product Understanding [18.70740237744492]
PumGPT is the first e-commerce specialized LVLM designed for multimodal product understanding tasks. Our experiments show that PumGPT outperforms five other open-source LVLMs and GPT-4V in product understanding tasks.
arXiv Detail & Related papers (2023-08-18T14:01:37Z)
Learning Instance-Level Representation for Large-Scale Multi-Modal Pretraining in E-commerce [35.73830796500975]
We propose an instance-centric multi-modal pretraining paradigm called ECLIP in this work. To enable the model to focus on the desired product instance without reliance on expensive manual annotations, two specially configured pretext tasks are proposed. ECLIP surpasses existing methods by a large margin on a broad range of downstream tasks, demonstrating the strong transferability to real-world E-commerce applications.
arXiv Detail & Related papers (2023-04-06T04:14:41Z)

This list is automatically generated from the titles and abstracts of the papers on this site.