Improving Visual Recommendation on E-commerce Platforms Using Vision-Language Models
- URL: http://arxiv.org/abs/2510.13359v1
- Date: Wed, 15 Oct 2025 09:46:27 GMT
- Title: Improving Visual Recommendation on E-commerce Platforms Using Vision-Language Models
- Authors: Yuki Yada, Sho Akiyama, Ryo Watanabe, Yuta Ueno, Yusuke Shido, Andre Rusli,
- Abstract summary: This study presents the application of a vision-language model (VLM) to product recommendations on Mercari, a major consumer-to-consumer marketplace in Japan. We fine-tuned SigLIP, a VLM employing a sigmoid-based contrastive loss, and developed an image encoder for generating item embeddings used in the recommendation system.
- Score: 0.16419687521433918
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: On large-scale e-commerce platforms with tens of millions of active monthly users, recommending visually similar products is essential for enabling users to efficiently discover items that align with their preferences. This study presents the application of a vision-language model (VLM) -- which has demonstrated strong performance in image recognition and image-text retrieval tasks -- to product recommendations on Mercari, a major consumer-to-consumer marketplace used by more than 20 million monthly users in Japan. Specifically, we fine-tuned SigLIP, a VLM employing a sigmoid-based contrastive loss, using one million product image-title pairs from Mercari collected over a three-month period, and developed an image encoder for generating item embeddings used in the recommendation system. Our evaluation comprised an offline analysis of historical interaction logs and an online A/B test in a production environment. In offline analysis, the model achieved a 9.1% improvement in nDCG@5 compared with the baseline. In the online A/B test, the click-through rate improved by 50% and the conversion rate improved by 14% compared with the existing model. These results demonstrate the effectiveness of VLM-based encoders for e-commerce product recommendations and provide practical insights into the development of visual similarity-based recommendation systems.
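The offline result above is reported in nDCG@5. As a rough illustration of what that metric measures, here is a minimal Python sketch of nDCG@k with linear gain. The abstract does not say how relevance was derived from the interaction logs, so the binary labels and the function names `dcg_at_k` and `ndcg_at_k` are assumptions for illustration only, not the authors' evaluation code.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k ranked positions (linear gain)."""
    # Position 0 is discounted by log2(2) = 1, position 1 by log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 5) -> float:
    """DCG@k divided by the DCG@k of the ideal (descending-relevance) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical ranked list: 1 marks a recommended item the user interacted with.
ranked_relevance = [0, 1, 0, 0, 1, 0]
print(ndcg_at_k(ranked_relevance, k=5))
```

A score of 1.0 means the relevant items already occupy the top positions; placing them lower in the top five reduces the score through the logarithmic discount.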
Related papers
- Zero-Shot Retrieval for Scalable Visual Search in a Two-Sided Marketplace [0.0]
This paper presents a scalable visual search system deployed in Mercari's C2C marketplace.
We evaluate recent vision-language models for zero-shot image retrieval and compare their performance with an existing fine-tuned baseline.
arXiv Detail & Related papers (2025-07-31T05:13:20Z)
- LLM-Enhanced Reranking for Complementary Product Recommendation [1.7149913637404794]
This paper introduces a model-agnostic approach that leverages Large Language Models (LLMs) to enhance the reranking of complementary product recommendations.
We demonstrate that our approach effectively balances accuracy and diversity in complementary product recommendations, with at least 50% lift in accuracy metrics and 2% lift in diversity metrics on average for the top recommended items across datasets.
arXiv Detail & Related papers (2025-07-22T05:15:45Z)
- Research on E-Commerce Long-Tail Product Recommendation Mechanism Based on Large-Scale Language Models [7.792622257477251]
We propose a novel long-tail product recommendation mechanism that integrates product text descriptions and user behavior sequences using a large-scale language model (LLM).
Our work highlights the potential of LLMs in interpreting product content and user intent, offering a promising direction for future e-commerce recommendation systems.
arXiv Detail & Related papers (2025-05-31T19:17:48Z)
- On the Role of Feedback in Test-Time Scaling of Agentic AI Workflows [71.92083784393418]
Agentic AI workflows (systems that autonomously plan and act) are becoming widespread, yet their success rate on complex tasks remains low.
Inference-time alignment relies on three components: sampling, evaluation, and feedback.
We introduce Iterative Agent Decoding (IAD), a procedure that repeatedly inserts feedback extracted from different forms of critiques.
arXiv Detail & Related papers (2025-04-02T17:40:47Z)
- CTR-Driven Advertising Image Generation with Multimodal Large Language Models [53.40005544344148]
We explore the use of Multimodal Large Language Models (MLLMs) for generating advertising images by optimizing for Click-Through Rate (CTR) as the primary objective.
To further improve the CTR of generated images, we propose a novel reward model to fine-tune pre-trained MLLMs through Reinforcement Learning (RL).
Our method achieves state-of-the-art performance in both online and offline metrics.
arXiv Detail & Related papers (2025-02-05T09:06:02Z)
- Direct Judgement Preference Optimization [79.54459973726405]
We train large language models (LLMs) as generative judges to evaluate and critique other models' outputs.
We employ three approaches to collect the preference pairs for different use cases, each aimed at improving our generative judge from a different perspective.
Our model robustly counters inherent biases such as position and length bias, flexibly adapts to any evaluation protocol specified by practitioners, and provides helpful language feedback for improving downstream generator models.
arXiv Detail & Related papers (2024-09-23T02:08:20Z)
- MIND: Multimodal Shopping Intention Distillation from Large Vision-language Models for E-commerce Purchase Understanding [67.26334044239161]
MIND is a framework that infers purchase intentions from multimodal product metadata and prioritizes human-centric ones.
Using Amazon Review data, we create a multimodal intention knowledge base, which contains 1,264,441 intentions.
Our obtained intentions significantly enhance large language models in two intention comprehension tasks.
arXiv Detail & Related papers (2024-06-15T17:56:09Z)
- Silkie: Preference Distillation for Large Visual Language Models [56.10697821410489]
This paper explores preference distillation for large vision language models (LVLMs).
We first build a vision-language feedback dataset utilizing AI annotation.
We adopt GPT-4V to assess the generated outputs regarding helpfulness, visual faithfulness, and ethical considerations.
The resulting model, Silkie, achieves 6.9% and 9.5% relative improvements on the MME benchmark in perception and cognition capabilities, respectively.
arXiv Detail & Related papers (2023-12-17T09:44:27Z)
- Unlocking the Potential of User Feedback: Leveraging Large Language Model as User Simulator to Enhance Dialogue System [65.93577256431125]
We propose an alternative approach called User-Guided Response Optimization (UGRO) that combines an LLM with a smaller task-oriented dialogue (TOD) model.
This approach uses an LLM as an annotation-free user simulator to assess dialogue responses, combining them with smaller fine-tuned end-to-end TOD models.
Our approach outperforms previous state-of-the-art (SOTA) results.
arXiv Detail & Related papers (2023-06-16T13:04:56Z)
- ItemSage: Learning Product Embeddings for Shopping Recommendations at Pinterest [60.841761065439414]
At Pinterest, we build a single set of product embeddings called ItemSage to provide relevant recommendations in all shopping use cases.
This approach has led to significant improvements in engagement and conversion metrics, while reducing both infrastructure and maintenance cost.
arXiv Detail & Related papers (2022-05-24T02:28:58Z)
- Personalized Embedding-based e-Commerce Recommendations at eBay [3.1236273633321416]
We present an approach for generating personalized item recommendations in an e-commerce marketplace by learning to embed items and users in the same vector space.
Data ablation is incorporated into the offline model training process to improve the robustness of the production system.
arXiv Detail & Related papers (2021-02-11T17:58:51Z)