Visual Product Graph: Bridging Visual Products And Composite Images For End-to-End Style Recommendations
- URL: http://arxiv.org/abs/2505.21454v1
- Date: Tue, 27 May 2025 17:26:55 GMT
- Title: Visual Product Graph: Bridging Visual Products And Composite Images For End-to-End Style Recommendations
- Authors: Yue Li Du, Ben Alexander, Mikhail Antonenka, Rohan Mahadev, Hao-yu Wu, Dmitry Kislyuk
- Abstract summary: Visual Product Graph (VPG) is an online real-time retrieval system that enables navigation from individual products to composite scenes containing those products, along with complementary recommendations. Our system achieves a 78.8% extremely similar@1 in end-to-end human relevance evaluations, and a 6% module engagement rate. The "Ways to Style It" module, powered by the Visual Product Graph technology, is deployed in production at Pinterest.
- Score: 1.130790932059036
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Retrieving semantically similar but visually distinct contents has been a critical capability in visual search systems. In this work, we aim to tackle this problem with Visual Product Graph (VPG), leveraging high-performance infrastructure for storage and state-of-the-art computer vision models for image understanding. VPG is built to be an online real-time retrieval system that enables navigation from individual products to composite scenes containing those products, along with complementary recommendations. Our system not only offers contextual insights by showcasing how products can be styled in a context, but also provides recommendations for complementary products drawn from these inspirations. We discuss the essential components for building the Visual Product Graph, along with the core computer vision model improvements across object detection, foundational visual embeddings, and other visual signals. Our system achieves a 78.8% extremely similar@1 in end-to-end human relevance evaluations, and a 6% module engagement rate. The "Ways to Style It" module, powered by the Visual Product Graph technology, is deployed in production at Pinterest.
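The navigation the abstract describes can be pictured as a nearest-neighbor search from a product embedding into object embeddings extracted from composite scenes, followed by a hop to the other products in the matched scenes. The sketch below is a toy in-memory illustration of that idea only; the class and method names are hypothetical, and the paper's actual detectors, embedding models, and serving infrastructure are not reproduced here.

```python
import numpy as np

class VisualProductGraphToy:
    """Toy stand-in for the product -> scene -> complementary-product hop."""

    def __init__(self):
        self.object_vecs = []   # one embedding per object detected in a scene
        self.object_meta = []   # (scene_id, product_id) for each object

    def index_scene(self, scene_id, detected_objects):
        """detected_objects: iterable of (product_id, embedding) pairs,
        e.g. produced by an object detector plus a visual embedding model."""
        for product_id, vec in detected_objects:
            v = np.asarray(vec, dtype=np.float32)
            self.object_vecs.append(v / np.linalg.norm(v))
            self.object_meta.append((scene_id, product_id))

    def ways_to_style_it(self, query_vec, k=5):
        """Return scenes containing objects similar to the query product,
        plus the other products in those scenes as complementary picks."""
        q = np.asarray(query_vec, dtype=np.float32)
        q = q / np.linalg.norm(q)
        sims = np.stack(self.object_vecs) @ q           # cosine similarities
        top = set(np.argsort(-sims)[:k].tolist())
        scenes = {self.object_meta[i][0] for i in top}
        complements = {
            pid for i, (sid, pid) in enumerate(self.object_meta)
            if sid in scenes and i not in top
        }
        return scenes, complements

# Example: index two scenes, then query with a (random stand-in) embedding.
rng = np.random.default_rng(0)
vpg = VisualProductGraphToy()
vpg.index_scene("scene_1", [("sofa_a", rng.normal(size=64)), ("lamp_b", rng.normal(size=64))])
vpg.index_scene("scene_2", [("rug_c", rng.normal(size=64)), ("sofa_d", rng.normal(size=64))])
print(vpg.ways_to_style_it(rng.normal(size=64), k=2))
```

A production system would replace the brute-force dot product with an approximate nearest-neighbor index and apply the relevance thresholds the paper evaluates, but the structure of the lookup is the same.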
Related papers
- LaViC: Adapting Large Vision-Language Models to Visually-Aware Conversational Recommendation [24.215914514990004]
LaViC integrates compact image representations into dialogue-based recommendation systems. We construct a new dataset by aligning Reddit conversations with Amazon product listings. LaViC significantly outperforms text-only conversational recommendation methods and open-source vision-language baselines.
arXiv Detail & Related papers (2025-03-30T04:44:13Z)
- Piece it Together: Part-Based Concepting with IP-Priors [52.01640707131325]
We introduce a generative framework that seamlessly integrates a partial set of user-provided visual components into a coherent composition. Our approach builds on a strong and underexplored representation space, extracted from IP-Adapter+. We also present a LoRA-based fine-tuning strategy that significantly improves prompt adherence in IP-Adapter+ for a given task.
arXiv Detail & Related papers (2025-03-13T13:46:10Z)
- Multi-modal Generative Models in Recommendation System [34.45328907249946]
Many recommendation systems limit user inputs to text strings or behavior signals such as clicks and purchases.
With the advent of generative AI, users have come to expect richer levels of interactions.
We argue that future recommendation systems will benefit from a multi-modal understanding of the products.
arXiv Detail & Related papers (2024-09-17T08:55:50Z)
- Instruction Tuning-free Visual Token Complement for Multimodal LLMs [51.138806401996696]
Multimodal large language models (MLLMs) have promised an elegant bridge between vision and language.
We propose a Visual Token Complement framework (VTC) that helps MLLMs regain the missing visual features.
Our VTC integrates text-to-image generation as a guide to identifying the text-irrelevant features, and a visual selector is then developed to generate complementary visual tokens.
arXiv Detail & Related papers (2024-08-09T12:13:01Z)
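As a rough illustration of the selection step in the entry above, the sketch below assumes "text-irrelevant" features can be found by comparing the original image's patch embeddings against those of an image regenerated from its caption; the function name, tensor shapes, and scoring rule are guesses for illustration, not the paper's actual VTC design.

```python
import torch

def complement_tokens(orig_feats, regen_feats, k=8):
    """orig_feats, regen_feats: (num_patches, dim) patch embeddings of the
    original image and of an image regenerated from its text caption."""
    orig = torch.nn.functional.normalize(orig_feats, dim=-1)
    regen = torch.nn.functional.normalize(regen_feats, dim=-1)
    # For each original patch: how well is it covered by any regenerated patch?
    coverage = (orig @ regen.T).max(dim=-1).values      # (num_patches,)
    # Low coverage ~ information the text pathway failed to carry; keep those
    # patches as complementary visual tokens for the MLLM.
    idx = torch.topk(-coverage, k).indices
    return orig_feats[idx]                              # (k, dim)

# Stand-in features; a real pipeline would use a vision encoder and a
# text-to-image model to produce them.
extra_tokens = complement_tokens(torch.randn(196, 768), torch.randn(196, 768))
```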
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We present the Draw-and-Understand framework, exploring how to integrate visual prompting understanding capabilities into Multimodal Large Language Models (MLLMs). Visual prompts allow users to interact through multi-modal instructions, enhancing the models' interactivity and fine-grained image comprehension. In this framework, we propose a general architecture adaptable to different pre-trained MLLMs, enabling it to recognize various types of visual prompts.
arXiv Detail & Related papers (2024-03-29T16:26:20Z)
- Efficient Large-Scale Visual Representation Learning And Evaluation [0.13192560874022083]
We describe challenges in e-commerce vision applications at scale and highlight methods to efficiently train, evaluate, and serve visual representations.
We present ablation studies evaluating visual representations in several downstream tasks.
We include online results from deployed machine learning systems in production on a large-scale e-commerce platform.
arXiv Detail & Related papers (2023-05-22T18:25:03Z)
- Unified Vision-Language Representation Modeling for E-Commerce Same-Style Products Retrieval [12.588713044749177]
Same-style products retrieval plays an important role in e-commerce platforms.
We propose a unified vision-language modeling method for e-commerce same-style products retrieval.
It is capable of cross-modal product-to-product retrieval, as well as style transfer and user-interactive search.
arXiv Detail & Related papers (2023-02-10T07:24:23Z)
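Cross-modal product-to-product retrieval, as in the entry above, generally reduces to querying one shared embedding space from either modality. The sketch below shows only that general recipe with hypothetical stand-in encodings; it is not the specific model from this paper.

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Catalog of product image embeddings, assumed already aligned with the
# text encoder's output space by vision-language pretraining.
catalog = normalize(np.random.randn(500, 128).astype(np.float32))

def retrieve(query_vec, k=5):
    """query_vec can come from the image encoder or the text encoder;
    both map into the same space, which is what makes it cross-modal."""
    scores = catalog @ normalize(query_vec)             # cosine similarity
    return np.argsort(-scores)[:k]

top5 = retrieve(np.random.randn(128).astype(np.float32))
```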
- ItemSage: Learning Product Embeddings for Shopping Recommendations at Pinterest [60.841761065439414]
At Pinterest, we build a single set of product embeddings called ItemSage to provide relevant recommendations in all shopping use cases.
This approach has led to significant improvements in engagement and conversion metrics, while reducing both infrastructure and maintenance costs.
arXiv Detail & Related papers (2022-05-24T02:28:58Z)
- SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning [61.57887011165744]
Multimodal Transformers have made great progress in the task of Visual Commonsense Reasoning.
We propose a Scene Graph Enhanced Image-Text Learning framework to incorporate visual scene graphs in commonsense reasoning.
arXiv Detail & Related papers (2021-12-16T03:16:30Z)
- Exploiting Latent Codes: Interactive Fashion Product Generation, Similar Image Retrieval, and Cross-Category Recommendation using Variational Autoencoders [0.0]
The author proposes using a Variational Autoencoder (VAE) to build an interactive fashion product application framework.
This pipeline is applicable in the booming e-commerce industry, enabling direct user interaction in specifying desired products.
arXiv Detail & Related papers (2020-09-02T13:27:30Z)
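The latent-code operations this entry alludes to are easy to picture: nearest neighbours in the VAE's latent space give similar-item retrieval, and interpolating between two codes through the decoder yields new candidate designs. The snippet below sketches just those two operations under that assumption; the catalog, latent dimensionality, and any decoder are hypothetical placeholders.

```python
import numpy as np

# Hypothetical catalog: each row is the VAE latent mean of one product image.
latent_codes = np.random.randn(1000, 32).astype(np.float32)

def similar_items(query_code, k=5):
    """Similar-image retrieval as nearest neighbours in latent space."""
    dists = np.linalg.norm(latent_codes - query_code, axis=1)
    return np.argsort(dists)[:k]

def interpolate(code_a, code_b, steps=5):
    """Blend two products; each returned row would be fed to the VAE decoder
    to render an in-between design."""
    ts = np.linspace(0.0, 1.0, steps)[:, None]
    return (1 - ts) * code_a + ts * code_b
```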
- Comprehensive Information Integration Modeling Framework for Video Titling [124.11296128308396]
We integrate comprehensive sources of information, including the content of consumer-generated videos, the narrative comment sentences supplied by consumers, and the product attributes, in an end-to-end modeling framework.
The proposed method consists of two processes: granular-level interaction modeling and abstraction-level story-line summarization.
We collect a large-scale dataset accordingly from real-world data in Taobao, a world-leading e-commerce platform.
arXiv Detail & Related papers (2020-06-24T10:38:15Z)
- Shop The Look: Building a Large Scale Visual Shopping System at Pinterest [16.132346347702075]
Shop The Look is an online shopping discovery service at Pinterest, leveraging visual search to enable users to find and buy products within an image.
We discuss topics including core technology across object detection and visual embeddings, serving infrastructure for real-time inference, and data labeling methodology for training/evaluation data collection and human evaluation.
The user-facing impacts of our system design choices are measured through offline evaluations, human relevance judgements, and online A/B experiments.
arXiv Detail & Related papers (2020-06-18T21:38:07Z)