Cross-Domain Product Representation Learning for Rich-Content E-Commerce
- URL: http://arxiv.org/abs/2308.05550v1
- Date: Thu, 10 Aug 2023 13:06:05 GMT
- Title: Cross-Domain Product Representation Learning for Rich-Content E-Commerce
- Authors: Xuehan Bai, Yan Li, Yanhua Cheng, Wenjie Yang, Quan Chen, Han Li
- Abstract summary: This paper introduces a large-scale cRoss-dOmain Product rEcognition dataset, called ROPE.
ROPE covers a wide range of product categories and contains over 180,000 products, corresponding to millions of short videos and live streams.
It is the first dataset to cover product pages, short videos, and live streams simultaneously, providing the basis for establishing a unified product representation across different media domains.
- Score: 16.418118040661646
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The proliferation of short video and live-streaming platforms has
revolutionized how consumers engage in online shopping. Instead of browsing
product pages, consumers are now turning to rich-content e-commerce, where they
can purchase products through dynamic and interactive media like short videos
and live streams. This emerging form of online shopping has introduced
technical challenges, as products may be presented differently across various
media domains. Therefore, a unified product representation is essential for
achieving cross-domain product recognition to ensure an optimal user search
experience and effective product recommendations. Despite the urgent industrial
need for a unified cross-domain product representation, previous studies have
predominantly focused only on product pages without taking into account short
videos and live streams. To fill the gap in the rich-content e-commerce area,
in this paper, we introduce a large-scale cRoss-dOmain Product rEcognition
dataset, called ROPE. ROPE covers a wide range of product categories and
contains over 180,000 products, corresponding to millions of short videos and
live streams. It is the first dataset to cover product pages, short videos, and
live streams simultaneously, providing the basis for establishing a unified
product representation across different media domains. Furthermore, we propose
a Cross-dOmain Product rEpresentation framework, namely COPE, which unifies
product representations in different domains through multimodal learning
including text and vision. Extensive experiments on downstream tasks
demonstrate the effectiveness of COPE in learning a joint feature space for all
product domains.
 
      
Related papers
- CTR-Driven Advertising Image Generation with Multimodal Large Language Models [53.40005544344148]
 We explore the use of Multimodal Large Language Models (MLLMs) for generating advertising images by optimizing for Click-Through Rate (CTR) as the primary objective.
To further improve the CTR of generated images, we propose a novel reward model to fine-tune pre-trained MLLMs through Reinforcement Learning (RL).
Our method achieves state-of-the-art performance in both online and offline metrics.
 arXiv (2025-02-05T09:06:02Z)
- ASR-enhanced Multimodal Representation Learning for Cross-Domain Product Retrieval [28.13183873658186]
 E-commerce is increasingly multimedia-enriched, with products exhibited in a broad-domain manner as images, short videos, or live stream promotions.
Due to large intra-product variance and high inter-product similarity in the broad-domain scenario, a visual-only representation is inadequate.
We propose ASR-enhanced Multimodal Product Representation Learning (AMPere).
 arXiv (2024-08-06T06:24:10Z)
- Spatiotemporal Graph Guided Multi-modal Network for Livestreaming Product Retrieval [32.478352606125306]
 We propose a text-guided attention mechanism that leverages the spoken content of salespeople to guide the model to focus on the intended products.
A long-range temporal graph network is further designed to achieve both instance-level interaction and frame-level matching.
We demonstrate the superior performance of our proposed SGMN model, surpassing the state-of-the-art methods by a substantial margin.
 arXiv (2024-07-23T07:36:54Z)
- MMAPS: End-to-End Multi-Grained Multi-Modal Attribute-Aware Product Summarization [93.5217515566437]
 Multi-modal Product Summarization (MPS) aims to increase customers' desire to purchase by highlighting product characteristics.
Existing MPS methods can produce promising results, but they still lack end-to-end product summarization.
We propose an end-to-end multi-modal attribute-aware product summarization method (MMAPS) for generating high-quality product summaries in e-commerce.
 arXiv (2023-08-22T11:00:09Z)
- Cross-view Semantic Alignment for Livestreaming Product Recognition [24.38606354376169]
 We present LPR4M, a large-scale multimodal dataset that covers 34 categories.
LPR4M contains diverse videos and noise modality pairs while exhibiting a long-tailed distribution.
A novel Patch Feature Reconstruction loss is proposed to penalize the semantic misalignment between cross-view patches.
 arXiv (2023-08-09T12:23:41Z)
- Multi-queue Momentum Contrast for Microvideo-Product Retrieval [57.527227171945796]
 We formulate the microvideo-product retrieval task, which is the first attempt to explore the retrieval between the multi-modal and multi-modal instances.
A novel approach named Multi-Queue Momentum Contrast (MQMC) network is proposed for bidirectional retrieval.
A discriminative selection strategy with a multi-queue is used to distinguish the importance of different negatives based on their categories.
 arXiv (2022-12-22T03:47:14Z)
- e-CLIP: Large-Scale Vision-Language Representation Learning in E-commerce [9.46186546774799]
 We propose a contrastive learning framework that aligns language and visual models using unlabeled raw product text and images.
We present techniques we used to train large-scale representation learning models and share solutions that address domain-specific challenges.
 arXiv (2022-07-01T05:16:47Z)
- ItemSage: Learning Product Embeddings for Shopping Recommendations at Pinterest [60.841761065439414]
 At Pinterest, we build a single set of product embeddings called ItemSage to provide relevant recommendations in all shopping use cases.
This approach has led to significant improvements in engagement and conversion metrics, while reducing both infrastructure and maintenance cost.
 arXiv (2022-05-24T02:28:58Z)
- Product1M: Towards Weakly Supervised Instance-Level Product Retrieval via Cross-modal Pretraining [108.86502855439774]
 We investigate a more realistic setting that aims to perform weakly-supervised multi-modal instance-level product retrieval.
We contribute Product1M, one of the largest multi-modal cosmetic datasets for real-world instance-level retrieval.
We propose a novel model named Cross-modal contrAstive Product Transformer for instance-level prodUct REtrieval (CAPTURE).
 arXiv (2021-07-30T12:11:24Z)
- Fashion Focus: Multi-modal Retrieval System for Video Commodity Localization in E-commerce [18.651201334846352]
 We present a demonstration of a multi-modal retrieval system called "Fashion Focus".
It precisely localizes the product images in an online video as the focuses.
Our system employs two procedures for analysis, including video content structuring and multi-modal retrieval, to automatically achieve accurate video-to-shop matching.
 arXiv (2021-02-09T09:45:04Z)
- Poet: Product-oriented Video Captioner for E-commerce [124.9936946822493]
 In e-commerce, a growing number of user-generated videos are used for product promotion. Generating video descriptions that narrate the user-preferred product characteristics depicted in the video is vital for successful promotion.
We propose a product-oriented video captioner framework, abbreviated as Poet.
We show that Poet achieves consistent performance improvement over previous methods concerning generation quality, product aspects capturing, and lexical diversity.
 arXiv (2020-08-16T10:53:46Z)
- Comprehensive Information Integration Modeling Framework for Video Titling [124.11296128308396]
 We integrate comprehensive sources of information, including the content of consumer-generated videos, the narrative comment sentences supplied by consumers, and the product attributes, in an end-to-end modeling framework.
The proposed method consists of two processes: granular-level interaction modeling and abstraction-level story-line summarization.
We collect a large-scale dataset accordingly from real-world data in Taobao, a world-leading e-commerce platform.
 arXiv (2020-06-24T10:38:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     