Unified Vision-Language Representation Modeling for E-Commerce
Same-Style Products Retrieval
- URL: http://arxiv.org/abs/2302.05093v1
- Date: Fri, 10 Feb 2023 07:24:23 GMT
- Title: Unified Vision-Language Representation Modeling for E-Commerce
Same-Style Products Retrieval
- Authors: Ben Chen, Linbo Jin, Xinxin Wang, Dehong Gao, Wen Jiang, Wei Ning
- Abstract summary: Same-style products retrieval plays an important role in e-commerce platforms.
We propose a unified vision-language modeling method for e-commerce same-style products retrieval.
It is capable of cross-modal product-to-product retrieval, as well as style transfer and user-interactive search.
- Score: 12.588713044749177
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Same-style products retrieval plays an important role in e-commerce
platforms, aiming to identify the same products, which may have different text
descriptions or images. It can be used to retrieve similar products from
different suppliers or to detect duplicate products from one supplier. Common
methods use the image as the detected object, but they consider only the visual
features, overlook the attribute information contained in the textual
descriptions, and perform weakly for products in industries where images carry
less information, such as machinery, hardware tools, and electronic components,
even if an additional text-matching module is added. In this paper, we propose
a unified vision-language modeling method for e-commerce same-style products
retrieval, which is designed to represent one product by both its textual
descriptions and its visual contents. It contains a sampling strategy that
collects positive pairs from user click logs under category and relevance
constraints, and a novel contrastive loss unit that models the image, text,
and image+text representations in one joint embedding space. The model is
capable of cross-modal product-to-product retrieval, as well as style transfer
and user-interactive search. Offline evaluations on annotated data demonstrate
its superior retrieval performance, and online tests show that it attracts
more clicks and conversions. Moreover, this model has already been deployed
online for similar products retrieval on alibaba.com, the largest B2B
e-commerce platform in the world.
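The contrastive loss unit described above boils down to pulling the image, text, and image+text embeddings of a matched product pair together in one space. Below is a minimal PyTorch sketch of such a multi-view contrastive loss; the encoder outputs, the additive fusion, and the temperature value are illustrative assumptions, not the paper's exact design.

```python
# Sketch of a multi-view contrastive unit: align image, text, and fused
# image+text embeddings of a positive product pair (a, b) mined from
# click logs. Fusion-by-addition and tau=0.07 are assumptions.
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Matched rows of anchor/positive are positives; the rest of the
    batch serves as in-batch negatives."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / tau                  # (B, B) similarities
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)

def joint_contrastive_loss(img_a, txt_a, img_b, txt_b):
    """All inputs are (B, dim) embeddings from the image/text encoders."""
    fused_a = img_a + txt_a                               # stand-in fusion module
    fused_b = img_b + txt_b
    views_a, views_b = [img_a, txt_a, fused_a], [img_b, txt_b, fused_b]
    # Align every view of product a with every view of product b.
    losses = [info_nce(va, vb) for va in views_a for vb in views_b]
    return torch.stack(losses).mean()
```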
Related papers
- Unified Text-to-Image Generation and Retrieval [96.72318842152148]
We propose a unified framework in the context of Multimodal Large Language Models (MLLMs).
We first explore the intrinsic discriminative abilities of MLLMs and introduce a generative retrieval method to perform retrieval in a training-free manner.
We then unify generation and retrieval in an autoregressive generation way and propose an autonomous decision module to choose the best-matched one between generated and retrieved images.
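The final selection step reduces to scoring the generated and the retrieved candidate against the query and keeping the better one. A minimal sketch under the assumption of precomputed CLIP-style embeddings, with cosine similarity standing in for the paper's MLLM-based decision module:

```python
# Toy decision step: keep whichever candidate image (generated vs.
# retrieved) is closer to the text query in embedding space. Cosine
# similarity here is a stand-in for the paper's decision module.
import torch
import torch.nn.functional as F

def choose_image(text_emb: torch.Tensor,
                 generated_emb: torch.Tensor,
                 retrieved_emb: torch.Tensor) -> str:
    sims = {
        "generated": F.cosine_similarity(text_emb, generated_emb, dim=-1).item(),
        "retrieved": F.cosine_similarity(text_emb, retrieved_emb, dim=-1).item(),
    }
    return max(sims, key=sims.get)   # label of the better-matched image
```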
arXiv Detail & Related papers (2024-06-09T15:00:28Z)
- A Multimodal In-Context Tuning Approach for E-Commerce Product Description Generation [47.70824723223262]
We propose a new setting for generating product descriptions from images, augmented by marketing keywords.
We present a simple and effective Multimodal In-Context Tuning approach, named ModICT, which introduces a similar product sample as the reference.
Experiments demonstrate that ModICT significantly improves the accuracy (by up to 3.3% on Rouge-L) and diversity (by up to 9.4% on D-5) of generated results compared to conventional methods.
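As a rough illustration of the in-context setup, the sketch below assembles a prompt that pairs a retrieved similar product (image placeholder plus its description) with the target image and marketing keywords; the placeholder tokens and field layout are hypothetical, not ModICT's actual template.

```python
# Hypothetical multimodal in-context prompt: a similar product acts as
# the one-shot reference for describing the target product.
def build_prompt(ref_description: str, keywords: list[str]) -> str:
    return (
        "Reference product image: <ref_image>\n"
        f"Reference description: {ref_description}\n"
        "Target product image: <target_image>\n"
        f"Marketing keywords: {', '.join(keywords)}\n"
        "Target description:"
    )

print(build_prompt("Slim-fit cotton shirt with a breathable weave.",
                   ["lightweight", "summer", "office wear"]))
```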
arXiv Detail & Related papers (2024-02-21T07:38:29Z)
- Transformer-empowered Multi-modal Item Embedding for Enhanced Image Search in E-Commerce [20.921870288665627]
Multi-modal Item Embedding Model (MIEM) is capable of utilizing both textual information and multiple images about a product to construct meaningful product features.
MIEM has become an integral part of the Shopee image search platform.
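A minimal sketch of the general idea, fusing one text embedding with several image embeddings into a single item vector via attention pooling; the pooling choice and dimensions are assumptions, not MIEM's published architecture.

```python
# Fuse one text embedding with a variable number of image embeddings.
# A learned query attention-pools the images; sizes are placeholders.
import torch
import torch.nn as nn

class ItemEmbedder(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.pool = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, text_emb: torch.Tensor, image_embs: torch.Tensor) -> torch.Tensor:
        # text_emb: (B, dim); image_embs: (B, n_images, dim)
        q = self.query.expand(image_embs.size(0), -1, -1)
        pooled, _ = self.pool(q, image_embs, image_embs)  # (B, 1, dim)
        return self.proj(torch.cat([text_emb, pooled.squeeze(1)], dim=-1))
```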
arXiv Detail & Related papers (2023-11-29T08:09:50Z)
- Mutual Query Network for Multi-Modal Product Image Segmentation [13.192334066413837]
We propose a mutual query network to segment products based on both visual and linguistic modalities.
To promote the research in this field, we also construct a Multi-Modal Product Segmentation dataset (MMPS).
The proposed method significantly outperforms the state-of-the-art methods on MMPS.
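A toy sketch of the mutual-query idea, with each modality cross-attending to the other before the segmentation head; the layer shapes and residual updates are assumptions.

```python
# One mutual-query block: vision features query language features and
# vice versa; both streams are updated residually. Shapes are placeholders.
import torch
import torch.nn as nn

class MutualQueryBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.vis_q = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lang_q = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis: torch.Tensor, lang: torch.Tensor):
        # vis: (B, HW, dim) flattened image features; lang: (B, L, dim)
        vis_upd, _ = self.vis_q(vis, lang, lang)    # vision queries language
        lang_upd, _ = self.lang_q(lang, vis, vis)   # language queries vision
        return vis + vis_upd, lang + lang_upd
```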
arXiv Detail & Related papers (2023-06-26T03:18:38Z)
- Product Information Extraction using ChatGPT [69.12244027050454]
This paper explores the potential of ChatGPT for extracting attribute/value pairs from product descriptions.
Our results show that ChatGPT achieves a performance similar to a pre-trained language model but requires much smaller amounts of training data and computation for fine-tuning.
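A minimal sketch of such an extraction call with the openai Python client; the prompt wording, the JSON output contract, and the model name are illustrative assumptions, not the paper's exact setup.

```python
# Prompt a chat model for attribute/value pairs from a product description.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_attributes(description: str) -> str:
    prompt = (
        "Extract attribute/value pairs from the product description below. "
        'Answer as JSON, e.g. {"color": "black", "material": "steel"}.\n\n'
        f"Description: {description}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```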
arXiv Detail & Related papers (2023-06-23T09:30:01Z)
- Named Entity and Relation Extraction with Multi-Modal Retrieval [51.660650522630526]
Multi-modal named entity recognition (NER) and relation extraction (RE) aim to leverage relevant image information to improve the performance of NER and RE.
We propose a novel Multi-modal Retrieval-based framework (MoRe).
MoRe contains a text retrieval module and an image-based retrieval module, which retrieve related knowledge of the input text and image in the knowledge corpus respectively.
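Schematically, each retrieval module is a nearest-neighbor lookup whose hits are prepended to the tagger input; the toy NumPy sketch below assumes precomputed embeddings for the knowledge corpus and for the text/image queries.

```python
# Toy retrieval-augmented input construction: fetch top-k passages for
# the text query and the image query, then prepend them to the input.
import numpy as np

def retrieve(query_emb: np.ndarray, corpus_embs: np.ndarray,
             passages: list[str], k: int = 3) -> list[str]:
    sims = corpus_embs @ query_emb / (
        np.linalg.norm(corpus_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8)
    return [passages[i] for i in np.argsort(-sims)[:k]]

def build_tagger_input(text: str, text_hits: list[str], image_hits: list[str]) -> str:
    return " ".join(text_hits + image_hits) + " [SEP] " + text
```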
arXiv Detail & Related papers (2022-12-03T13:11:32Z)
- e-CLIP: Large-Scale Vision-Language Representation Learning in E-commerce [9.46186546774799]
We propose a contrastive learning framework that aligns language and visual models using unlabeled raw product text and images.
We present techniques we used to train large-scale representation learning models and share solutions that address domain-specific challenges.
arXiv Detail & Related papers (2022-07-01T05:16:47Z)
- Automatic Generation of Product-Image Sequence in E-commerce [46.06263129000091]
Multi-modality Unified Image-sequence Classifier (MUIsC) is able to simultaneously detect all categories of rule violations through learning.
By Dec 2021, our AGPIS framework has generated high-standard images for about 1.5 million products and achieves a 13.6% rejection rate.
arXiv Detail & Related papers (2022-06-26T23:38:42Z)
- Extending CLIP for Category-to-image Retrieval in E-commerce [36.386210802938656]
E-commerce provides rich multimodal data that is barely leveraged in practice.
In practice, there is often a mismatch between a textual and a visual representation of a given category.
We introduce the task of category-to-image retrieval in e-commerce and propose a model for the task, CLIP-ITA.
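As a baseline illustration of the task, the sketch below ranks candidate product images against a category name with an off-the-shelf CLIP checkpoint from the transformers library; CLIP-ITA itself extends such a backbone with attribute and title information.

```python
# Rank product images for a category name with vanilla CLIP.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rank_images(category: str, images: list[Image.Image]) -> list[int]:
    inputs = processor(text=[category], images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        sims = model(**inputs).logits_per_text.squeeze(0)  # (n_images,)
    return sims.argsort(descending=True).tolist()          # best match first
```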
arXiv Detail & Related papers (2021-12-21T15:33:23Z)
- PAM: Understanding Product Images in Cross Product Category Attribute Extraction [40.332066960433245]
This work proposes a more inclusive framework that fully utilizes different modalities for attribute extraction.
Inspired by recent works in visual question answering, we use a transformer-based sequence-to-sequence model to fuse representations of product text, Optical Character Recognition (OCR) tokens, and visual objects detected in the product image.
The framework is further extended with the capability to extract attribute values across multiple product categories with a single model.
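A toy illustration of the fusion input: text tokens, OCR tokens, and detected visual objects are embedded separately, tagged with a modality-type embedding, and concatenated into one encoder sequence. The shared vocabulary, detector feature size, and dimensions are placeholder assumptions.

```python
# Build one fused encoder sequence from three modalities.
import torch
import torch.nn as nn

class ModalityFusion(nn.Module):
    def __init__(self, vocab: int = 30000, dim: int = 512, obj_feat_dim: int = 2048):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)      # shared for text and OCR tokens
        self.obj = nn.Linear(obj_feat_dim, dim)  # project detector region features
        self.typ = nn.Embedding(3, dim)          # 0 = text, 1 = OCR, 2 = object

    def forward(self, text_ids, ocr_ids, obj_feats):
        # text_ids: (B, Lt) int; ocr_ids: (B, Lo) int; obj_feats: (B, Lv, obj_feat_dim)
        t = self.tok(text_ids) + self.typ(torch.zeros_like(text_ids))
        o = self.tok(ocr_ids) + self.typ(torch.ones_like(ocr_ids))
        v = self.obj(obj_feats) + self.typ.weight[2]
        return torch.cat([t, o, v], dim=1)       # (B, Lt+Lo+Lv, dim)
```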
arXiv Detail & Related papers (2021-06-08T18:30:17Z)
- Automatic Validation of Textual Attribute Values in E-commerce Catalog by Learning with Limited Labeled Data [61.789797281676606]
We propose a novel meta-learning latent variable approach, called MetaBridge.
It can learn transferable knowledge from a subset of categories with limited labeled data.
It can capture the uncertainty of never-seen categories with unlabeled data.
arXiv Detail & Related papers (2020-06-15T21:31:05Z)