Related papers: Catalog Phrase Grounding (CPG): Grounding of Product Textual Attributes in Product Images for e-commerce Vision-Language Applications

URL: http://arxiv.org/abs/2308.16354v1
Date: Wed, 30 Aug 2023 23:02:26 GMT
Title: Catalog Phrase Grounding (CPG): Grounding of Product Textual Attributes in Product Images for e-commerce Vision-Language Applications
Authors: Wenyi Wu, Karim Bouyarmane, Ismail Tutar
Abstract summary: We present Catalog Phrase Grounding (CPG), a model that can associate product textual data (title, brands) into corresponding regions of product images. We train the model in self-supervised fashion with 2.3 million image-text pairs synthesized from an e-commerce site. Experiments show that incorporating CPG representations into the existing production ensemble system leads to on average 5% recall improvement across all countries globally.
Score: 4.705291741591329
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present Catalog Phrase Grounding (CPG), a model that can associate product textual data (title, brands) into corresponding regions of product images (isolated product region, brand logo region) for e-commerce vision-language applications. We use a state-of-the-art modulated multimodal transformer encoder-decoder architecture unifying object detection and phrase-grounding. We train the model in self-supervised fashion with 2.3 million image-text pairs synthesized from an e-commerce site. The self-supervision data is annotated with high-confidence pseudo-labels generated with a combination of teacher models: a pre-trained general domain phrase grounding model (e.g. MDETR) and a specialized logo detection model. This allows CPG, as a student model, to benefit from transfer knowledge from these base models combining general-domain knowledge and specialized knowledge. Beyond immediate catalog phrase grounding tasks, we can benefit from CPG representations by incorporating them as ML features into downstream catalog applications that require deep semantic understanding of products. Our experiments on product-brand matching, a challenging e-commerce application, show that incorporating CPG representations into the existing production ensemble system leads to on average 5% recall improvement across all countries globally (with the largest lift of 11% in a single country) at fixed 95% precision, outperforming other alternatives including a logo detection teacher model and ResNet50.
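The abstract reports recall "at fixed 95% precision". As a minimal sketch of how such a metric is commonly computed (the paper does not describe its evaluation code, so the function name `recall_at_precision`, its arguments, and the toy data are all illustrative assumptions), one can sweep score thresholds and keep the highest recall among thresholds whose precision meets the target:

```python
from typing import List, Tuple


def recall_at_precision(
    scores: List[float],
    labels: List[int],
    min_precision: float = 0.95,
) -> Tuple[float, float]:
    """Return (best_recall, threshold) over thresholds with precision >= min_precision.

    Hypothetical helper for illustration only; not taken from the CPG paper.
    `labels` holds 1 for a true match and 0 otherwise.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")

    # Rank predictions from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)

    best_recall, best_threshold = 0.0, float("inf")
    tp = fp = 0
    i, n = 0, len(ranked)
    while i < n:
        threshold = ranked[i][0]
        # Accept every prediction tied at this score before measuring.
        while i < n and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        precision = tp / (tp + fp)
        recall = tp / total_pos
        if precision >= min_precision and recall > best_recall:
            best_recall, best_threshold = recall, threshold
    return best_recall, best_threshold


# Toy data (made up): five scored predictions, three of them correct.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_labels = [1, 1, 0, 1, 0]
recall, threshold = recall_at_precision(toy_scores, toy_labels, 0.95)
```

Comparing two systems at the same precision floor, as the abstract does, isolates the gain in coverage: a higher recall here means more correct matches are recovered without accepting more false ones.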

Related papers

Building a Few-Shot Cross-Domain Multilingual NLU Model for Customer Care [1.0129089187146396]
SOTA pre-trained models like multilingual-BERT, fine-tuned on annotated data have shown good performance in downstream tasks relevant to Customer Care. We propose an embedder-cum-classifier model architecture which extends state-of-the-art domain-specific models to other domains with only a few labeled samples. Experiments on Canada and Mexico e-commerce Customer Care dataset with few-shot intent detection show an increase in accuracy by 20-23%.
arXiv Detail & Related papers (2025-06-04T19:14:48Z)
An Interpretable Ensemble of Graph and Language Models for Improving Search Relevance in E-Commerce [22.449320058423886]
We propose Plug and Play Graph LAnguage Model (PP-GLAM), an explainable ensemble of plug and play models. Our approach uses a modular framework with uniform data processing pipelines. We show that PP-GLAM outperforms several state-of-the-art baselines and a proprietary model on real-world multilingual, multi-regional e-commerce datasets.
arXiv Detail & Related papers (2024-03-01T19:08:25Z)
A Multimodal In-Context Tuning Approach for E-Commerce Product Description Generation [47.70824723223262]
We propose a new setting for generating product descriptions from images, augmented by marketing keywords. We present a simple and effective Multimodal In-Context Tuning approach, named ModICT, which introduces a similar product sample as the reference. Experiments demonstrate that ModICT significantly improves the accuracy (by up to 3.3% on Rouge-L) and diversity (by up to 9.4% on D-5) of generated results compared to conventional methods.
arXiv Detail & Related papers (2024-02-21T07:38:29Z)
Localized Symbolic Knowledge Distillation for Visual Commonsense Models [150.18129140140238]
We build Localized Visual Commonsense models, which allow users to specify (multiple) regions as input. We train our model by sampling localized commonsense knowledge from a large language model. We find that training on the localized commonsense corpus can successfully distill existing vision-language models to support a reference-as-input interface.
arXiv Detail & Related papers (2023-12-08T05:23:50Z)
Part-Aware Transformer for Generalizable Person Re-identification [138.99827526048205]
Domain generalization person re-identification (DG-ReID) aims to train a model on source domains and generalize well on unseen domains. We propose a pure Transformer model (termed Part-aware Transformer) for DG-ReID by designing a proxy task, named Cross-ID Similarity Learning (CSL). This proxy task allows the model to learn generic features because it only cares about the visual similarity of the parts regardless of the ID labels.
arXiv Detail & Related papers (2023-08-07T06:15:51Z)
Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data. We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information. With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z)
DATE: Domain Adaptive Product Seeker for E-commerce [75.25578276795383]
Product Retrieval (PR) and Grounding (PG) aim to seek image and object-level products respectively according to a textual query. We propose a Domain Adaptive Product Seeker (DATE) framework, regarding PR and PG as Product Seeking problem at different levels.
arXiv Detail & Related papers (2023-04-07T14:40:16Z)
e-CLIP: Large-Scale Vision-Language Representation Learning in E-commerce [9.46186546774799]
We propose a contrastive learning framework that aligns language and visual models using unlabeled raw product text and images. We present techniques we used to train large-scale representation learning models and share solutions that address domain-specific challenges.
arXiv Detail & Related papers (2022-07-01T05:16:47Z)
Automatic Generation of Product-Image Sequence in E-commerce [46.06263129000091]
Multi-modality Unified Image-sequence Classifier (MUIsC) is able to simultaneously detect all categories of rule violations through learning. By Dec 2021, our AGPIS framework has generated high-standard images for about 1.5 million products and achieves 13.6% in reject rate.
arXiv Detail & Related papers (2022-06-26T23:38:42Z)
An Improved Deep Learning Approach For Product Recognition on Racks in Retail Stores [2.470815298095903]
Automated product recognition in retail stores is an important real-world application in the domain of Computer Vision and Pattern Recognition. We develop a two-stage object detection and recognition pipeline comprising a Faster-RCNN-based object localizer and a ResNet-18-based image encoder. Each of the models is fine-tuned using appropriate data sets for better prediction and data augmentation is performed on each query image to prepare an extensive gallery set for fine-tuning the ResNet-18-based product recognition model.
arXiv Detail & Related papers (2022-02-26T06:51:36Z)
Boosting Few-shot Semantic Segmentation with Transformers [81.43459055197435]
We propose a TRansformer-based Few-shot Semantic segmentation method (TRFS). Our model consists of two modules: Global Enhancement Module (GEM) and Local Enhancement Module (LEM).
arXiv Detail & Related papers (2021-08-04T20:09:21Z)
Automatic Validation of Textual Attribute Values in E-commerce Catalog by Learning with Limited Labeled Data [61.789797281676606]
We propose a novel meta-learning latent variable approach, called MetaBridge. It can learn transferable knowledge from a subset of categories with limited labeled data. It can capture the uncertainty of never-seen categories with unlabeled data.
arXiv Detail & Related papers (2020-06-15T21:31:05Z)

This list is automatically generated from the titles and abstracts of the papers on this site.