Extending CLIP for Category-to-image Retrieval in E-commerce
- URL: http://arxiv.org/abs/2112.11294v1
- Date: Tue, 21 Dec 2021 15:33:23 GMT
- Title: Extending CLIP for Category-to-image Retrieval in E-commerce
- Authors: Mariya Hendriksen, Maurits Bleeker, Svitlana Vakulenko, Nanne van Noord, Ernst Kuiper, and Maarten de Rijke
- Abstract summary: E-commerce provides rich multimodal data that is barely leveraged in practice.
In practice, there is often a mismatch between a textual and a visual representation of a given category.
We introduce the task of category-to-image retrieval in e-commerce and propose a model for the task, CLIP-ITA.
- Score: 36.386210802938656
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: E-commerce provides rich multimodal data that is barely leveraged in
practice. One aspect of this data is a category tree that is used in search and
recommendation. However, in practice, during a user's session there is often a
mismatch between the textual and the visual representation of a given category.
Motivated by this problem, we introduce the task of category-to-image retrieval
in e-commerce and propose a model for the task, CLIP-ITA. The model leverages
information from multiple modalities (textual, visual, and attribute modality)
to create product representations. We explore how adding information from these
modalities impacts the model's performance. In particular, we observe that
CLIP-ITA significantly outperforms a comparable model that leverages only the
visual modality and a comparable model that leverages the visual and attribute
modalities.
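As a rough illustration of the setup described above, the sketch below projects pre-extracted textual, visual, and attribute features of a product into a shared space and ranks products against a category embedding by cosine similarity. The inputs, dimensions, and the simple average fusion are assumptions made for illustration only, not the CLIP-ITA implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ProductEncoder(nn.Module):
        """Hypothetical multi-modal product encoder: project pre-extracted
        text, image, and attribute features into one space and average them."""
        def __init__(self, d_text=512, d_img=512, d_attr=512, d_out=256):
            super().__init__()
            self.text_proj = nn.Linear(d_text, d_out)
            self.img_proj = nn.Linear(d_img, d_out)
            self.attr_proj = nn.Linear(d_attr, d_out)

        def forward(self, text_feat, img_feat, attr_feat):
            fused = torch.stack([
                self.text_proj(text_feat),
                self.img_proj(img_feat),
                self.attr_proj(attr_feat),
            ]).mean(dim=0)                      # average fusion (an assumption)
            return F.normalize(fused, dim=-1)   # unit-norm product embedding

    def rank_products_for_category(category_emb, product_embs):
        """Score every product against a category embedding by cosine similarity."""
        scores = product_embs @ F.normalize(category_emb, dim=-1)
        return scores.argsort(descending=True)  # indices of best-matching products

The comparison reported in the abstract (visual only, visual plus attribute, and all three modalities) corresponds in this sketch to averaging over fewer or more of the projected branches.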
Related papers
- CLIPErase: Efficient Unlearning of Visual-Textual Associations in CLIP [56.199779065855004]
We introduce CLIPErase, a novel approach that disentangles and selectively forgets both visual and textual associations.
Experiments on the CIFAR-100 and Flickr30K datasets demonstrate that CLIPErase effectively forgets designated associations in zero-shot tasks for multimodal samples.
arXiv Detail & Related papers (2024-10-30T17:51:31Z)
- ARMADA: Attribute-Based Multimodal Data Augmentation [93.05614922383822]
Attribute-based Multimodal Data Augmentation (ARMADA) is a novel multimodal data augmentation method based on knowledge-guided manipulation of visual attributes.
It extracts knowledge-grounded attributes from symbolic KBs to generate semantically consistent yet distinctive image-text pairs.
This also highlights the need to leverage external knowledge proxies for enhanced interpretability and real-world grounding.
arXiv Detail & Related papers (2024-08-19T15:27:25Z)
- OVMR: Open-Vocabulary Recognition with Multi-Modal References [96.21248144937627]
Existing works have proposed different methods to embed category cues into the model, e.g., through few-shot fine-tuning.
This paper tackles open-vocabulary recognition from a different perspective by referring to multi-modal clues composed of textual descriptions and exemplar images.
The proposed OVMR is a plug-and-play module, and works well with exemplar images randomly crawled from the Internet.
arXiv Detail & Related papers (2024-06-07T06:45:28Z)
- A Multi-Granularity Matching Attention Network for Query Intent Classification in E-commerce Retrieval [9.034096715927731]
This paper proposes a Multi-granularity Matching Attention Network (MMAN) for query intent classification.
MMAN contains three modules: a self-matching module, a char-level matching module, and a semantic-level matching module.
We conduct extensive offline and online A/B experiments, and the results show that MMAN significantly outperforms strong baselines.
arXiv Detail & Related papers (2023-03-28T10:25:17Z)
- Unified Vision-Language Representation Modeling for E-Commerce Same-Style Products Retrieval [12.588713044749177]
Same-style products retrieval plays an important role in e-commerce platforms.
We propose a unified vision-language modeling method for e-commerce same-style products retrieval.
It is capable of cross-modal product-to-product retrieval, as well as style transfer and user-interactive search.
arXiv Detail & Related papers (2023-02-10T07:24:23Z)
- e-CLIP: Large-Scale Vision-Language Representation Learning in E-commerce [9.46186546774799]
We propose a contrastive learning framework that aligns language and visual models using unlabeled raw product text and images.
We present techniques we used to train large-scale representation learning models and share solutions that address domain-specific challenges.
arXiv Detail & Related papers (2022-07-01T05:16:47Z)
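The e-CLIP entry above aligns language and visual models on unlabeled product text-image pairs with contrastive learning. Below is a minimal sketch of the standard CLIP-style symmetric contrastive objective; it assumes pre-computed, batch-aligned text and image embeddings and is not e-CLIP's exact training recipe.

    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(text_emb, img_emb, temperature=0.07):
        """Symmetric InfoNCE over a batch of (product text, product image) pairs."""
        text_emb = F.normalize(text_emb, dim=-1)
        img_emb = F.normalize(img_emb, dim=-1)
        logits = text_emb @ img_emb.t() / temperature     # (B, B) similarity matrix
        targets = torch.arange(logits.size(0), device=logits.device)
        loss_t2i = F.cross_entropy(logits, targets)       # match each text to its image
        loss_i2t = F.cross_entropy(logits.t(), targets)   # and each image to its text
        return (loss_t2i + loss_i2t) / 2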
- Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product Retrieval [152.3504607706575]
This research aims to conduct weakly-supervised multi-modal instance-level product retrieval for fine-grained product categories.
We first contribute the Product1M dataset and define two real, practical instance-level retrieval tasks.
We then train a more effective cross-modal model that adaptively incorporates key concept information from the multi-modal data.
arXiv Detail & Related papers (2022-06-17T15:40:45Z)
- Semantic Representation and Dependency Learning for Multi-Label Image Recognition [76.52120002993728]
We propose a novel and effective semantic representation and dependency learning (SRDL) framework to learn category-specific semantic representation for each category.
Specifically, we design a category-specific attentional regions (CAR) module that generates channel- and spatial-wise attention matrices to guide the model.
We also design an object erasing (OE) module to implicitly learn semantic dependency among categories by erasing semantic-aware regions.
arXiv Detail & Related papers (2022-04-08T00:55:15Z)
- PAM: Understanding Product Images in Cross Product Category Attribute Extraction [40.332066960433245]
This work proposes a more inclusive framework that fully utilizes different modalities for attribute extraction.
Inspired by recent works in visual question answering, we use a transformer-based sequence-to-sequence model to fuse representations of product text, Optical Character Recognition (OCR) tokens, and visual objects detected in the product image.
The framework is further extended with the capability to extract attribute value across multiple product categories with a single model.
arXiv Detail & Related papers (2021-06-08T18:30:17Z)
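The PAM entry above fuses product text, OCR tokens, and detected visual objects in a sequence-to-sequence model that generates attribute values. The sketch below shows one plausible way to concatenate the three modalities into a single transformer input; the module names, segment embeddings, and dimensions are illustrative assumptions, not the PAM architecture.

    import torch
    import torch.nn as nn

    class MultimodalAttributeSeq2Seq(nn.Module):
        """Hypothetical fusion of title tokens, OCR tokens, and region features
        into one source sequence; attribute values are decoded token by token."""
        def __init__(self, vocab_size=30000, d_model=256, d_region=2048):
            super().__init__()
            self.tok_emb = nn.Embedding(vocab_size, d_model)
            self.region_proj = nn.Linear(d_region, d_model)   # visual object features
            self.seg_emb = nn.Embedding(3, d_model)           # 0=title, 1=OCR, 2=visual
            self.seq2seq = nn.Transformer(d_model=d_model, batch_first=True)
            self.out = nn.Linear(d_model, vocab_size)

        def forward(self, title_ids, ocr_ids, region_feats, target_ids):
            title = self.tok_emb(title_ids) + self.seg_emb(torch.zeros_like(title_ids))
            ocr = self.tok_emb(ocr_ids) + self.seg_emb(torch.ones_like(ocr_ids))
            seg2 = torch.full(region_feats.shape[:2], 2,
                              dtype=torch.long, device=region_feats.device)
            regions = self.region_proj(region_feats) + self.seg_emb(seg2)
            src = torch.cat([title, ocr, regions], dim=1)     # fused multimodal input
            tgt = self.tok_emb(target_ids)
            return self.out(self.seq2seq(src, tgt))           # attribute-value logits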
- Large Scale Multimodal Classification Using an Ensemble of Transformer Models and Co-Attention [2.842794675894731]
We describe our methodology and results for the SIGIR eCom Rakuten Data Challenge.
We employ a dual attention technique to model image-text relationships using pretrained language and image embeddings.
arXiv Detail & Related papers (2020-11-23T21:22:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.