e-CLIP: Large-Scale Vision-Language Representation Learning in
E-commerce
- URL: http://arxiv.org/abs/2207.00208v1
- Date: Fri, 1 Jul 2022 05:16:47 GMT
- Title: e-CLIP: Large-Scale Vision-Language Representation Learning in
E-commerce
- Authors: Wonyoung Shin, Jonghun Park, Taekang Woo, Yongwoo Cho, Kwangjin Oh,
Hwanjun Song
- Abstract summary: We propose a contrastive learning framework that aligns language and visual models using unlabeled raw product text and images.
We present techniques we used to train large-scale representation learning models and share solutions that address domain-specific challenges.
- Score: 9.46186546774799
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding vision and language representations of product content is vital
for search and recommendation applications in e-commerce. As a backbone for
online shopping platforms and inspired by the recent success in representation
learning research, we propose a contrastive learning framework that aligns
language and visual models using unlabeled raw product text and images. We
present techniques we used to train large-scale representation learning models
and share solutions that address domain-specific challenges. We study the
performance using our pre-trained model as backbones for diverse downstream
tasks, including category classification, attribute extraction, product
matching, product clustering, and adult product recognition. Experimental
results show that our proposed method outperforms the baseline on each
downstream task, in both single-modality and multi-modality settings.
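The abstract describes a CLIP-style contrastive objective that pulls matching product image and title embeddings together and pushes non-matching pairs apart. The sketch below illustrates such a symmetric contrastive (InfoNCE) loss; the embedding size, batch size, and temperature are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a CLIP-style symmetric contrastive loss over a batch of
# (product image, product title) embedding pairs. Dimensions and the
# temperature value are assumptions for illustration only.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(image_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss: matching pairs sit on the diagonal."""
    # L2-normalize so the dot product equals cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Similarity matrix: entry (i, j) compares image i with title j.
    logits = image_emb @ text_emb.t() / temperature

    # The i-th image matches the i-th title; all other titles are negatives.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)

# Example with random 512-d embeddings for a batch of 8 products.
if __name__ == "__main__":
    img = torch.randn(8, 512)
    txt = torch.randn(8, 512)
    print(contrastive_alignment_loss(img, txt))
```

Once pre-trained this way, the image and text encoders can serve as backbones for the downstream tasks listed in the abstract, typically via a lightweight head on top of the embeddings.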
Related papers
- Towards More Unified In-context Visual Understanding [74.55332581979292]
We present a new in-context learning (ICL) framework for visual understanding that enables multi-modal output.
First, we quantize and embed both text and visual prompts into a unified representational space.
Then a decoder-only sparse transformer architecture is employed to perform generative modeling on them.
arXiv Detail & Related papers (2023-12-05T06:02:21Z)
- CLAP: Isolating Content from Style through Contrastive Learning with Augmented Prompts [11.752632557524969]
We propose contrastive learning with data augmentation to disentangle content features from the original representations.
Our experiments across diverse datasets demonstrate significant improvements in zero-shot and few-shot classification tasks.
arXiv Detail & Related papers (2023-11-28T03:00:59Z)
- ITEm: Unsupervised Image-Text Embedding Learning for eCommerce [9.307841602452678]
Product embedding serves as a cornerstone for a wide range of applications in eCommerce.
We present an image-text embedding model (ITEm) that is designed to better attend to image and text modalities.
We evaluate the pre-trained ITEm on two tasks: the search for extremely similar products and the prediction of product categories.
arXiv Detail & Related papers (2023-10-22T15:39:44Z)
- RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight cross-modal MLP.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z)
- UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z)
- Efficient Large-Scale Visual Representation Learning And Evaluation [0.13192560874022083]
We describe challenges in e-commerce vision applications at scale and highlight methods to efficiently train, evaluate, and serve visual representations.
We present ablation studies evaluating visual representations in several downstream tasks.
We include online results from machine learning systems deployed in production on a large-scale e-commerce platform.
arXiv Detail & Related papers (2023-05-22T18:25:03Z)
- Unified Vision-Language Representation Modeling for E-Commerce Same-Style Products Retrieval [12.588713044749177]
Same-style products retrieval plays an important role in e-commerce platforms.
We propose a unified vision-language modeling method for e-commerce same-style products retrieval.
It is capable of cross-modal product-to-product retrieval, as well as style transfer and user-interactive search; a generic embedding-similarity retrieval sketch appears after this list.
arXiv Detail & Related papers (2023-02-10T07:24:23Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment [146.3128011522151]
We propose an Omni Crossmodal Learning method equipped with a Video Proxy mechanism, built on CLIP and named CLIP-ViP.
Our approach improves the performance of CLIP on video-text retrieval by a large margin.
Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet.
arXiv Detail & Related papers (2022-09-14T05:47:02Z)
- Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product Retrieval [152.3504607706575]
This research aims to conduct weakly-supervised multi-modal instance-level product retrieval for fine-grained product categories.
We first contribute the Product1M dataset and define two realistic, practical instance-level retrieval tasks.
We then train a more effective cross-modal model that adaptively incorporates key concept information from the multi-modal data.
arXiv Detail & Related papers (2022-06-17T15:40:45Z)
- eProduct: A Million-Scale Visual Search Benchmark to Address Product Recognition Challenges [8.204924070199866]
eProduct is a benchmark dataset for training and evaluating visual search solutions in a real-world setting.
eProduct consists of a training set and an evaluation set; the training set contains 1.3M+ listing images with titles and hierarchical category labels for model development.
We describe eProduct's construction steps, analyze its diversity, and report the performance of baseline models trained on it.
arXiv Detail & Related papers (2021-07-13T05:28:34Z)
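As noted in the same-style products retrieval entry above, a common way to use such dual-encoder embeddings is cross-modal nearest-neighbor retrieval: embed a query (image or title), L2-normalize, and rank a pre-computed catalog by cosine similarity. The sketch below is a generic illustration under assumed shapes and names, not an implementation from any of the listed papers.

```python
# Generic sketch of cross-modal product retrieval over pre-computed embeddings.
# Catalog size, embedding dimension, and function names are assumptions.
import torch
import torch.nn.functional as F

def retrieve_top_k(query_emb: torch.Tensor,
                   catalog_emb: torch.Tensor,
                   k: int = 5) -> torch.Tensor:
    """Return indices of the k catalog items most similar to the query."""
    query = F.normalize(query_emb, dim=-1)      # shape (d,)
    catalog = F.normalize(catalog_emb, dim=-1)  # shape (N, d)
    scores = catalog @ query                    # cosine similarities, shape (N,)
    return scores.topk(k).indices

# Example: a text-encoded query searched against an image-encoded catalog.
if __name__ == "__main__":
    catalog = torch.randn(1000, 512)  # e.g. image embeddings of 1000 products
    query = torch.randn(512)          # e.g. a title embedding
    print(retrieve_top_k(query, catalog, k=5))
```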
This list is automatically generated from the titles and abstracts of the papers on this site.