Product1M: Towards Weakly Supervised Instance-Level Product Retrieval
via Cross-modal Pretraining
- URL: http://arxiv.org/abs/2107.14572v1
- Date: Fri, 30 Jul 2021 12:11:24 GMT
- Title: Product1M: Towards Weakly Supervised Instance-Level Product Retrieval
via Cross-modal Pretraining
- Authors: Xunlin Zhan, Yangxin Wu, Xiao Dong, Yunchao Wei, Minlong Lu, Yichi
Zhang, Hang Xu, Xiaodan Liang
- Abstract summary: We investigate a more realistic setting that aims to perform weakly-supervised multi-modal instance-level product retrieval.
We contribute Product1M, one of the largest multi-modal cosmetic datasets for real-world instance-level retrieval.
We propose a novel model named Cross-modal contrAstive Product Transformer for instance-level prodUct REtrieval (CAPTURE)
- Score: 108.86502855439774
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Nowadays, customer's demands for E-commerce are more diversified, which
introduces more complications to the product retrieval industry. Previous
methods are either subject to single-modal input or perform supervised
image-level product retrieval, thus fail to accommodate real-life scenarios
where enormous weakly annotated multi-modal data are present. In this paper, we
investigate a more realistic setting that aims to perform weakly-supervised
multi-modal instance-level product retrieval among fine-grained product
categories. To promote the study of this challenging task, we contribute
Product1M, one of the largest multi-modal cosmetic datasets for real-world
instance-level retrieval. Notably, Product1M contains over 1 million
image-caption pairs and consists of two sample types, i.e., single-product and
multi-product samples, which encompass a wide variety of cosmetics brands. In
addition to the great diversity, Product1M enjoys several appealing
characteristics including fine-grained categories, complex combinations, and
fuzzy correspondence that well mimic the real-world scenes. Moreover, we
propose a novel model named Cross-modal contrAstive Product Transformer for
instance-level prodUct REtrieval (CAPTURE), that excels in capturing the
potential synergy between multi-modal inputs via a hybrid-stream transformer in
a self-supervised manner.CAPTURE generates discriminative instance features via
masked multi-modal learning as well as cross-modal contrastive pretraining and
it outperforms several SOTA cross-modal baselines. Extensive ablation studies
well demonstrate the effectiveness and the generalization capacity of our
model.
Related papers
- MESEN: Exploit Multimodal Data to Design Unimodal Human Activity Recognition with Few Labels [11.853566358505434]
MESEN is a multimodal-empowered unimodal sensing framework.
Mesen exploits unlabeled multimodal data to extract effective unimodal features for each modality.
Mesen achieves significant performance improvements over state-of-the-art baselines.
arXiv Detail & Related papers (2024-04-02T13:54:05Z) - End-to-end multi-modal product matching in fashion e-commerce [0.6047429555885261]
We present a robust multi-modal product matching system in an industry setting.
We show how a human-in-the-loop process can be combined with model-based predictions to achieve near perfect precision.
arXiv Detail & Related papers (2024-03-18T09:12:16Z) - Unified Multi-modal Unsupervised Representation Learning for
Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z) - MMAPS: End-to-End Multi-Grained Multi-Modal Attribute-Aware Product
Summarization [93.5217515566437]
Multi-modal Product Summarization (MPS) aims to increase customers' desire to purchase by highlighting product characteristics.
Existing MPS methods can produce promising results, but they still lack end-to-end product summarization.
We propose an end-to-end multi-modal attribute-aware product summarization method (MMAPS) for generating high-quality product summaries in e-commerce.
arXiv Detail & Related papers (2023-08-22T11:00:09Z) - FM-ViT: Flexible Modal Vision Transformers for Face Anti-Spoofing [88.6654909354382]
We present a pure transformer-based framework, dubbed the Flexible Modal Vision Transformer (FM-ViT) for face anti-spoofing.
FM-ViT can flexibly target any single-modal (i.e., RGB) attack scenarios with the help of available multi-modal data.
Experiments demonstrate that the single model trained based on FM-ViT can not only flexibly evaluate different modal samples, but also outperforms existing single-modal frameworks by a large margin.
arXiv Detail & Related papers (2023-05-05T04:28:48Z) - Align and Attend: Multimodal Summarization with Dual Contrastive Losses [57.83012574678091]
The goal of multimodal summarization is to extract the most important information from different modalities to form output summaries.
Existing methods fail to leverage the temporal correspondence between different modalities and ignore the intrinsic correlation between different samples.
We introduce Align and Attend Multimodal Summarization (A2Summ), a unified multimodal transformer-based model which can effectively align and attend the multimodal input.
arXiv Detail & Related papers (2023-03-13T17:01:42Z) - Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product
Retrieval [152.3504607706575]
This research aims to conduct weakly-supervised multi-modal instance-level product retrieval for fine-grained product categories.
We first contribute the Product1M datasets, and define two real practical instance-level retrieval tasks.
We exploit to train a more effective cross-modal model which is adaptively capable of incorporating key concept information from the multi-modal data.
arXiv Detail & Related papers (2022-06-17T15:40:45Z) - A Multimodal Late Fusion Model for E-Commerce Product Classification [7.463657960984954]
In this study, we investigated a multimodal late fusion approach based on text and image modalities to categorize e-commerce products on Rakuten.
Specifically, we developed modal specific state-of-the-art deep neural networks for each input modal, and then fused them at the decision level.
Our team named pa_curis won the 1st place with a macro-F1 of 0.9144 on the final leaderboard.
arXiv Detail & Related papers (2020-08-14T03:46:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.