Product1M: Towards Weakly Supervised Instance-Level Product Retrieval
via Cross-modal Pretraining
- URL: http://arxiv.org/abs/2107.14572v1
- Date: Fri, 30 Jul 2021 12:11:24 GMT
- Title: Product1M: Towards Weakly Supervised Instance-Level Product Retrieval
via Cross-modal Pretraining
- Authors: Xunlin Zhan, Yangxin Wu, Xiao Dong, Yunchao Wei, Minlong Lu, Yichi
Zhang, Hang Xu, Xiaodan Liang
- Abstract summary: We investigate a more realistic setting that aims to perform weakly-supervised multi-modal instance-level product retrieval.
We contribute Product1M, one of the largest multi-modal cosmetic datasets for real-world instance-level retrieval.
We propose a novel model named Cross-modal contrAstive Product Transformer for instance-level prodUct REtrieval (CAPTURE)
- Score: 108.86502855439774
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Nowadays, customers' demands in E-commerce are increasingly diversified, which
introduces more complications for the product retrieval industry. Previous
methods are either restricted to single-modal input or perform supervised
image-level product retrieval, and thus fail to accommodate real-life scenarios
where enormous amounts of weakly annotated multi-modal data are present. In this paper, we
investigate a more realistic setting that aims to perform weakly-supervised
multi-modal instance-level product retrieval among fine-grained product
categories. To promote the study of this challenging task, we contribute
Product1M, one of the largest multi-modal cosmetic datasets for real-world
instance-level retrieval. Notably, Product1M contains over 1 million
image-caption pairs and consists of two sample types, i.e., single-product and
multi-product samples, which encompass a wide variety of cosmetics brands. In
addition to the great diversity, Product1M enjoys several appealing
characteristics including fine-grained categories, complex combinations, and
fuzzy correspondence that well mimic the real-world scenes. Moreover, we
propose a novel model named Cross-modal contrAstive Product Transformer for
instance-level prodUct REtrieval (CAPTURE), which excels in capturing the
potential synergy between multi-modal inputs via a hybrid-stream transformer in
a self-supervised manner. CAPTURE generates discriminative instance features via
masked multi-modal learning as well as cross-modal contrastive pretraining and
it outperforms several SOTA cross-modal baselines. Extensive ablation studies
well demonstrate the effectiveness and the generalization capacity of our
model.
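For intuition, the cross-modal contrastive pretraining mentioned above can be illustrated with a minimal sketch of a symmetric image-text InfoNCE objective in PyTorch. This is an illustration under assumptions, not CAPTURE's released code: the function name, temperature value, and embedding shapes are placeholders, and the masked multi-modal learning objective is omitted.

import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # img_emb, txt_emb: (batch, dim) features for matched image-caption pairs.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature        # pairwise cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)          # image -> matching caption
    loss_t2i = F.cross_entropy(logits.t(), targets)      # caption -> matching image
    return 0.5 * (loss_i2t + loss_t2i)

# Example: a batch of 8 matched pairs with 256-dimensional features.
loss = cross_modal_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))

Each matched image-caption pair acts as a positive while all other pairings in the batch serve as negatives, which encourages discriminative, modality-aligned instance features.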
Related papers
- Cross-Modal Few-Shot Learning: a Generative Transfer Learning Framework [58.362064122489166]
This paper introduces the Cross-modal Few-Shot Learning task, which aims to recognize instances from multiple modalities when only a few labeled examples are available.
We propose a Generative Transfer Learning framework consisting of two stages: the first involves training on abundant unimodal data, and the second focuses on transfer learning to adapt to novel data.
Our findings demonstrate that GTL achieves superior performance compared to state-of-the-art methods across four distinct multi-modal datasets.
arXiv Detail & Related papers (2024-10-14T16:09:38Z) - Adapting Segment Anything Model to Multi-modal Salient Object Detection with Semantic Feature Fusion Guidance [15.435695491233982]
We propose a novel framework to explore and exploit the powerful feature representation and zero-shot generalization ability of the Segment Anything Model (SAM) for multi-modal salient object detection (SOD)
We develop SAM with semantic feature fusion guidance (Sammese).
In the image encoder, a multi-modal adapter is proposed to adapt the single-modal SAM to multi-modal information. Specifically, in the mask decoder, a semantic-geometric
arXiv Detail & Related papers (2024-08-27T13:47:31Z) - ASR-enhanced Multimodal Representation Learning for Cross-Domain Product Retrieval [28.13183873658186]
E-commerce is increasingly multimedia-enriched, with products exhibited in a broad-domain manner as images, short videos, or live stream promotions.
Due to large intra-product variance and high inter-product similarity in the broad-domain scenario, a visual-only representation is inadequate.
We propose ASR-enhanced Multimodal Product Representation Learning (AMPere)
arXiv Detail & Related papers (2024-08-06T06:24:10Z) - MESEN: Exploit Multimodal Data to Design Unimodal Human Activity Recognition with Few Labels [11.853566358505434]
MESEN is a multimodal-empowered unimodal sensing framework.
MESEN exploits unlabeled multimodal data to extract effective unimodal features for each modality.
MESEN achieves significant performance improvements over state-of-the-art baselines.
arXiv Detail & Related papers (2024-04-02T13:54:05Z) - MMAPS: End-to-End Multi-Grained Multi-Modal Attribute-Aware Product
Summarization [93.5217515566437]
Multi-modal Product Summarization (MPS) aims to increase customers' desire to purchase by highlighting product characteristics.
Existing MPS methods can produce promising results, but they still lack end-to-end product summarization.
We propose an end-to-end multi-modal attribute-aware product summarization method (MMAPS) for generating high-quality product summaries in e-commerce.
arXiv Detail & Related papers (2023-08-22T11:00:09Z) - FM-ViT: Flexible Modal Vision Transformers for Face Anti-Spoofing [88.6654909354382]
We present a pure transformer-based framework, dubbed the Flexible Modal Vision Transformer (FM-ViT), for face anti-spoofing.
FM-ViT can flexibly target any single-modal (i.e., RGB) attack scenario with the help of available multi-modal data.
Experiments demonstrate that the single model trained based on FM-ViT can not only flexibly evaluate different modal samples, but also outperforms existing single-modal frameworks by a large margin.
arXiv Detail & Related papers (2023-05-05T04:28:48Z) - Align and Attend: Multimodal Summarization with Dual Contrastive Losses [57.83012574678091]
The goal of multimodal summarization is to extract the most important information from different modalities to form output summaries.
Existing methods fail to leverage the temporal correspondence between different modalities and ignore the intrinsic correlation between different samples.
We introduce Align and Attend Multimodal Summarization (A2Summ), a unified multimodal transformer-based model which can effectively align and attend the multimodal input.
arXiv Detail & Related papers (2023-03-13T17:01:42Z) - Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product
Retrieval [152.3504607706575]
This research aims to conduct weakly-supervised multi-modal instance-level product retrieval for fine-grained product categories.
We first contribute the Product1M dataset and define two practical instance-level retrieval tasks.
We then train a more effective cross-modal model that adaptively incorporates key concept information from the multi-modal data.
arXiv Detail & Related papers (2022-06-17T15:40:45Z)