M5Product: A Multi-modal Pretraining Benchmark for E-commercial Product
Downstream Tasks
- URL: http://arxiv.org/abs/2109.04275v1
- Date: Thu, 9 Sep 2021 13:50:22 GMT
- Title: M5Product: A Multi-modal Pretraining Benchmark for E-commercial Product
Downstream Tasks
- Authors: Xiao Dong, Xunlin Zhan, Yangxin Wu, Yunchao Wei, Xiaoyong Wei, Minlong
Lu, Xiaodan Liang
- Abstract summary: We contribute a large-scale dataset, named M5Product, which consists of over 6 million multimodal pairs.
M5Product contains rich information from multiple modalities, including image, text, table, video and audio.
- Score: 94.80043324367858
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we aim to advance research on multi-modal pre-training for
E-commerce and contribute a large-scale dataset, named M5Product, which
consists of over 6 million multi-modal pairs covering more than 6,000
categories and 5,000 attributes. Existing multi-modal datasets are generally
limited in either scale or modality diversity. In contrast, M5Product stands
out in the following respects. First, the M5Product dataset is 500 times
larger than the public multi-modal dataset with the same number of modalities
and nearly twice as large as the largest available text-image cross-modal
dataset. Second, the dataset contains rich information from multiple
modalities, including image, text, table, video and audio, where each modality
captures a different view of the semantics (e.g. category, attributes,
affordance, brand, preference) and complements the others. Third, to better
reflect real-world conditions, a portion of M5Product contains incomplete
modality pairs and noise, and the data follow a long-tailed distribution.
Finally, we provide a baseline model, M5-MMT, which makes a first attempt to
integrate different modality configurations into a unified model for feature
fusion, addressing the key challenge of semantic alignment. We also evaluate
various state-of-the-art multi-modal pre-training methods to benchmark their
ability to learn from unlabeled data under different numbers of modalities on
the M5Product dataset. We conduct extensive experiments on four downstream
tasks and report several interesting findings about these modalities. Our
dataset and related code are available at
https://xiaodongsuper.github.io/M5Product_dataset.
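Two practical points in the abstract lend themselves to a small illustration: records in which some of the five modalities are missing, and a unified fusion model (M5-MMT) that encodes whatever modalities are present. The Python sketch below is only a minimal illustration of that setup, not the released M5-MMT code; the class names, feature shapes and pooling choice are hypothetical assumptions made for the example.

```python
# Hypothetical sketch (not the released M5-MMT code): one way to represent an
# M5Product-style record in which some of the five modalities may be missing,
# and to fuse whatever is present into a single product embedding.
from dataclasses import dataclass
from typing import Optional

import torch
import torch.nn as nn


@dataclass
class ProductRecord:
    # Pre-extracted per-modality features; None marks a missing modality.
    image: Optional[torch.Tensor] = None   # (n_image_tokens, dim)
    text: Optional[torch.Tensor] = None    # (n_text_tokens, dim)
    table: Optional[torch.Tensor] = None   # (n_attribute_tokens, dim)
    video: Optional[torch.Tensor] = None   # (n_frame_tokens, dim)
    audio: Optional[torch.Tensor] = None   # (n_audio_tokens, dim)


class UnifiedFusionEncoder(nn.Module):
    """Concatenate whichever modality tokens exist, add a learned modality
    embedding, and run a shared Transformer encoder over the joint sequence."""

    MODALITIES = ["image", "text", "table", "video", "audio"]

    def __init__(self, dim: int = 256, depth: int = 4, heads: int = 8):
        super().__init__()
        self.modality_emb = nn.Embedding(len(self.MODALITIES), dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, record: ProductRecord) -> torch.Tensor:
        tokens = []
        for idx, name in enumerate(self.MODALITIES):
            feats = getattr(record, name)
            if feats is None:          # tolerate incomplete modality pairs
                continue
            tokens.append(feats + self.modality_emb.weight[idx])
        joint = torch.cat(tokens, dim=0).unsqueeze(0)  # (1, n_tokens, dim)
        fused = self.encoder(joint)                    # (1, n_tokens, dim)
        return fused.mean(dim=1)                       # pooled product embedding


# Usage: a record with only image and text present still yields an embedding.
record = ProductRecord(image=torch.randn(9, 256), text=torch.randn(16, 256))
embedding = UnifiedFusionEncoder()(record)
print(embedding.shape)  # torch.Size([1, 256])
```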
Related papers
- Multimodal Banking Dataset: Understanding Client Needs through Event
Sequences [41.470088044942756]
We present the industrial-scale publicly available multimodal banking dataset, MBD, that contains more than 1.5M corporate clients.
All entries are properly anonymized from real proprietary bank data.
We provide numerical results that demonstrate the superiority of our multi-modal baselines over single-modal techniques for each task.
arXiv Detail & Related papers (2024-09-26T07:07:08Z) - Adapting Segment Anything Model to Multi-modal Salient Object Detection with Semantic Feature Fusion Guidance [15.435695491233982]
We propose a novel framework to explore and exploit the powerful feature representation and zero-shot generalization ability of the Segment Anything Model (SAM) for multi-modal salient object detection (SOD).
We develop SAM with semantic feature fusion guidance (Sammese).
In the image encoder, a multi-modal adapter is proposed to adapt the single-modal SAM to multi-modal information. Specifically, in the mask decoder, a semantic-geometric
arXiv Detail & Related papers (2024-08-27T13:47:31Z) - U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: An Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z) - MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal
Contributions in Vision and Language Models & Tasks [20.902155496422417]
Vision and language models tend to exploit unreliable cues in individual modalities instead of focusing on the relevant information in each modality.
We propose MM-SHAP, a performance-agnostic multimodality score based on Shapley values; a minimal sketch of the Shapley idea appears after this list.
arXiv Detail & Related papers (2022-12-15T21:41:06Z) - Benchmarking Multimodal Variational Autoencoders: CdSprites+ Dataset and Toolkit [6.187270874122921]
We propose a toolkit for systematic multimodal VAE training and comparison.
We present a disentangled bimodal dataset designed to comprehensively evaluate the joint generation and cross-generation capabilities.
arXiv Detail & Related papers (2022-09-07T10:26:28Z) - Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge
Graph Completion [112.27103169303184]
Multimodal Knowledge Graphs (MKGs) organize visual-text factual knowledge.
MKGformer can obtain SOTA performance on four datasets of multimodal link prediction, multimodal RE, and multimodal NER.
arXiv Detail & Related papers (2022-05-04T23:40:04Z) - CommerceMM: Large-Scale Commerce MultiModal Representation Learning with
Omni Retrieval [30.607369837039904]
CommerceMM is a multimodal model capable of providing a diverse and granular understanding of commerce topics associated with a piece of content.
We propose another 9 novel cross-modal and cross-pair retrieval tasks, called Omni-Retrieval pre-training.
Our model achieves state-of-the-art performance on 7 commerce-related downstream tasks after fine-tuning.
arXiv Detail & Related papers (2022-02-15T08:23:59Z) - Product1M: Towards Weakly Supervised Instance-Level Product Retrieval
via Cross-modal Pretraining [108.86502855439774]
We investigate a more realistic setting that aims to perform weakly-supervised multi-modal instance-level product retrieval.
We contribute Product1M, one of the largest multi-modal cosmetic datasets for real-world instance-level retrieval.
We propose a novel model named Cross-modal contrAstive Product Transformer for instance-level prodUct REtrieval (CAPTURE)
arXiv Detail & Related papers (2021-07-30T12:11:24Z) - M6: A Chinese Multimodal Pretrainer [66.51132343067458]
We construct the largest dataset for multimodal pretraining in Chinese, which consists of over 1.9TB of images and 292GB of text.
We propose a cross-modal pretraining method called M6, referring to Multi-Modality to Multi-Modality Multitask Mega-transformer.
arXiv Detail & Related papers (2021-03-01T07:46:27Z) - InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining [76.32065400614162]
We propose a novel model, InterBERT (BERT for Interaction), the first model in our M6 series of multimodal pretraining methods.
The model has a strong capability for modeling interactions between the information flows of different modalities.
We propose a large-scale dataset for multi-modal pretraining in Chinese and develop Chinese InterBERT, the first Chinese multi-modal pretrained model.
arXiv Detail & Related papers (2020-03-30T03:13:22Z)
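The MM-SHAP entry above scores how much each modality contributes to a model's prediction using Shapley values. The following Python sketch is only an illustration of that general idea, not the MM-SHAP implementation (which attributes contributions over transformer input tokens); `modality_shapley`, `score_fn` and `toy_score` are hypothetical names introduced here.

```python
# Hypothetical illustration of the Shapley idea behind an MM-SHAP-style score:
# a modality's contribution is the average change in the model's output when
# that modality is added to every possible subset of the remaining modalities.
from itertools import combinations
from math import factorial
from typing import Callable, Dict, FrozenSet, List


def modality_shapley(
    score_fn: Callable[[FrozenSet[str]], float],  # model score given the set of present modalities
    modalities: List[str],
) -> Dict[str, float]:
    n = len(modalities)
    values = {m: 0.0 for m in modalities}
    for m in modalities:
        others = [x for x in modalities if x != m]
        for k in range(len(others) + 1):
            for subset in combinations(others, k):
                s = frozenset(subset)
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                values[m] += weight * (score_fn(s | {m}) - score_fn(s))
    return values


# Toy usage with a made-up scoring function over image/text/table inputs.
def toy_score(present: FrozenSet[str]) -> float:
    base = {"image": 0.4, "text": 0.5, "table": 0.1}
    return sum(base[m] for m in present)


print(modality_shapley(toy_score, ["image", "text", "table"]))
```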