M5Product: A Multi-modal Pretraining Benchmark for E-commercial Product Downstream Tasks
- URL: http://arxiv.org/abs/2109.04275v1
- Date: Thu, 9 Sep 2021 13:50:22 GMT
- Title: M5Product: A Multi-modal Pretraining Benchmark for E-commercial Product Downstream Tasks
- Authors: Xiao Dong, Xunlin Zhan, Yangxin Wu, Yunchao Wei, Xiaoyong Wei, Minlong Lu, Xiaodan Liang
- Abstract summary: We contribute a large-scale dataset, named M5Product, which consists of over 6 million multimodal pairs.
M5Product contains rich information across multiple modalities, including image, text, table, video, and audio.
- Score: 94.80043324367858
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we aim to advance research on multi-modal pre-training for E-commerce and contribute a large-scale dataset, named M5Product, which consists of over 6 million multimodal pairs covering more than 6,000 categories and 5,000 attributes. Existing multi-modal datasets are generally limited in either scale or modality diversity. In contrast, our M5Product stands out in the following aspects. First, the M5Product dataset is 500 times larger than the public multimodal dataset with the same number of modalities and nearly twice as large as the largest available text-image cross-modal dataset. Second, the dataset contains rich information across multiple modalities, including image, text, table, video, and audio, where each modality captures a different view of the semantic information (e.g., category, attributes, affordance, brand, preference) and complements the others. Third, to better reflect real-world conditions, a portion of M5Product contains incomplete modality pairs and noise, and the data follow a long-tailed distribution, which aligns well with real-world scenarios. Finally, we provide a baseline model, M5-MMT, which makes a first attempt to integrate different modality configurations into a unified model for feature fusion, addressing the considerable challenge of semantic alignment. We also evaluate various state-of-the-art multi-modal pre-training methods to benchmark their ability to learn from unlabeled data under different numbers of modalities on the M5Product dataset. We conduct extensive experiments on four downstream tasks and report several interesting findings about these modalities. Our dataset and related code are available at https://xiaodongsuper.github.io/M5Product_dataset.
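The baseline M5-MMT described above integrates whatever modalities are present into a unified model for feature fusion. As a rough, hypothetical illustration of that general idea (not the paper's actual M5-MMT architecture), the PyTorch sketch below projects pre-extracted per-modality features into a shared space and fuses them with a small transformer encoder, masking out modalities that are missing; the class name `SimpleMultiModalFusion`, the dimensions, and the masking scheme are assumptions for illustration only.

```python
# Illustrative sketch only: a generic multi-modal fusion transformer.
# NOT the authors' M5-MMT implementation; names and details are assumed.
import torch
import torch.nn as nn

MODALITIES = ["image", "text", "table", "video", "audio"]

class SimpleMultiModalFusion(nn.Module):
    """Fuse per-modality feature vectors with a small transformer encoder."""

    def __init__(self, input_dims, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        # One projection per modality into a shared embedding space.
        self.proj = nn.ModuleDict(
            {m: nn.Linear(input_dims[m], d_model) for m in MODALITIES}
        )
        # Learnable modality-type embeddings (analogous to segment embeddings).
        self.type_emb = nn.Parameter(torch.randn(len(MODALITIES), d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, features):
        # features: dict of modality name -> (batch, dim) tensor, or None if the
        # modality is missing for this batch (mimicking incomplete pairs).
        ref = next(v for v in features.values() if v is not None)
        batch, device = ref.size(0), ref.device
        tokens, pad = [], []
        for i, m in enumerate(MODALITIES):
            x = features.get(m)
            if x is None:
                tokens.append(torch.zeros(batch, self.type_emb.size(1), device=device))
                pad.append(torch.ones(batch, dtype=torch.bool, device=device))
            else:
                tokens.append(self.proj[m](x) + self.type_emb[i])
                pad.append(torch.zeros(batch, dtype=torch.bool, device=device))
        seq = torch.stack(tokens, dim=1)      # (batch, n_modalities, d_model)
        pad_mask = torch.stack(pad, dim=1)    # True = modality absent, ignore it
        fused = self.encoder(seq, src_key_padding_mask=pad_mask)
        keep = (~pad_mask).unsqueeze(-1).float()
        # Mean-pool only over the modalities that are actually present.
        return (fused * keep).sum(dim=1) / keep.sum(dim=1).clamp(min=1.0)
```

A real pre-training setup would add modality-specific backbones and self-supervised objectives (e.g., contrastive or masked prediction) on top of such a fused representation; the sketch only shows the fusion and missing-modality handling at a conceptual level.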
Related papers
- mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data [71.352883755806]
Multimodal embedding models have gained significant attention for their ability to map data from different modalities, such as text and images, into a unified representation space.
However, the limited labeled multimodal data often hinders embedding performance.
Recent approaches have leveraged data synthesis to address this problem, yet the quality of synthetic data remains a critical bottleneck.
arXiv Detail & Related papers (2025-02-12T15:03:33Z)
- MINIMA: Modality Invariant Image Matching [52.505282811925454]
We present MINIMA, a unified image matching framework for multiple cross-modal cases.
We scale up the modalities from cheap but rich RGB-only matching data by means of generative models.
With the resulting synthetic dataset, MD-syn, we can directly train any advanced matching pipeline on randomly selected modality pairs to obtain cross-modal ability.
arXiv Detail & Related papers (2024-12-27T02:39:50Z)
- Multimodal Difference Learning for Sequential Recommendation [5.243083216855681]
We argue that user interests and item relationships vary across different modalities.
We propose MDSRec, a novel Multimodal Difference Learning framework for Sequential Recommendation.
Results on five real-world datasets demonstrate the superiority of MDSRec over state-of-the-art baselines.
arXiv Detail & Related papers (2024-12-11T05:08:19Z)
- Multimodal Banking Dataset: Understanding Client Needs through Event Sequences [41.470088044942756]
We present MBD, an industrial-scale, publicly available multimodal banking dataset that covers more than 1.5M corporate clients.
All entries are properly anonymized from real proprietary bank data.
We provide numerical results that demonstrate the superiority of our multi-modal baselines over single-modal techniques for each task.
arXiv Detail & Related papers (2024-09-26T07:07:08Z)
- U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: an Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z)
- Benchmarking Multimodal Variational Autoencoders: CdSprites+ Dataset and Toolkit [6.187270874122921]
We propose a toolkit for systematic multimodal VAE training and comparison.
We present a disentangled bimodal dataset designed to comprehensively evaluate the joint generation and cross-generation capabilities.
arXiv Detail & Related papers (2022-09-07T10:26:28Z)
- Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge Graph Completion [112.27103169303184]
Multimodal Knowledge Graphs (MKGs) organize visual-text factual knowledge.
MKGformer achieves SOTA performance on four datasets covering multimodal link prediction, multimodal RE, and multimodal NER.
arXiv Detail & Related papers (2022-05-04T23:40:04Z)
- CommerceMM: Large-Scale Commerce MultiModal Representation Learning with Omni Retrieval [30.607369837039904]
CommerceMM is a multimodal model capable of providing a diverse and granular understanding of commerce topics associated with a piece of content.
We propose another 9 novel cross-modal and cross-pair retrieval tasks, called Omni-Retrieval pre-training.
Our model achieves state-of-the-art performance on 7 commerce-related downstream tasks after fine-tuning.
arXiv Detail & Related papers (2022-02-15T08:23:59Z)
- InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining [76.32065400614162]
We propose a novel model, namely InterBERT (BERT for Interaction), which is the first model in our M6 series of multimodal pretraining methods.
The model has a strong capability for modeling interactions between the information flows of different modalities.
We also propose a large-scale dataset for multi-modal pretraining in Chinese and develop Chinese InterBERT, the first Chinese multi-modal pretrained model.
arXiv Detail & Related papers (2020-03-30T03:13:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.