M5Product: A Multi-modal Pretraining Benchmark for E-commercial Product
Downstream Tasks
- URL: http://arxiv.org/abs/2109.04275v1
- Date: Thu, 9 Sep 2021 13:50:22 GMT
- Title: M5Product: A Multi-modal Pretraining Benchmark for E-commercial Product
Downstream Tasks
- Authors: Xiao Dong, Xunlin Zhan, Yangxin Wu, Yunchao Wei, Xiaoyong Wei, Minlong
Lu, Xiaodan Liang
- Abstract summary: We contribute a large-scale dataset, named M5Product, which consists of over 6 million multimodal pairs.
M5Product contains rich information from multiple modalities, including image, text, table, video and audio.
- Score: 94.80043324367858
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we aim to advance research on multi-modal pre-training for
E-commerce and contribute a large-scale dataset, named M5Product, which
consists of over 6 million multi-modal pairs covering more than 6,000
categories and 5,000 attributes. Existing multi-modal datasets are generally
limited in either scale or modality diversity. In contrast, M5Product stands
out in the following respects. First, the M5Product dataset is 500 times
larger than the public multi-modal dataset with the same number of modalities
and nearly twice as large as the largest available text-image cross-modal
dataset. Second, the dataset contains rich information from multiple
modalities, including image, text, table, video and audio, where each modality
captures a different view of the semantics (e.g. category, attributes,
affordance, brand, preference) and complements the others. Third, to better
reflect real-world conditions, a portion of M5Product contains incomplete
modality pairs and noise, and the data follow a long-tailed distribution.
Finally, we provide a baseline model, M5-MMT, which makes a first attempt to
integrate different modality configurations into a unified model for feature
fusion, addressing the key challenge of semantic alignment. We also evaluate
various state-of-the-art multi-modal pre-training methods to benchmark their
ability to learn from unlabeled data under different numbers of modalities on
the M5Product dataset. We conduct extensive experiments on four downstream
tasks and report several interesting findings about these modalities. Our
dataset and related code are available at
https://xiaodongsuper.github.io/M5Product_dataset.
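Two practical points in the abstract lend themselves to a small illustration: records in which some of the five modalities are missing, and a unified fusion model (M5-MMT) that encodes whatever modalities are present. The Python sketch below is only a minimal illustration of that setup, not the released M5-MMT code; the class names, feature shapes and pooling choice are hypothetical assumptions made for the example.

```python
# Hypothetical sketch (not the released M5-MMT code): one way to represent an
# M5Product-style record in which some of the five modalities may be missing,
# and to fuse whatever is present into a single product embedding.
from dataclasses import dataclass
from typing import Optional

import torch
import torch.nn as nn


@dataclass
class ProductRecord:
    # Pre-extracted per-modality features; None marks a missing modality.
    image: Optional[torch.Tensor] = None   # (n_image_tokens, dim)
    text: Optional[torch.Tensor] = None    # (n_text_tokens, dim)
    table: Optional[torch.Tensor] = None   # (n_attribute_tokens, dim)
    video: Optional[torch.Tensor] = None   # (n_frame_tokens, dim)
    audio: Optional[torch.Tensor] = None   # (n_audio_tokens, dim)


class UnifiedFusionEncoder(nn.Module):
    """Concatenate whichever modality tokens exist, add a learned modality
    embedding, and run a shared Transformer encoder over the joint sequence."""

    MODALITIES = ["image", "text", "table", "video", "audio"]

    def __init__(self, dim: int = 256, depth: int = 4, heads: int = 8):
        super().__init__()
        self.modality_emb = nn.Embedding(len(self.MODALITIES), dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, record: ProductRecord) -> torch.Tensor:
        tokens = []
        for idx, name in enumerate(self.MODALITIES):
            feats = getattr(record, name)
            if feats is None:          # tolerate incomplete modality pairs
                continue
            tokens.append(feats + self.modality_emb.weight[idx])
        joint = torch.cat(tokens, dim=0).unsqueeze(0)  # (1, n_tokens, dim)
        fused = self.encoder(joint)                    # (1, n_tokens, dim)
        return fused.mean(dim=1)                       # pooled product embedding


# Usage: a record with only image and text present still yields an embedding.
record = ProductRecord(image=torch.randn(9, 256), text=torch.randn(16, 256))
embedding = UnifiedFusionEncoder()(record)
print(embedding.shape)  # torch.Size([1, 256])
```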
Related papers
- Multimodal Banking Dataset: Understanding Client Needs through Event
Sequences [41.470088044942756]
We present the industrial-scale publicly available multimodal banking dataset, MBD, that contains more than 1.5M corporate clients.
All entries are properly anonymized from real proprietary bank data.
We provide numerical results that demonstrate the superiority of our multi-modal baselines over single-modal techniques for each task.
arXiv Detail & Related papers (2024-09-26T07:07:08Z) - Adapting Segment Anything Model to Multi-modal Salient Object Detection with Semantic Feature Fusion Guidance [15.435695491233982]
We propose a novel framework to explore and exploit the powerful feature representation and zero-shot generalization ability of the Segment Anything Model (SAM) for multi-modal salient object detection (SOD).
We develop SAM with semantic feature fusion guidance (Sammese).
In the image encoder, a multi-modal adapter is proposed to adapt the single-modal SAM to multi-modal information. Specifically, in the mask decoder, a semantic-geometric
arXiv Detail & Related papers (2024-08-27T13:47:31Z) - U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: An Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z) - MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal
Contributions in Vision and Language Models & Tasks [20.902155496422417]
Vision and language models tend to exploit unreliable cues in individual modalities instead of focusing on the relevant information in each modality.
We propose MM-SHAP, a performance-agnostic multimodality score based on Shapley values; a minimal sketch of the Shapley idea appears after this list.
arXiv Detail & Related papers (2022-12-15T21:41:06Z) - Benchmarking Multimodal Variational Autoencoders: CdSprites+ Dataset and Toolkit [6.187270874122921]
We propose a toolkit for systematic multimodal VAE training and comparison.
We present a disentangled bimodal dataset designed to comprehensively evaluate the joint generation and cross-generation capabilities.
arXiv Detail & Related papers (2022-09-07T10:26:28Z) - Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge
Graph Completion [112.27103169303184]
Multimodal Knowledge Graphs (MKGs) organize visual-text factual knowledge.
MKGformer can obtain SOTA performance on four datasets of multimodal link prediction, multimodal RE, and multimodal NER.
arXiv Detail & Related papers (2022-05-04T23:40:04Z) - CommerceMM: Large-Scale Commerce MultiModal Representation Learning with
Omni Retrieval [30.607369837039904]
CommerceMM is a multimodal model capable of providing a diverse and granular understanding of commerce topics associated with a piece of content.
We propose another 9 novel cross-modal and cross-pair retrieval tasks, called Omni-Retrieval pre-training.
Our model achieves state-of-the-art performance on 7 commerce-related downstream tasks after fine-tuning.
arXiv Detail & Related papers (2022-02-15T08:23:59Z) - Product1M: Towards Weakly Supervised Instance-Level Product Retrieval
via Cross-modal Pretraining [108.86502855439774]
We investigate a more realistic setting that aims to perform weakly-supervised multi-modal instance-level product retrieval.
We contribute Product1M, one of the largest multi-modal cosmetic datasets for real-world instance-level retrieval.
We propose a novel model named Cross-modal contrAstive Product Transformer for instance-level prodUct REtrieval (CAPTURE)
arXiv Detail & Related papers (2021-07-30T12:11:24Z) - M6: A Chinese Multimodal Pretrainer [66.51132343067458]
We construct the largest dataset for multimodal pretraining in Chinese, which consists of over 1.9TB of images and 292GB of text.
We propose a cross-modal pretraining method called M6, referring to Multi-Modality to Multi-Modality Multitask Mega-transformer.
arXiv Detail & Related papers (2021-03-01T07:46:27Z) - InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining [76.32065400614162]
We propose a novel model, InterBERT (BERT for Interaction), the first model in our M6 series of multimodal pretraining methods.
The model has a strong capability for modeling interactions between the information flows of different modalities.
We propose a large-scale dataset for multi-modal pretraining in Chinese and develop Chinese InterBERT, the first Chinese multi-modal pretrained model.
arXiv Detail & Related papers (2020-03-30T03:13:22Z)
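The MM-SHAP entry above scores how much each modality contributes to a model's prediction using Shapley values. The following Python sketch is only an illustration of that general idea, not the MM-SHAP implementation (which attributes contributions over transformer input tokens); `modality_shapley`, `score_fn` and `toy_score` are hypothetical names introduced here.

```python
# Hypothetical illustration of the Shapley idea behind an MM-SHAP-style score:
# a modality's contribution is the average change in the model's output when
# that modality is added to every possible subset of the remaining modalities.
from itertools import combinations
from math import factorial
from typing import Callable, Dict, FrozenSet, List


def modality_shapley(
    score_fn: Callable[[FrozenSet[str]], float],  # model score given the set of present modalities
    modalities: List[str],
) -> Dict[str, float]:
    n = len(modalities)
    values = {m: 0.0 for m in modalities}
    for m in modalities:
        others = [x for x in modalities if x != m]
        for k in range(len(others) + 1):
            for subset in combinations(others, k):
                s = frozenset(subset)
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                values[m] += weight * (score_fn(s | {m}) - score_fn(s))
    return values


# Toy usage with a made-up scoring function over image/text/table inputs.
def toy_score(present: FrozenSet[str]) -> float:
    base = {"image": 0.4, "text": 0.5, "table": 0.1}
    return sum(base[m] for m in present)


print(modality_shapley(toy_score, ["image", "text", "table"]))
```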