M6: A Chinese Multimodal Pretrainer
- URL: http://arxiv.org/abs/2103.00823v2
- Date: Tue, 2 Mar 2021 06:03:16 GMT
- Title: M6: A Chinese Multimodal Pretrainer
- Authors: Junyang Lin, Rui Men, An Yang, Chang Zhou, Ming Ding, Yichang Zhang,
Peng Wang, Ang Wang, Le Jiang, Xianyan Jia, Jie Zhang, Jianwei Zhang, Xu Zou,
Zhikang Li, Xiaodong Deng, Jie Liu, Jinbao Xue, Huiling Zhou, Jianxin Ma, Jin
Yu, Yong Li, Wei Lin, Jingren Zhou, Jie Tang, Hongxia Yang
- Abstract summary: We construct the largest dataset for multimodal pretraining in Chinese, consisting of over 1.9 TB of images and 292 GB of texts.
We propose a cross-modal pretraining method called M6, referring to Multi-Modality to Multi-Modality Multitask Mega-transformer.
- Score: 66.51132343067458
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we construct the largest dataset for multimodal pretraining in Chinese, consisting of over 1.9 TB of images and 292 GB of texts that cover a wide range of domains. We propose a cross-modal pretraining method called M6, short for Multi-Modality to Multi-Modality Multitask Mega-transformer, for unified pretraining on both single-modality and multi-modality data. We scale the model up to 10 billion and 100 billion parameters and build the largest pretrained model in Chinese. We apply the model to a series of downstream applications and demonstrate its outstanding performance against strong baselines. Furthermore, we specifically design a downstream task of text-guided image generation and show that the finetuned M6 can create high-quality, high-resolution images with abundant detail.
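As a concrete illustration of "unified pretraining on both single-modality and multi-modality data", here is a toy PyTorch sketch that projects image patches into the embedding space, prepends them to text token embeddings, and trains a single transformer with a prefix-style language-modeling loss over the text positions. The patch front end, the prefix-LM mask, and all dimensions are illustrative assumptions, not the released M6 architecture.

```python
# Toy sketch (assumptions, not the released M6): image patches are linearly
# projected, prepended to text token embeddings, and the joint sequence is fed
# to one transformer trained with a prefix-LM loss on the text positions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyUnifiedTransformer(nn.Module):
    def __init__(self, vocab_size=32000, d_model=256, n_layers=4, n_heads=4,
                 patch_dim=3 * 16 * 16, max_len=512):
        super().__init__()
        self.patch_proj = nn.Linear(patch_dim, d_model)   # image patches -> embeddings
        self.tok_emb = nn.Embedding(vocab_size, d_model)  # text tokens -> embeddings
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patches, text_ids):
        # patches: (B, P, patch_dim); text_ids: (B, T)
        img = self.patch_proj(patches)
        txt = self.tok_emb(text_ids)
        x = torch.cat([img, txt], dim=1)
        pos = torch.arange(x.size(1), device=x.device)
        x = x + self.pos_emb(pos)
        # prefix-LM mask: the image prefix is fully visible, text attends causally
        i = torch.arange(x.size(1), device=x.device)
        allowed = (i[None, :] <= i[:, None]) | (i[None, :] < img.size(1))
        h = self.encoder(x, mask=~allowed)
        return self.lm_head(h[:, img.size(1):])           # logits for text positions only

model = ToyUnifiedTransformer()
patches = torch.randn(2, 64, 3 * 16 * 16)                 # 2 images, 64 patches each
text = torch.randint(0, 32000, (2, 32))                   # paired captions
logits = model(patches, text[:, :-1])                     # predict the next text token
loss = F.cross_entropy(logits.reshape(-1, 32000), text[:, 1:].reshape(-1))
loss.backward()
```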
Related papers
- Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models [111.97026994761254]
Mixture-of-Transformers (MoT) is a sparse multi-modal transformer architecture.
MoT decouples the non-embedding parameters of the model by modality (a toy sketch of this decoupling follows this entry).
We evaluate MoT across multiple settings and model scales.
arXiv Detail & Related papers (2024-11-07T18:59:06Z)
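The parameter decoupling described for Mixture-of-Transformers can be pictured with a toy block in which self-attention runs globally over the mixed token sequence while each token's feed-forward pass uses weights owned by its modality. Only the feed-forward weights are decoupled here for brevity; module names and sizes are assumptions, not the authors' implementation.

```python
# Toy modality-decoupled block in the spirit of Mixture-of-Transformers:
# attention is global over all tokens, FFN weights are per-modality.
import torch
import torch.nn as nn

class ModalityDecoupledBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_modalities=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffns = nn.ModuleList([                      # one FFN per modality
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_modalities)
        ])
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, modality_ids):
        # x: (B, S, d_model); modality_ids: (B, S) integer tag per token
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)                 # global attention over all modalities
        x = x + attn_out
        h = self.norm2(x)
        out = torch.zeros_like(h)
        for m, ffn in enumerate(self.ffns):              # route tokens to their modality's FFN
            sel = modality_ids == m
            out[sel] = ffn(h[sel])
        return x + out

block = ModalityDecoupledBlock()
tokens = torch.randn(2, 10, 256)
mods = torch.randint(0, 2, (2, 10))                      # 0 = text, 1 = image (toy tags)
y = block(tokens, mods)                                  # (2, 10, 256)
```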
- EMMeTT: Efficient Multimodal Machine Translation Training [26.295981183965566]
We propose a joint multimodal training regime for Speech-LLMs that includes automatic speech translation (AST).
To handle joint multimodal training, we propose a novel training framework called EMMeTT.
The resulting multimodal translation model produces strong text and speech translation results at the same time.
arXiv Detail & Related papers (2024-09-20T14:03:23Z)
- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training [103.72844619581811]
We build performant Multimodal Large Language Models (MLLMs).
In particular, we study the importance of various architecture components and data choices.
We demonstrate that a careful mix of image-caption, interleaved image-text, and text-only data is crucial for large-scale multimodal pre-training (a toy data-mixing sketch follows this entry).
arXiv Detail & Related papers (2024-03-14T17:51:32Z)
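To make the data-mixing point concrete, here is a toy sampler that draws pretraining documents from image-caption, interleaved image-text, and text-only pools according to fixed weights. The pools, weights, and function name are placeholders and do not reproduce the ratios studied in MM1.

```python
# Toy pretraining-data mixer: sample each document's source according to
# fixed mixing weights (placeholder values, not the ratios from the paper).
import random

SOURCES = {
    "image_caption": ["<img> a dog running on a beach"],        # stand-in documents
    "interleaved":   ["intro paragraph <img> more text <img> closing text"],
    "text_only":     ["a plain text-only document"],
}
WEIGHTS = {"image_caption": 0.4, "interleaved": 0.4, "text_only": 0.2}

def sample_pretraining_batch(batch_size=4, seed=0):
    rng = random.Random(seed)
    names = list(WEIGHTS)
    probs = [WEIGHTS[n] for n in names]
    batch = []
    for _ in range(batch_size):
        source = rng.choices(names, weights=probs, k=1)[0]      # pick a source
        batch.append((source, rng.choice(SOURCES[source])))     # pick a document from it
    return batch

print(sample_pretraining_batch())
```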
- M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining [26.262677587795242]
We introduce a comprehensive bilingual (Chinese-English) dataset, BM-6B, with over 6 billion image-text pairs.
To handle a dataset of this scale, we propose a novel grouped aggregation approach for image-text contrastive loss computation (the underlying loss is sketched after this entry).
We pretrain a series of bilingual image-text foundation models with an enhanced fine-grained understanding ability on BM-6B.
arXiv Detail & Related papers (2024-01-29T05:43:33Z)
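For reference, the quantity that M2-Encoder's grouped aggregation computes efficiently at scale is the standard symmetric image-text contrastive (CLIP-style InfoNCE) loss. The sketch below shows only the plain, ungrouped formulation; the grouping scheme itself is not reproduced.

```python
# Plain symmetric image-text contrastive (InfoNCE) loss; the grouped
# aggregation from M2-Encoder is an efficiency scheme on top of this.
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # img_emb, txt_emb: (B, D) embeddings of paired images and texts
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature     # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0))          # matched pairs sit on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)

loss = image_text_contrastive_loss(torch.randn(8, 64), torch.randn(8, 64))
```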
- Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages [76.35234803589412]
MPM is an effective paradigm for training large multimodal models in non-English languages.
We build the VisCPM large multimodal models for image-to-text and text-to-image generation, which achieve state-of-the-art open-source performance in Chinese.
arXiv Detail & Related papers (2023-08-23T09:55:41Z)
- M5Product: A Multi-modal Pretraining Benchmark for E-commercial Product Downstream Tasks [94.80043324367858]
We contribute a large-scale dataset, named M5Product, which consists of over 6 million multimodal pairs.
M5Product contains rich information across multiple modalities, including image, text, table, video, and audio.
arXiv Detail & Related papers (2021-09-09T13:50:22Z)
- InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining [76.32065400614162]
We propose a novel model, InterBERT (BERT for Interaction), the first model in our M6 series of multimodal pretraining methods.
The model has a strong capability for modeling the interaction between the information flows of different modalities (a generic cross-attention sketch follows this entry).
We also propose a large-scale dataset for multi-modal pretraining in Chinese and develop Chinese InterBERT, the first Chinese multi-modal pretrained model.
arXiv Detail & Related papers (2020-03-30T03:13:22Z)
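The interaction between the information flows of different modalities mentioned for InterBERT is commonly realized with cross-attention between the two streams. The toy block below shows that generic pattern; it is illustrative only and not InterBERT's exact architecture.

```python
# Generic cross-modal attention: each stream queries the other.
# Illustrative sketch; not the InterBERT implementation.
import torch
import torch.nn as nn

class CrossModalInteraction(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.txt_attends_img = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.img_attends_txt = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, txt, img):
        # txt: (B, T, D) text features; img: (B, R, D) image-region features
        txt_out, _ = self.txt_attends_img(txt, img, img)  # text queries image
        img_out, _ = self.img_attends_txt(img, txt, txt)  # image queries text
        return txt + txt_out, img + img_out               # residual update per stream

xattn = CrossModalInteraction()
txt, img = torch.randn(2, 16, 256), torch.randn(2, 36, 256)
txt_fused, img_fused = xattn(txt, img)                    # (2, 16, 256), (2, 36, 256)
```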