M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale
Efficient Pretraining
- URL: http://arxiv.org/abs/2401.15896v2
- Date: Sun, 4 Feb 2024 04:30:07 GMT
- Title: M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale
Efficient Pretraining
- Authors: Qingpei Guo, Furong Xu, Hanxiao Zhang, Wang Ren, Ziping Ma, Lin Ju,
Jian Wang, Jingdong Chen, Ming Yang
- Abstract summary: We introduce a comprehensive bilingual (Chinese-English) dataset BM-6B with over 6 billion image-text pairs.
To handle a dataset of this scale, we propose a novel grouped aggregation approach for image-text contrastive loss computation.
On BM-6B, we pretrain a series of bilingual image-text foundation models with enhanced fine-grained understanding.
- Score: 26.262677587795242
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language foundation models like CLIP have revolutionized the field of
artificial intelligence. Nevertheless, vision-language models (VLMs) that support multiple languages, e.g., both Chinese and English, have lagged behind due to the relative scarcity of large-scale pretraining datasets. To address this, we introduce a comprehensive
bilingual (Chinese-English) dataset BM-6B with over 6 billion image-text pairs,
aimed at enhancing multimodal foundation models so they understand images well in both languages. To handle a dataset of this scale, we propose a novel grouped
aggregation approach for image-text contrastive loss computation, which reduces
the communication overhead and GPU memory demands significantly, facilitating a
60% increase in training speed. We pretrain a series of bilingual image-text
foundation models with enhanced fine-grained understanding on BM-6B. The resulting models, dubbed $M^2$-Encoders (pronounced "M-Square"), set new benchmarks in both languages for multimodal retrieval and classification tasks. Notably, our largest $M^2$-Encoder-10B model has achieved top-1 accuracies of
88.5% on ImageNet and 80.7% on ImageNet-CN under a zero-shot classification
setting, surpassing previously reported SoTA methods by 2.2% and 21.1%,
respectively. The $M^2$-Encoder series represents one of the most comprehensive
bilingual image-text foundation models to date, so we are making it available
to the research community for further exploration and development.
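The grouped aggregation approach is only described at a high level in the abstract, so the snippet below is a minimal sketch of one way such a scheme could work, assuming the idea is to fold the contrastive (InfoNCE) loss over groups of text features rather than materializing the full batch-by-batch similarity matrix at once. The function name, the grouping strategy, and the single (image-to-text) direction are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a grouped contrastive-loss computation.
# Text features are processed in groups, and each group's logits are folded
# into a running log-sum-exp, so peak memory is O(N * group_size) rather
# than O(N^2). Names and grouping strategy are assumptions for illustration.
import torch


def grouped_infonce(img_feats, txt_feats, temperature=0.07, group_size=1024):
    """Image-to-text InfoNCE computed group-by-group over text features.

    img_feats, txt_feats: (N, D) L2-normalized embeddings; row i of each
    tensor is a matched image-text pair.
    """
    n = img_feats.size(0)
    # Positive-pair logits (diagonal of the full similarity matrix).
    pos = (img_feats * txt_feats).sum(dim=-1) / temperature
    # Running log-sum-exp over all candidate texts seen so far.
    lse = torch.full((n,), float("-inf"), device=img_feats.device)

    for start in range(0, n, group_size):
        group = txt_feats[start:start + group_size]        # (G, D)
        logits = img_feats @ group.t() / temperature       # (N, G)
        # Fold this group's logits into the running log-sum-exp.
        lse = torch.logaddexp(lse, torch.logsumexp(logits, dim=-1))

    # InfoNCE: negative log-softmax of the positive logit over all candidates.
    return (lse - pos).mean()
```

In a distributed run, the same accumulation would presumably be applied to features gathered across GPUs and repeated symmetrically for the text-to-image direction; that is where the reported communication and GPU-memory savings would come from.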
Related papers
- LowCLIP: Adapting the CLIP Model Architecture for Low-Resource Languages in Multimodal Image Retrieval Task [0.0]
This research explores the development of vision-language models for image retrieval in low-resource languages, specifically Azerbaijani.
We integrated the CLIP model architecture and employed several techniques to balance computational efficiency with performance.
Our study found that models like EfficientNet0 and Tiny Swin Transformer perform best on the datasets they were trained on.
arXiv Detail & Related papers (2024-08-25T18:10:16Z)
- VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC).
This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions.
In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z)
- Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages [76.35234803589412]
MPM is an effective training paradigm for training large multimodal models in non-English languages.
We build large multimodal models VisCPM in image-to-text and text-to-image generation, which achieve state-of-the-art (open-source) performance in Chinese.
arXiv Detail & Related papers (2023-08-23T09:55:41Z)
- mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs [50.17767479660832]
Vision-language models (Vision-LLMs) align pretrained image encoders with (frozen) large language models (LLMs) and post-hoc condition the LLMs to 'understand' the image input.
We present mBLIP, the first Vision-LLM leveraging multilingual LLMs, which we obtain in a computationally efficient manner on consumer-level hardware.
arXiv Detail & Related papers (2023-07-13T17:51:58Z)
- UniBoost: Unsupervised Unimodal Pre-training for Boosting Zero-shot Vision-Language Tasks [60.46473247205654]
Using large-scale unsupervised unimodal models as pre-training can enhance the zero-shot performance of image-text pair models.
Our experiments show that unimodal pre-training outperforms state-of-the-art CLIP-based models.
arXiv Detail & Related papers (2023-06-07T18:26:22Z)
- LAION-5B: An open large-scale dataset for training next generation image-text models [16.129935376579326]
We present LAION-5B, a dataset consisting of 5.85 billion CLIP-filtered image-text pairs, of which 2.32B contain English language.
We show successful replication and fine-tuning of foundational models like CLIP, GLIDE and Stable Diffusion using the dataset.
We also provide several nearest neighbor indices, an improved web-interface for dataset exploration and subset generation.
arXiv Detail & Related papers (2022-10-16T00:08:18Z)
- Towards Making the Most of Multilingual Pretraining for Zero-Shot Neural Machine Translation [74.158365847236]
SixT++ is a strong many-to-English NMT model that supports 100 source languages but is trained once with a parallel dataset from only six source languages.
It significantly outperforms CRISS and m2m-100, two strong multilingual NMT systems, with an average gain of 7.2 and 5.0 BLEU respectively.
arXiv Detail & Related papers (2021-10-16T10:59:39Z)
- MURAL: Multimodal, Multitask Retrieval Across Languages [14.323816604663053]
MURAL is a dual encoder that solves two tasks: image-text matching and translation pair matching.
By incorporating billions of translation pairs, MURAL extends ALIGN (Jia et al., PMLR'21), a state-of-the-art dual encoder learned from 1.8 billion noisy image-text pairs.
It considerably improves performance on under-resourced languages, showing that text-text learning can overcome a paucity of image-caption examples for these languages.
arXiv Detail & Related papers (2021-09-10T22:26:05Z)
- Contrastive Language-Image Pre-training for the Italian Language [4.804798944613199]
We present the first CLIP model for the Italian Language (CLIP-Italian) trained on more than 1.4 million image-text pairs.
Results show that CLIP-Italian outperforms the multilingual CLIP model on the tasks of image retrieval and zero-shot classification.
arXiv Detail & Related papers (2021-08-19T13:53:47Z)
- UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)