Related papers: MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens

MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens

URL: http://arxiv.org/abs/2406.11271v2
Date: Wed, 24 Jul 2024 02:59:40 GMT
Title: MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens
Authors: Anas Awadalla, Le Xue, Oscar Lo, Manli Shu, Hannah Lee, Etash Kumar Guha, Matt Jordan, Sheng Shen, Mohamed Awadalla, Silvio Savarese, Caiming Xiong, Ran Xu, Yejin Choi, Ludwig Schmidt,
Abstract summary: We release MINT-1T, the most extensive and diverse open-source Multimodal INTerleaved dataset to date. MINT-1T comprises one trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Our experiments show that LMMs trained on MINT-1T rival the performance of models trained on the previous leading dataset, OBELICS.
Score: 113.9621845919304
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal interleaved datasets featuring free-form interleaved sequences of images and text are crucial for training frontier large multimodal models (LMMs). Despite the rapid progression of open-source LMMs, there remains a pronounced scarcity of large-scale, diverse open-source multimodal interleaved datasets. In response, we introduce MINT-1T, the most extensive and diverse open-source Multimodal INTerleaved dataset to date. MINT-1T comprises one trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. As scaling multimodal interleaved datasets requires substantial engineering effort, sharing the data curation process and releasing the dataset greatly benefits the community. Our experiments show that LMMs trained on MINT-1T rival the performance of models trained on the previous leading dataset, OBELICS. Our data and code will be released at https://github.com/mlfoundations/MINT-1T.

Related papers

Emerging Properties in Unified Multimodal Pretraining [32.856334401494145]
We introduce BAGEL, an open-source foundational model that supports multimodal understanding and generation.<n>BAGEL is a unified, decoder-only model pretrained on trillions of tokens curated from large-scale interleaved text, image, video, and web data.<n>It significantly outperforms open-source unified models in both multimodal generation and understanding across standard benchmarks.
arXiv Detail & Related papers (2025-05-20T17:59:30Z)
Oasis: One Image is All You Need for Multimodal Instruction Data Synthesis [19.75619888353222]
We propose a new method, Oasis, to synthesize high-quality multi-modal data with only images. Oasis breaks through traditional methods by prompting only images to the MLLMs. Our method features a delicate quality control method which ensures the data quality.
arXiv Detail & Related papers (2025-03-11T08:25:40Z)
MMBind: Unleashing the Potential of Distributed and Heterogeneous Data for Multimodal Learning in IoT [11.884646027921173]
We propose MMBind, a new framework for multimodal learning on distributed and heterogeneous IoT data. We demonstrate that data of different modalities observing similar events, even captured at different times and locations, can be effectively used for multimodal training.
arXiv Detail & Related papers (2024-11-18T23:34:07Z)
Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data [35.85909368345219]
We introduce Infinity-MM, a large-scale multimodal instruction dataset. We perform unified preprocessing, resulting in a dataset with over 40 million samples that ensures diversity and accuracy. We propose a synthetic instruction generation method based on a tagging system and open-source Vision-Language Models.
arXiv Detail & Related papers (2024-10-24T09:03:48Z)
InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning [58.7966588457529]
InfiMM-WebMath-40B is a high-quality dataset of interleaved image-text documents. It comprises 24 million web pages, 85 million associated image URLs, and 40 billion text tokens, all meticulously extracted and filtered from CommonCrawl. Our evaluations on text-only benchmarks show that, despite utilizing only 40 billion tokens, our dataset significantly enhances the performance of our 1.3B model. Our models set a new state-of-the-art among open-source models on multi-modal math benchmarks such as MathVerse and We-Math.
arXiv Detail & Related papers (2024-09-19T08:41:21Z)
NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks. We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities. We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z)
Time-MMD: Multi-Domain Multimodal Dataset for Time Series Analysis [40.44013652777716]
Time-MMD is the first multi-domain, multimodal time series dataset. MM-TSFlib is the first multimodal time-series forecasting library.
arXiv Detail & Related papers (2024-06-12T20:20:09Z)
WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models [69.96148259273065]
"Wan Juan" is a large-scale multimodal dataset composed of both Chinese and English data, collected from a wide range of web sources. It was utilized in the training of InternLM, a model that demonstrated significant advantages in multi-dimensional evaluations when compared to models of a similar scale.
arXiv Detail & Related papers (2023-08-21T14:40:48Z)
MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos [106.06278332186106]
Multimodal summarization with multimodal output (MSMO) has emerged as a promising research direction. Numerous limitations exist within existing public MSMO datasets. We have meticulously curated the textbfMMSum dataset.
arXiv Detail & Related papers (2023-06-07T07:43:11Z)
M5Product: A Multi-modal Pretraining Benchmark for E-commercial Product Downstream Tasks [94.80043324367858]
We contribute a large-scale dataset, named M5Product, which consists of over 6 million multimodal pairs. M5Product contains rich information of multiple modalities including image, text, table, video and audio.
arXiv Detail & Related papers (2021-09-09T13:50:22Z)
Improving Multimodal Fusion with Hierarchical Mutual Information Maximization for Multimodal Sentiment Analysis [16.32509144501822]
We propose a framework named MultiModal InfoMax (MMIM), which hierarchically maximizes the Mutual Information (MI) in unimodal input pairs. The framework is jointly trained with the main task (MSA) to improve the performance of the downstream MSA task.
arXiv Detail & Related papers (2021-09-01T14:45:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.