WanJuan: A Comprehensive Multimodal Dataset for Advancing English and
Chinese Large Models
- URL: http://arxiv.org/abs/2308.10755v3
- Date: Fri, 15 Sep 2023 09:52:14 GMT
- Title: WanJuan: A Comprehensive Multimodal Dataset for Advancing English and
Chinese Large Models
- Authors: Conghui He, Zhenjiang Jin, Chao Xu, Jiantao Qiu, Bin Wang, Wei Li,
Hang Yan, Jiaqi Wang, Dahua Lin
- Abstract summary: "Wan Juan" is a large-scale multimodal dataset composed of both Chinese and English data, collected from a wide range of web sources.
It was utilized in the training of InternLM, a model that demonstrated significant advantages in multi-dimensional evaluations when compared to models of a similar scale.
- Score: 69.96148259273065
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rise in popularity of ChatGPT and GPT-4 has significantly accelerated the
development of large models, leading to the creation of numerous impressive
large language models (LLMs) and multimodal large language models (MLLMs). These
cutting-edge models owe their remarkable performance to high-quality data.
However, the details of the training data used in leading paradigms are often
kept confidential. This lack of transparency, coupled with the scarcity of
open-source data, impedes further developments within the community. As a
response, this paper presents "Wan Juan", a large-scale multimodal dataset
composed of both Chinese and English data, collected from a wide range of web
sources. The dataset incorporates text, image-text, and video modalities, with
a total volume exceeding 2TB. It was utilized in the training of InternLM, a
model that demonstrated significant advantages in multi-dimensional evaluations
when compared to models of a similar scale. All data can be accessed at
https://opendatalab.org.cn/WanJuan1.0.
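For readers who download the release, the sketch below shows one way to iterate over a text shard and report basic statistics. It is a minimal sketch only: the local directory name, the *.jsonl layout, and the "content" field are assumptions and should be checked against the actual OpenDataLab distribution.

```python
import json
from pathlib import Path

# Minimal sketch for inspecting a downloaded WanJuan text shard.
# Assumptions (verify against the actual OpenDataLab release):
#   - the text portion is stored as JSON Lines (*.jsonl) files
#   - each record carries its text under a "content" key
DATA_DIR = Path("WanJuan1.0/nlp")  # hypothetical local path


def iter_records(data_dir: Path):
    """Yield parsed records from every .jsonl shard under data_dir."""
    for shard in sorted(data_dir.glob("*.jsonl")):
        with shard.open(encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    yield json.loads(line)


def summarize(data_dir: Path, limit: int = 10000) -> None:
    """Print record count and average text length over the first `limit` records."""
    count, total_chars = 0, 0
    for record in iter_records(data_dir):
        text = record.get("content", "")
        count += 1
        total_chars += len(text)
        if count >= limit:
            break
    if count:
        print(f"records sampled: {count}, avg length: {total_chars / count:.0f} chars")


if __name__ == "__main__":
    summarize(DATA_DIR)
```

Streaming the shards line by line keeps memory usage flat, which matters given that the full dataset exceeds 2TB.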
Related papers
- RedPajama: an Open Dataset for Training Large Language Models [80.74772646989423]
We identify three core data-related challenges that must be addressed to advance open-source language models.
These include (1) transparency in model development, including the data curation process, (2) access to large quantities of high-quality data, and (3) availability of artifacts and metadata for dataset curation and analysis.
We release RedPajama-V1, an open reproduction of the LLaMA training dataset, and RedPajama-V2, a massive web-only dataset consisting of raw, unfiltered text data together with quality signals and metadata.
arXiv Detail & Related papers (2024-11-19T09:35:28Z)
- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models [146.18107944503436]
Molmo is a new family of VLMs that are state-of-the-art in their class of openness.
Our key innovation is a novel, highly detailed image caption dataset collected entirely from human annotators.
We will be releasing all of our model weights, captioning and fine-tuning data, and source code in the near future.
arXiv Detail & Related papers (2024-09-25T17:59:51Z)
- CollectiveSFT: Scaling Large Language Models for Chinese Medical Benchmark with Collective Instructions in Healthcare [12.218718086529462]
This study focuses on the Comprehensive Medical Benchmark in Chinese (CMB).
We successfully trained a smaller base model to achieve scores comparable to larger models.
By integrating a wide range of instructional content, our approach addresses potential issues such as data quality inconsistencies.
arXiv Detail & Related papers (2024-07-29T05:00:48Z)
- X-LLaVA: Optimizing Bilingual Large Vision-Language Alignment [4.571088742209442]
We create a 91K English-Korean-Chinese multilingual, multimodal training dataset.
We develop a bilingual multimodal model that exhibits excellent performance in both Korean and English.
arXiv Detail & Related papers (2024-03-18T01:14:47Z)
- EXMODD: An EXplanatory Multimodal Open-Domain Dialogue dataset [20.445453185198186]
We propose a Multimodal Data Construction Framework (MDCF) to alleviate the significant human and resource expenditure in data collection.
MDCF automatically provides explanations for a given image and its corresponding dialogue, offering a degree of interpretability.
Experiments indicate a positive correlation between the model's ability to generate accurate understandings and high-quality responses.
arXiv Detail & Related papers (2023-10-17T03:28:29Z)
- Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants [65.47222691674074]
The Muffin framework employs pre-trained vision-language models to act as providers of visual signals.
The UniMM-Chat dataset explores the complementarities of datasets to generate 1.1M high-quality and diverse multimodal instructions.
arXiv Detail & Related papers (2023-10-01T12:35:18Z)
- Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages [76.35234803589412]
MPM is an effective paradigm for training large multimodal models in non-English languages.
We build large multimodal models VisCPM in image-to-text and text-to-image generation, which achieve state-of-the-art (open-source) performance in Chinese.
arXiv Detail & Related papers (2023-08-23T09:55:41Z)
- Enhancing Chat Language Models by Scaling High-quality Instructional Conversations [91.98516412612739]
We first provide a systematically designed, diverse, informative, large-scale dataset of instructional conversations, UltraChat.
Our objective is to capture the breadth of interactions that a human might have with an AI assistant.
We fine-tune a LLaMA model to create a powerful conversational model, UltraLLaMA.
arXiv Detail & Related papers (2023-05-23T16:49:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.