Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training
- URL: http://arxiv.org/abs/2602.07824v1
- Date: Sun, 08 Feb 2026 05:06:34 GMT
- Title: Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training
- Authors: Yiwei Qin, Zhen Huang, Tiantian Mi, Weiye Si, Chenyang Zhou, Qipeng Guo, Siyuan Feng, Pengfei Liu
- Abstract summary: We introduce Data Darwinism, a ten-level taxonomy that conceptualizes data-model co-evolution. We validate this on scientific literature by constructing Darwin-Science, a 900B-token corpus. We release the Darwin-Science corpus and daVinci-origin models to enable principled, co-evolutionary development.
- Score: 39.148751989348774
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data quality determines foundation model performance, yet systematic processing frameworks are lacking. We introduce Data Darwinism, a ten-level taxonomy (L0-L9) that conceptualizes data-model co-evolution: advanced models produce superior data for next-generation systems. We validate this on scientific literature by constructing Darwin-Science, a 900B-token corpus (L0-L5). We identify a learnability gap in raw scientific text, which we bridge via L4 (Generative Refinement) and L5 (Cognitive Completion) using frontier LLMs to explicate reasoning and terminology. To ensure rigorous attribution, we pre-train daVinci-origin-3B/7B models from scratch, excluding scientific content to create contamination-free baselines. After 600B tokens of continued pre-training, Darwin-Science outperforms baselines by +2.12 (3B) and +2.95 (7B) points across 20+ benchmarks, rising to +5.60 and +8.40 points on domain-aligned tasks. Systematic progression to L5 yields a +1.36 total gain, confirming that higher-level processing unlocks latent data value. We release the Darwin-Science corpus and daVinci-origin models to enable principled, co-evolutionary development.
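The pipeline's central operation is using a frontier LLM to rewrite raw scientific text so that implicit reasoning and terminology become explicit (the L4 "Generative Refinement" stage). Below is a minimal sketch of what such a pass could look like, assuming a generic OpenAI-style chat client; the prompt wording, model name, and `refine_l4` helper are illustrative assumptions, not the authors' released pipeline.

```python
# Hypothetical sketch of an L4 "Generative Refinement" pass: a frontier
# LLM rewrites raw scientific text so implicit reasoning and terminology
# become explicit. The client, model name, and prompt are assumptions,
# not the authors' released pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

L4_PROMPT = (
    "Rewrite the following scientific passage for pre-training. "
    "Expand abbreviations, define domain terminology inline, and spell "
    "out reasoning steps the author left implicit. Preserve all facts.\n\n"
    "Passage:\n{passage}"
)

def refine_l4(passage: str, model: str = "gpt-4o") -> str:
    """One refinement step over a raw passage (illustrative only)."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": L4_PROMPT.format(passage=passage)}],
        temperature=0.3,
    )
    return response.choices[0].message.content
```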
Related papers
- CoT-Evo: Evolutionary Distillation of Chain-of-Thought for Scientific Reasoning [63.44477226386808]
Chain-of-thought (CoT) distillation from advanced large language models (LLMs) has proven effective in general reasoning tasks, but it struggles in scientific domains where even advanced models often produce incorrect or superficial reasoning. We propose CoT-Evo, an evolutionary CoT distillation framework to overcome this problem.
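The core loop of such a framework is evolutionary: sample many candidate chains of thought, score them, keep the fittest, and mutate or recombine them to refill the population. A toy sketch under assumed interfaces follows; the `generate` and `score` callables, population sizes, and round count are illustrative placeholders, not the paper's operators.

```python
# Toy evolutionary CoT-selection loop in the spirit of CoT-Evo. The
# `generate` and `score` callables, population sizes, and round count
# are illustrative assumptions, not the paper's operators.
import random

def evolve_cot(question, generate, score, rounds=3, pop=8, keep=4):
    """generate(q, parent=None) -> CoT string; score(q, cot) -> float."""
    population = [generate(question) for _ in range(pop)]
    for _ in range(rounds):
        # Selection: keep the chains the reward signal rates highest.
        population.sort(key=lambda c: score(question, c), reverse=True)
        survivors = population[:keep]
        # Variation: refill the population by mutating survivors.
        children = [generate(question, parent=random.choice(survivors))
                    for _ in range(pop - keep)]
        population = survivors + children
    return max(population, key=lambda c: score(question, c))
```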
arXiv Detail & Related papers (2025-10-15T05:29:56Z)
- MobileLLM-R1: Exploring the Limits of Sub-Billion Language Model Reasoners with Open Training Recipes [60.57770396565211]
We show that strong reasoning abilities can emerge with far less data. MobileLLM-R1-950M achieves an AIME score of 15.5, compared to just 0.6 for OLMo-2-1.48B and 0.3 for SmolLM-2-1.7B.
arXiv Detail & Related papers (2025-09-29T15:43:59Z)
- SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines [112.78540935201558]
We present a scientific reasoning foundation model that aligns natural language with heterogeneous scientific representations. The model is pretrained on a 206B-token corpus spanning scientific text, pure sequences, and sequence-text pairs, then aligned via SFT on 40M instructions. It supports five capability families covering up to 103 tasks: (i) faithful translation between text and scientific formats, (ii) text/knowledge extraction, (iii) property prediction, (iv) property classification, and (v) unconditional and conditional sequence generation and design.
arXiv Detail & Related papers (2025-09-25T17:52:06Z)
- THE-Tree: Can Tracing Historical Evolution Enhance Scientific Verification and Reasoning? [16.91455372359864]
We introduce THE-Tree (Technology History Evolution Tree), a computational framework that constructs such domain-specific evolution trees from scientific literature.
arXiv Detail & Related papers (2025-06-26T20:44:51Z)
- DarwinLM: Evolutionary Structured Pruning of Large Language Models [49.55509443720372]
Large Language Models (LLMs) have achieved significant success across various NLP tasks. Structured pruning offers an effective solution by compressing models and directly providing end-to-end speed improvements. We propose DarwinLM, a method for training-aware structured pruning.
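A hedged sketch of the evolutionary-search idea: a binary mask over structural units (e.g., attention heads or FFN channels) is mutated, and offspring survive only when they reduce validation loss. The mask encoding and `fitness` function here are assumptions for illustration, not DarwinLM's actual training-aware procedure.

```python
# Toy sketch of evolutionary structured pruning: a binary mask over
# structural units (attention heads, FFN channels) is mutated, and
# offspring are kept only if validation loss drops. Mask encoding and
# `fitness` are illustrative assumptions, not DarwinLM's procedure.
import copy
import random

def mutate(mask, flips=2):
    child = copy.deepcopy(mask)
    for _ in range(flips):
        i = random.randrange(len(child))
        child[i] = 1 - child[i]  # toggle keep/prune for one unit
    return child

def evolve_pruning(init_mask, fitness, generations=10, offspring=4):
    """fitness(mask) -> validation loss of the model pruned by `mask`."""
    parent, parent_loss = init_mask, fitness(init_mask)
    for _ in range(generations):
        kids = [mutate(parent) for _ in range(offspring)]
        losses = [fitness(k) for k in kids]
        best_loss = min(losses)
        if best_loss < parent_loss:  # greedy survivor selection
            parent = kids[losses.index(best_loss)]
            parent_loss = best_loss
    return parent
```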
arXiv Detail & Related papers (2025-02-11T18:59:35Z)
- First Train to Generate, then Generate to Train: UnitedSynT5 for Few-Shot NLI [1.2642388972233847]
We present UnitedSynT5, an advanced extension of Entailment Few-Shot Learning (EFL). We use a T5-based generator to synthesize additional premise-hypothesis pairs, which are rigorously cleaned and integrated into the training data. We train a GTR-T5-XL model on this expanded dataset, achieving a new benchmark of 94.7% accuracy on the SNLI dataset, 94.0% accuracy on the E-SNLI dataset, and 92.6% accuracy on the MultiNLI dataset, surpassing the previous SOTA models.
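The generate-then-filter augmentation step might look like the following sketch, using a generic `transformers` text2text pipeline; the prompt format, the `t5-base` stand-in, and the filtering rule are illustrative assumptions rather than the paper's exact recipe.

```python
# Sketch of generate-then-filter NLI augmentation: a seq2seq model
# proposes a hypothesis for a target label, and a trusted scorer decides
# whether the synthetic pair enters the training set. Prompt format,
# model ID, and threshold are illustrative assumptions.
from transformers import pipeline

generator = pipeline("text2text-generation", model="t5-base")

def synthesize_pair(premise: str, label: str) -> dict:
    """Ask the generator for a hypothesis matching the target label."""
    prompt = f"generate {label} hypothesis: {premise}"
    out = generator(prompt, max_new_tokens=40)[0]["generated_text"]
    return {"premise": premise, "hypothesis": out, "label": label}

def keep(example: dict, nli_scorer, threshold: float = 0.9) -> bool:
    """Cleaning step: keep only pairs a trusted NLI model agrees with."""
    return nli_scorer(example) >= threshold
```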
arXiv Detail & Related papers (2024-12-12T13:21:09Z)
- Towards Effective and Efficient Continual Pre-training of Large Language Models [163.34610964970258]
Continual pre-training (CPT) has been an important approach for adapting language models to specific domains or tasks.
This paper presents a technical report on continually pre-training Llama-3 (8B).
It significantly enhances the backbone model's Chinese language ability and scientific reasoning ability.
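For readers unfamiliar with the mechanics, continual pre-training is simply further next-token training of an existing checkpoint on a new corpus. A minimal sketch with Hugging Face `transformers` follows; the model ID, corpus, and hyperparameters are placeholders rather than the report's recipe.

```python
# Minimal continual pre-training sketch with Hugging Face transformers.
# Model ID, corpus, and hyperparameters are placeholders, not the
# report's recipe (wikitext stands in for the domain corpus).
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "meta-llama/Meta-Llama-3-8B"  # backbone being adapted
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

raw = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

train = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments("cpt-out", per_device_train_batch_size=1,
                           learning_rate=1e-5, num_train_epochs=1),
    train_dataset=train,
    # mlm=False gives the plain next-token (causal LM) objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # continues pre-training on the new domain
```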
arXiv Detail & Related papers (2024-07-26T13:55:21Z)
- SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [97.31347312130119]
SciRIFF (Scientific Resource for Instruction-Following and Finetuning) is a dataset of 137K instruction-following instances for training and evaluation, covering 54 tasks. These tasks span five core scientific literature understanding capabilities: information extraction, summarization, question answering, claim verification, and classification. SciRIFF is unique in being an entirely expert-written, high-quality instruction-following dataset for extracting and synthesizing information from research literature across diverse scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z)
- MuCoMiD: A Multitask Convolutional Learning Framework for miRNA-Disease Association Prediction [0.4061135251278187]
We propose a novel multi-tasking convolution-based approach, which we refer to as MuCoMiD.
MuCoMiD allows automatic feature extraction while incorporating knowledge from 4 heterogeneous biological information sources.
We conduct large-scale experiments on standard benchmark datasets as well as our proposed larger independent test sets and case studies.
MuCoMiD shows an improvement of at least 5% in 5-fold CV evaluation on the HMDDv2.0 and HMDDv3.0 datasets, and at least 49% on larger independent test sets with unseen diseases over state-of-the-art approaches.
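As a rough structural illustration, a multitask convolutional association model pairs per-modality convolutional encoders with one output head per task. The following PyTorch sketch uses invented layer sizes and an invented auxiliary head; it mirrors the general pattern, not MuCoMiD's published architecture.

```python
# PyTorch sketch of a multitask convolutional association model:
# per-modality Conv1d encoders, a main association head, and an
# auxiliary head. Layer sizes and the auxiliary task are invented for
# illustration and do not reproduce MuCoMiD's published architecture.
import torch
import torch.nn as nn

def conv_encoder(in_channels: int, hidden: int) -> nn.Module:
    return nn.Sequential(
        nn.Conv1d(in_channels, hidden, kernel_size=5, padding=2),
        nn.ReLU(),
        nn.AdaptiveMaxPool1d(1),  # one fixed-size vector per entity
    )

class AssociationNet(nn.Module):
    def __init__(self, mirna_channels=4, disease_channels=8, hidden=64):
        super().__init__()
        self.mirna_enc = conv_encoder(mirna_channels, hidden)
        self.disease_enc = conv_encoder(disease_channels, hidden)
        self.assoc_head = nn.Linear(2 * hidden, 1)  # main task
        self.aux_head = nn.Linear(hidden, 1)        # auxiliary task

    def forward(self, mirna, disease):
        m = self.mirna_enc(mirna).squeeze(-1)      # (batch, hidden)
        d = self.disease_enc(disease).squeeze(-1)  # (batch, hidden)
        assoc = torch.sigmoid(self.assoc_head(torch.cat([m, d], dim=-1)))
        aux = torch.sigmoid(self.aux_head(m))
        return assoc, aux
```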
arXiv Detail & Related papers (2021-08-08T10:01:46Z)