Technical Report: Competition Solution For BetterMixture
- URL: http://arxiv.org/abs/2403.13233v1
- Date: Wed, 20 Mar 2024 01:46:06 GMT
- Title: Technical Report: Competition Solution For BetterMixture
- Authors: Shuaijiang Zhao, Xiaoquan Fang
- Abstract summary: This paper details our solution for the BetterMixture challenge, which focuses on the fine-tuning data mixing for large language models.
Our approach, which secured third place, incorporates data deduplication, low-level and high-level quality filtering, and diversity selection.
The foundation of our solution is Ke-Data-Juicer, demonstrating its robust capabilities in handling and optimizing data for large language models.
- Score: 1.2482895582813895
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the era of flourishing large-scale models, the challenge of selecting and optimizing datasets from the vast and complex sea of data, to enhance the performance of large language models within the constraints of limited computational resources, has become paramount. This paper details our solution for the BetterMixture challenge, which focuses on the fine-tuning data mixing for large language models. Our approach, which secured third place, incorporates data deduplication, low-level and high-level quality filtering, and diversity selection. The foundation of our solution is Ke-Data-Juicer, an extension of Data-Juicer, demonstrating its robust capabilities in handling and optimizing data for large language models.
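To make the described pipeline concrete, below is a minimal, self-contained Python sketch of the three stages named in the abstract: deduplication, low-level and high-level quality filtering, and diversity selection. The field names (`instruction`, `response`), thresholds, the entropy-based quality proxy, and the bag-of-words diversity measure are illustrative assumptions only, not the actual Ke-Data-Juicer operators or the competition configuration.
```python
# Illustrative sketch only: hypothetical field names ("instruction", "response"),
# thresholds, quality proxy, and diversity measure -- not the actual
# Ke-Data-Juicer operators or the competition configuration.
import hashlib
import math
from collections import Counter

def dedup_exact(samples):
    """Stage 1: drop samples whose instruction+response text is an exact duplicate."""
    seen, kept = set(), []
    for s in samples:
        key = hashlib.md5((s["instruction"] + s["response"]).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(s)
    return kept

def low_level_quality_ok(s, min_len=10, max_len=4096, min_alnum_ratio=0.5):
    """Stage 2a: cheap heuristic filters (length and alphanumeric ratio)."""
    text = s["instruction"] + s["response"]
    if not (min_len <= len(text) <= max_len):
        return False
    alnum = sum(ch.isalnum() for ch in text)
    return alnum / len(text) >= min_alnum_ratio

def high_level_quality(s):
    """Stage 2b: stand-in for a model-based quality score; here token entropy.
    In practice an LLM or reward model would score each sample."""
    tokens = (s["instruction"] + " " + s["response"]).split()
    if not tokens:
        return 0.0
    probs = [c / len(tokens) for c in Counter(tokens).values()]
    return -sum(p * math.log(p) for p in probs)

def _bow(s):
    return Counter((s["instruction"] + " " + s["response"]).lower().split())

def _cosine(a, b):
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def select_diverse(samples, budget):
    """Stage 3: greedy farthest-point selection over bag-of-words vectors
    (a learned embedding space would be used in practice)."""
    if not samples or budget <= 0:
        return []
    vecs = [_bow(s) for s in samples]
    chosen = [0]
    while len(chosen) < min(budget, len(samples)):
        dists = {i: min(1.0 - _cosine(vecs[i], vecs[j]) for j in chosen)
                 for i in range(len(samples)) if i not in chosen}
        chosen.append(max(dists, key=dists.get))
    return [samples[i] for i in chosen]

def mix(samples, budget=1000, quality_keep_ratio=0.5):
    """Full pipeline: dedup -> low/high-level quality filtering -> diversity selection."""
    data = [s for s in dedup_exact(samples) if low_level_quality_ok(s)]
    data.sort(key=high_level_quality, reverse=True)
    data = data[: max(1, int(len(data) * quality_keep_ratio))]
    return select_diverse(data, budget)
```
In practice the high-level score would come from an LLM or reward model and the diversity step from learned embeddings; the sketch only illustrates how the three stages compose.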
Related papers
- Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap [13.89078939095465]
We introduce a novel difficulty-based data selection strategy for preference datasets, grounded in the DPO implicit reward mechanism.
Our approach consistently outperforms five strong baselines across multiple datasets and alignment tasks (a minimal sketch of the reward-gap computation appears after this list).
arXiv Detail & Related papers (2025-08-06T07:24:14Z) - Large-Scale Diverse Synthesis for Mid-Training [15.81154701009597]
BoostQA is a 100B-token large-scale question-answering dataset.
We propose a novel diversified pipeline to synthesize BoostQA.
Our method enables Llama-3 8B, mid-trained on a 40B-token dataset, to achieve an average improvement of 12.74% on MMLU and CMMLU.
arXiv Detail & Related papers (2025-08-02T11:37:16Z) - C2-Evo: Co-Evolving Multimodal Data and Model for Self-Improving Reasoning [78.36259648527401]
C2-Evo is an automatic, closed-loop self-improving framework that jointly evolves both training data and model capabilities.
We show that C2-Evo consistently obtains considerable performance gains across multiple mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-07-22T12:27:08Z) - Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models [52.22235443948351]
High-quality multilingual training data is essential for effectively pretraining large language models (LLMs).
Here, we introduce JQL, a systematic approach that efficiently curates diverse and high-quality multilingual data at scale.
JQL distills LLMs' annotation capabilities into lightweight annotators based on pretrained multilingual embeddings.
arXiv Detail & Related papers (2025-05-28T11:06:54Z) - Diffusion-Augmented Coreset Expansion for Scalable Dataset Distillation [18.474302012851087]
We propose a two-stage solution for dataset distillation.
First, we compress the dataset by selecting only the most informative patches to form a coreset.
Next, we leverage a generative foundation model to dynamically expand this compressed set in real-time.
We demonstrate a significant improvement of over 10% compared to the state-of-the-art on several large-scale dataset distillation benchmarks.
arXiv Detail & Related papers (2024-12-05T23:40:27Z) - RedPajama: an Open Dataset for Training Large Language Models [80.74772646989423]
We identify three core data-related challenges that must be addressed to advance open-source language models.
These include (1) transparency in model development, including the data curation process, (2) access to large quantities of high-quality data, and (3) availability of artifacts and metadata for dataset curation and analysis.
We release RedPajama-V1, an open reproduction of the LLaMA training dataset, and RedPajama-V2, a massive web-only dataset consisting of raw, unfiltered text data together with quality signals and metadata.
arXiv Detail & Related papers (2024-11-19T09:35:28Z) - Unleashing LLM Reasoning Capability via Scalable Question Synthesis from Scratch [54.12139707822201]
We propose ScaleQuest, a novel, scalable, and cost-effective data synthesis method.
By generating diverse questions from scratch, we produce a dataset of 1 million problem-solution pairs.
Our experiments demonstrate that models trained on our data outperform models trained on existing open-source datasets.
arXiv Detail & Related papers (2024-10-24T12:42:04Z) - A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, imposing a negative impact on training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z) - Rethinking Data Selection at Scale: Random Selection is Almost All You Need [39.14807071480125]
Supervised fine-tuning is crucial for aligning Large Language Models with human instructions.
Most existing data selection techniques are designed for small-scale data pools.
arXiv Detail & Related papers (2024-10-12T02:48:34Z) - Unsupervised Data Validation Methods for Efficient Model Training [0.0]
State-of-the-art models in natural language processing (NLP), text-to-speech (TTS), speech-to-text (STT) and vision-language models (VLM) rely heavily on large datasets.
This research explores key areas such as defining "quality data," developing methods for generating appropriate data and enhancing accessibility to model training.
arXiv Detail & Related papers (2024-10-10T13:00:53Z) - SSE: Multimodal Semantic Data Selection and Enrichment for Industrial-scale Data Assimilation [29.454948190814765]
In recent years, the data collected for artificial intelligence has grown to an unmanageable amount.
We propose a framework to select the most semantically diverse and important dataset portion.
We further semantically enrich it by discovering meaningful new data from a massive unlabeled data pool.
arXiv Detail & Related papers (2024-09-20T19:17:52Z) - Multi-OCT-SelfNet: Integrating Self-Supervised Learning with Multi-Source Data Fusion for Enhanced Multi-Class Retinal Disease Classification [2.5091334993691206]
Development of a robust deep-learning model for retinal disease diagnosis requires a substantial dataset for training.
The capacity to generalize effectively on smaller datasets remains a persistent challenge.
We've combined a wide range of data sources to improve performance and generalization to new data.
arXiv Detail & Related papers (2024-09-17T17:22:35Z) - CollectiveSFT: Scaling Large Language Models for Chinese Medical Benchmark with Collective Instructions in Healthcare [12.218718086529462]
This study focuses on the Comprehensive Medical Benchmark in Chinese (CMB).
We successfully trained a smaller base model to achieve scores comparable to larger models.
By integrating a wide range of instructional content, our approach addresses potential issues such as data quality inconsistencies.
arXiv Detail & Related papers (2024-07-29T05:00:48Z) - Data-Juicer Sandbox: A Comprehensive Suite for Multimodal Data-Model Co-development [67.55944651679864]
We present a novel sandbox suite tailored for integrated data-model co-development.
This sandbox provides a comprehensive experimental platform, enabling rapid iteration and insight-driven refinement of both data and models.
We also uncover fruitful insights gleaned from exhaustive benchmarks, shedding light on the critical interplay between data quality, diversity, and model behavior.
arXiv Detail & Related papers (2024-07-16T14:40:07Z) - Unlocking the Potential of Model Merging for Low-Resource Languages [66.7716891808697]
Adapting large language models to new languages typically involves continual pre-training (CT) followed by supervised fine-tuning (SFT).
We propose model merging as an alternative for low-resource languages, combining models with distinct capabilities into a single model without additional training.
Experiments based on Llama-2-7B demonstrate that model merging effectively endows LLMs for low-resource languages with task-solving abilities, outperforming CT-then-SFT in scenarios with extremely scarce data.
arXiv Detail & Related papers (2024-07-04T15:14:17Z) - LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z) - Efficient Online Data Mixing For Language Model Pre-Training [101.45242332613944]
Existing data selection methods suffer from slow and computationally expensive processes.
Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together.
We develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing.
arXiv Detail & Related papers (2023-12-05T00:42:35Z) - Improving Classifier Training Efficiency for Automatic Cyberbullying Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z)
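As noted in the first entry of this list, difficulty-based preference-data selection builds on the DPO implicit reward. The following Python sketch shows how such a reward gap could be computed and used to rank preference pairs; the `LogProbFn` scoring interface, the field names, `beta`, and the "smaller absolute gap = harder pair" selection rule are assumptions for illustration, not the cited paper's implementation.
```python
# Hedged sketch of the DPO implicit reward gap for difficulty-based
# preference-data selection. The LogProbFn interface, beta value, and the
# "smaller absolute gap = harder pair" rule are illustrative assumptions.
from typing import Callable, Dict, List

LogProbFn = Callable[[str, str], float]  # returns log p(response | prompt)

def implicit_reward(policy_lp: LogProbFn, ref_lp: LogProbFn,
                    prompt: str, response: str, beta: float = 0.1) -> float:
    """DPO implicit reward: beta * (log pi_theta(y|x) - log pi_ref(y|x))."""
    return beta * (policy_lp(prompt, response) - ref_lp(prompt, response))

def reward_gap(policy_lp: LogProbFn, ref_lp: LogProbFn,
               pair: Dict, beta: float = 0.1) -> float:
    """Implicit-reward gap between the chosen and rejected responses of a pair."""
    r_w = implicit_reward(policy_lp, ref_lp, pair["prompt"], pair["chosen"], beta)
    r_l = implicit_reward(policy_lp, ref_lp, pair["prompt"], pair["rejected"], beta)
    return r_w - r_l

def select_hard_pairs(pairs: List[Dict], policy_lp: LogProbFn, ref_lp: LogProbFn,
                      budget: int, beta: float = 0.1) -> List[Dict]:
    """Keep the `budget` pairs with the smallest absolute reward gap (hardest)."""
    return sorted(pairs, key=lambda p: abs(reward_gap(policy_lp, ref_lp, p, beta)))[:budget]
```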
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.