Fugu-MT Paper Translation (Abstract): DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models

Paper summary: DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models

arxiv url: http://arxiv.org/abs/2603.26164v1
Date: Fri, 27 Mar 2026 08:28:02 GMT
Status: Translation complete
Last updated in system: 2026-03-30 21:49:48.40106
Title: DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models
Authors: Hao Liang, Zhengyang Zhao, Meiyi Qiang, Mingrui Chen, Lu Ma, Rongyi Yu, Hengyi Feng, Shixuan Sun, Zimo Meng, Xiaochen Ma, Xuanlin Yang, Qifeng Cai, Ruichuan An, Bohan Zeng, Zhen Hao Wong, Chengyu Shen, Runming He, Zhaoyang Han, Yaowei Zheng, Fangcheng Fu, Conghui He, Bin Cui, Zhiyu Li, Weinan E, Wentao Zhang
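Abstract summary: We introduce DataFlex, a data-centric dynamic training framework built on LLaMA-Factory. DataFlex supports three major paradigms of dynamic data optimization: sample selection, domain mixture adjustment, and data reweighting. It provides trainer abstractions and modular components, enabling a drop-in replacement for standard LLM training.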
Reference score (independently computed attention measure): 51.48564522455171
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Data-centric training has emerged as a promising direction for improving large language models (LLMs) by optimizing not only model parameters but also the selection, composition, and weighting of training data during optimization. However, existing approaches to data selection, data mixture optimization, and data reweighting are often developed in isolated codebases with inconsistent interfaces, hindering reproducibility, fair comparison, and practical integration. In this paper, we present DataFlex, a unified data-centric dynamic training framework built upon LLaMA-Factory. DataFlex supports three major paradigms of dynamic data optimization: sample selection, domain mixture adjustment, and sample reweighting, while remaining fully compatible with the original training workflow. It provides extensible trainer abstractions and modular components, enabling a drop-in replacement for standard LLM training, and unifies key model-dependent operations such as embedding extraction, inference, and gradient computation, with support for large-scale settings including DeepSpeed ZeRO-3. We conduct comprehensive experiments across multiple data-centric methods. Dynamic data selection consistently outperforms static full-data training on MMLU across both Mistral-7B and Llama-3.2-3B. For data mixture, DoReMi and ODM improve both MMLU accuracy and corpus-level perplexity over default proportions when pretraining Qwen2.5-1.5B on SlimPajama at 6B and 30B token scales. DataFlex also achieves consistent runtime improvements over original implementations. These results demonstrate that DataFlex provides an effective, efficient, and reproducible infrastructure for data-centric dynamic training of LLMs.
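The three paradigms named in the abstract (sample selection, domain mixture adjustment, sample reweighting) can be illustrated with a minimal, framework-free sketch. Everything below is an assumption made for illustration: the function names, the loss-based heuristics, and the parameters `k`, `temperature`, and `step_size` are hypothetical and are not DataFlex's API or the specific algorithms evaluated in the paper.

```python
import math


def select_top_k(losses, k):
    """Sample selection (illustrative): keep the indices of the k highest-loss samples."""
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return sorted(order[:k])


def reweight(losses, temperature=1.0):
    """Sample reweighting (illustrative): softmax over per-sample losses,
    so higher-loss samples receive larger weights. Weights sum to 1."""
    peak = max(losses)  # subtract the max for numerical stability
    exps = [math.exp((loss - peak) / temperature) for loss in losses]
    total = sum(exps)
    return [e / total for e in exps]


def update_mixture(weights, domain_losses, step_size=1.0):
    """Domain mixture adjustment (illustrative): a multiplicative-weights step
    that shifts sampling probability toward higher-loss domains, then renormalizes."""
    scaled = [w * math.exp(step_size * loss) for w, loss in zip(weights, domain_losses)]
    total = sum(scaled)
    return [s / total for s in scaled]


if __name__ == "__main__":
    losses = [0.2, 1.5, 0.7, 2.1]
    print(select_top_k(losses, 2))  # [1, 3]
    print(reweight(losses))
    print(update_mixture([0.5, 0.5], [1.0, 2.0]))
```

In a dynamic training loop, functions like these would be re-evaluated periodically from the current model's losses, which is what distinguishes the dynamic setting from a one-off static filtering pass.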


Related papers