Fugu-MT Paper Translation (Summary): Data-Juicer: A One-Stop Data Processing System for Large Language Models

Paper summary: Data-Juicer: A One-Stop Data Processing System for Large Language Models

arxiv url: http://arxiv.org/abs/2309.02033v3
Date: Wed, 20 Dec 2023 08:27:40 GMT
Status: Translation complete
System update date: 2023-12-21 21:48:07.669329
Title: Data-Juicer: A One-Stop Data Processing System for Large Language Models
Title (reference translation): Data-Juicer: A One-Stop Data Processing System for Large Language Models
Authors: Daoyuan Chen, Yilun Huang, Zhijian Ma, Hesen Chen, Xuchen Pan, Ce Ge, Dawei Gao, Yuexiang Xie, Zhaoyang Liu, Jinyang Gao, Yaliang Li, Bolin Ding, Jingren Zhou
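Abstract summary: A data recipe is a mixture of data from different sources for training large language models (LLMs). We build a new system named Data-Juicer, with which diverse data recipes can be generated efficiently. Data recipes derived with Data-Juicer yield notable improvements on state-of-the-art LLMs.
Reference score (independently computed attention measure): 73.27731037450995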
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The immense evolution in Large Language Models (LLMs) has underscored the importance of massive, heterogeneous, and high-quality data. A data recipe is a mixture of data from different sources for training LLMs, which plays a vital role in LLMs' performance. Existing open-source tools for LLM data processing are mostly tailored for specific data recipes. To continuously uncover the potential of LLMs, incorporate data from new sources, and improve LLMs' performance, we build a new system named Data-Juicer, with which we can efficiently generate diverse data recipes, explore different possibilities in forming data mixtures, and evaluate their effects on model performance. Different from traditional data-analytics pipelines, Data-Juicer faces some unique challenges. Firstly, the possible data sources for forming data recipes are truly heterogeneous and massive with various qualities. Secondly, it is extremely expensive to precisely evaluate data recipes' impact on LLMs' performance. Thirdly, the end users of Data-Juicer, model developers, need sufficient flexibility to configure and evaluate different data recipes. Data-Juicer features a fine-grained abstraction of pipelines for constructing data recipes, with over 50 built-in operators for easy composition and extension. By incorporating visualization and auto-evaluation capabilities, Data-Juicer enables a timely feedback loop for both LLM pre-training and fine-tuning. Further, Data-Juicer is optimized and integrated with ecosystems for LLM training, evaluation, and distributed computing. The data recipes derived with Data-Juicer gain notable improvements on state-of-the-art LLMs, by up to 7.45% increase in averaged score across 16 LLM benchmarks and 17.5% higher win rate in pair-wise GPT-4 evaluations. Our system, data recipes, and tutorials are released, calling for broader data-centric research on training and understanding LLMs.
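Abstract (reference translation): The immense evolution of large language models (LLMs) has underscored the importance of massive, heterogeneous, and high-quality data. A data recipe is a mixture of data from different sources for training LLMs, and it plays a vital role in LLM performance. Existing open-source tools for LLM data processing are mostly tailored to specific data recipes. To continuously uncover the potential of LLMs, incorporate data from new sources, and improve LLM performance, we built a new system named Data-Juicer that efficiently generates diverse data recipes, explores different possibilities in forming data mixtures, and evaluates their effects on model performance. Unlike traditional data-analytics pipelines, Data-Juicer faces several unique challenges. First, the possible data sources for forming data recipes are truly heterogeneous and massive, and they vary in quality. Second, precisely evaluating the impact of a data recipe on LLM performance is extremely expensive. Third, the end users of Data-Juicer, who are model developers, need sufficient flexibility to configure and evaluate different data recipes. Data-Juicer features a fine-grained abstraction of pipelines for constructing data recipes, with over 50 built-in operators for easy composition and extension. By incorporating visualization and auto-evaluation capabilities, Data-Juicer enables a timely feedback loop for both LLM pre-training and fine-tuning. Further, Data-Juicer is optimized and integrated with ecosystems for LLM training, evaluation, and distributed computing. Data recipes derived with Data-Juicer yield notable improvements on state-of-the-art LLMs: up to a 7.45% increase in averaged score across 16 LLM benchmarks and a 17.5% higher win rate in pair-wise GPT-4 evaluations. Our system, data recipes, and tutorials are released, calling for broader data-centric research on training and understanding LLMs.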
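
Illustrative sketch (not Data-Juicer's API): the abstract describes a data recipe as a mixture of data from different sources, built by composing small operators into a pipeline. The Python below is a minimal sketch of that idea under stated assumptions. Every name in it (`mix_sources`, `normalize_whitespace`, `min_words`, `run_pipeline`) is hypothetical and chosen only for illustration. The exact-match deduplication step is likewise an assumption, not something the abstract specifies.

```python
import random
from typing import Callable

# Hypothetical sample schema: {"source": ..., "text": ...}
Sample = dict[str, str]
Mapper = Callable[[Sample], Sample]   # rewrites a sample
Filter = Callable[[Sample], bool]     # True means "keep this sample"


def mix_sources(sources: dict[str, list[str]],
                weights: dict[str, float],
                n: int,
                seed: int = 0) -> list[Sample]:
    """Draw n samples, picking each sample's source in proportion to its weight."""
    rng = random.Random(seed)
    names = list(sources)
    picks = rng.choices(names, weights=[weights[name] for name in names], k=n)
    return [{"source": name, "text": rng.choice(sources[name])} for name in picks]


def normalize_whitespace(sample: Sample) -> Sample:
    """Mapper: collapse runs of whitespace into single spaces."""
    return {**sample, "text": " ".join(sample["text"].split())}


def min_words(threshold: int) -> Filter:
    """Filter factory: keep samples with at least `threshold` words."""
    def keep(sample: Sample) -> bool:
        return len(sample["text"].split()) >= threshold
    return keep


def run_pipeline(samples: list[Sample],
                 mappers: list[Mapper],
                 filters: list[Filter]) -> list[Sample]:
    """Apply mappers, then filters, then drop exact duplicate texts (assumed step)."""
    kept: list[Sample] = []
    seen: set[str] = set()
    for sample in samples:
        for mapper in mappers:
            sample = mapper(sample)
        if not all(f(sample) for f in filters):
            continue
        if sample["text"] in seen:
            continue
        seen.add(sample["text"])
        kept.append(sample)
    return kept


if __name__ == "__main__":
    # Toy sources and mixture weights, invented for this example.
    sources = {
        "web": ["hello   world  again", "short"],
        "code": ["def f():  return 1"],
    }
    recipe = mix_sources(sources, {"web": 0.7, "code": 0.3}, n=20)
    cleaned = run_pipeline(recipe, [normalize_whitespace], [min_words(2)])
    for sample in cleaned:
        print(sample["source"], "|", sample["text"])
```

Changing the weights, or swapping operators in and out of the two lists, gives a different recipe from the same sources. This is roughly the kind of exploration the abstract says the system is meant to make cheap.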
