Fugu-MT Paper Translation (Abstract): HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models

Paper overview: HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models

arxiv url: http://arxiv.org/abs/2511.01066v2
Date: Wed, 05 Nov 2025 13:19:47 GMT
Status: Translation complete
Last updated in system: 2025-11-06 13:56:26.17257
Title: HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models
Title (reference translation): HPLT 3.0: Large-Scale Multilingual Resources for LLM and MT
Authors: Stephan Oepen, Nikolay Arefev, Mikko Aulamo, Marta Bañón, Maja Buljan, Laurie Burchell, Lucas Charpentier, Pinzhen Chen, Mariya Fedorova, Ona de Gibert, Barry Haddow, Jan Hajič, Jindřich Helcl, Andrey Kutuzov, Veronika Laippala, Zihao Li, Risto Luukkonen, Bhavitvya Malik, Vladislav Mikhailov, Amanda Myntti, Dayyán O'Brien, Lucie Poláková, Sampo Pyysalo, Gema Ramírez Sánchez, Janine Siewert, Pavel Stepachev, Jörg Tiedemann, Teemu Vahtola, Dušan Variš, Fedor Vitiugin, Tea Vojtěchová, Jaume Zaragoza
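Abstract summary: We present an initiative to provide open, very large, high-quality, and richly annotated text datasets for almost 200 languages. At 30 trillion tokens, this is likely the largest multilingual collection of LLM pre-training data. We train and evaluate a family of 57 monolingual encoder-decoder models, as well as a handful of monolingual GPT-like reference models.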
Reference score (independently computed attention measure): 25.953042884928006
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present an ongoing initiative to provide open, very large, high-quality, and richly annotated textual datasets for almost 200 languages. At 30 trillion tokens, this is likely the largest generally available multilingual collection of LLM pre-training data. These datasets are derived from web crawls from different sources and accompanied with a complete, open-source pipeline for document selection from web archives, text extraction from HTML, language identification for noisy texts, exact and near-deduplication, annotation with, among others, register labels, text quality estimates, and personally identifiable information; and final selection and filtering. We report on data quality probes through contrastive and analytical statistics, through manual inspection of samples for 24 languages, and through end-to-end evaluation of various language model architectures trained on this data. For multilingual LLM evaluation, we provide a comprehensive collection of benchmarks for nine European languages, with special emphasis on natively created tasks, mechanisms to mitigate prompt sensitivity, and refined normalization and aggregation of scores. Additionally, we train and evaluate a family of 57 monolingual encoder-decoder models, as well as a handful of monolingual GPT-like reference models. Besides the monolingual data and models, we also present a very large collection of parallel texts automatically mined from this data, together with a novel parallel corpus synthesized via machine translation.
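Abstract (reference translation): We present an ongoing initiative to provide open, very large, high-quality, and richly annotated text datasets for almost 200 languages. At 30 trillion tokens, this is likely the largest generally available multilingual collection of LLM pre-training data. The datasets are derived from web crawls from different sources and come with a complete, open-source pipeline covering document selection from web archives, text extraction from HTML, language identification for noisy texts, exact and near-deduplication, annotation with register labels, text quality estimates, and personally identifiable information, among others, and final selection and filtering. We report on data quality probes through contrastive and analytical statistics, manual inspection of samples for 24 languages, and end-to-end evaluation of various language model architectures trained on this data. For multilingual LLM evaluation, we provide a comprehensive collection of benchmarks for nine European languages, with special emphasis on natively created tasks, mechanisms to mitigate prompt sensitivity, and refined normalization and aggregation of scores. In addition, we train and evaluate a family of 57 monolingual encoder-decoder models, as well as a handful of monolingual GPT-like reference models. Besides the monolingual data and models, we also present a very large collection of parallel texts automatically mined from this data, together with a novel parallel corpus synthesized via machine translation.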

Related papers