Fugu-MT Paper Translation (Abstract): AfriqueLLM: How Data Mixing and Model Architecture Impact Continued Pre-training for African Languages

Paper overview: AfriqueLLM: How Data Mixing and Model Architecture Impact Continued Pre-training for African Languages

arxiv url: http://arxiv.org/abs/2601.06395v1
Date: Sat, 10 Jan 2026 02:39:31 GMT
Status: Translation complete
Last updated in system: 2026-01-13 19:08:00.792937
Title: AfriqueLLM: How Data Mixing and Model Architecture Impact Continued Pre-training for African Languages
Title (reference translation): AfriqueLLM: The Impact of Data Mixing and Model Architecture on Continued Pre-training for African Languages
Authors: Hao Yu, Tianyi Xu, Michael A. Hedderich, Wassim Hamidouche, Syed Waqas Zamir, David Ifeoluwa Adelani
Abstract summary: Large language models (LLMs) are increasingly multilingual, yet open models underperform relative to proprietary systems. We present AfriqueLLM, a suite of open LLMs adapted to 20 African languages through CPT on 26B tokens.
Reference score (independently computed attention score): 30.309928265469427
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) are increasingly multilingual, yet open models continue to underperform relative to proprietary systems, with the gap most pronounced for African languages. Continued pre-training (CPT) offers a practical route to language adaptation, but improvements on demanding capabilities such as mathematical reasoning often remain limited. This limitation is driven in part by the uneven domain coverage and missing task-relevant knowledge that characterize many low-resource language corpora. We present AfriqueLLM, a suite of open LLMs adapted to 20 African languages through CPT on 26B tokens. We perform a comprehensive empirical study across five base models spanning sizes and architectures, including Llama 3.1, Gemma 3, and Qwen 3, and systematically analyze how CPT data composition shapes downstream performance. In particular, we vary mixtures that include math, code, and synthetic translated data, and evaluate the resulting models on a range of multilingual benchmarks. Our results identify data composition as the primary driver of CPT gains. Adding math, code, and synthetic translated data yields consistent improvements, including on reasoning-oriented evaluations. Within a fixed architecture, larger models typically improve performance, but architectural choices dominate scale when comparing across model families. Moreover, strong multilingual performance in the base model does not reliably predict post-CPT outcomes; robust architectures coupled with task-aligned data provide a more dependable recipe. Finally, our best models improve long-context performance, including document-level translation. Models have been released on [Huggingface](https://huggingface.co/collections/McGill-NLP/afriquellm).
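Abstract (reference translation): Large language models (LLMs) are increasingly multilingual, yet open models still underperform proprietary systems, with the gap most pronounced for African languages. Continued pre-training (CPT) offers a practical route to language adaptation, but improvements on demanding capabilities such as mathematical reasoning are often limited. This limitation is caused in part by uneven domain coverage and the lack of task-relevant knowledge that characterize many low-resource language corpora. We present AfriqueLLM, a suite of open LLMs adapted to 20 African languages through CPT on 26B tokens. We conduct a comprehensive empirical study of five base models spanning sizes and architectures, including Llama 3.1, Gemma 3, and Qwen 3, and systematically analyze how CPT data composition shapes downstream performance. In particular, we vary mixtures that include math, code, and synthetic translated data, and evaluate the resulting models on multilingual benchmarks. The results show that data composition is the primary driver of CPT gains. Adding math, code, and synthetic translated data yields consistent improvements, including on reasoning-oriented evaluations. Within a fixed architecture, larger models usually perform better, but architectural choices dominate scale when comparing across model families. Furthermore, strong multilingual performance in the base model does not reliably predict post-CPT outcomes; robust architectures combined with task-aligned data are a more dependable recipe. Finally, our best models improve long-context performance, including document-level translation. The models have been released on [Huggingface](https://huggingface.co/collections/McGill-NLP/afriquellm).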


Related papers