Steel-LLM: From Scratch to Open Source -- A Personal Journey in Building a Chinese-Centric LLM
- URL: http://arxiv.org/abs/2502.06635v2
- Date: Thu, 13 Feb 2025 07:31:55 GMT
- Title: Steel-LLM: From Scratch to Open Source -- A Personal Journey in Building a Chinese-Centric LLM
- Authors: Qingshui Gu, Shu Li, Tianyu Zheng, Zhaoxiang Zhang
- Abstract summary: Steel-LLM is a Chinese-centric language model developed from scratch with the goal of creating a high-quality, open-source model.
This paper provides a comprehensive summary of the project's key contributions, including data collection, model design, training methodologies, and the challenges encountered along the way.
- License:
- Abstract: Steel-LLM is a Chinese-centric language model developed from scratch with the goal of creating a high-quality, open-source model despite limited computational resources. Launched in March 2024, the project aimed to train a 1-billion-parameter model on a large-scale dataset, prioritizing transparency and the sharing of practical insights to assist others in the community. The training process primarily focused on Chinese data, with a small proportion of English data included, addressing gaps in existing open-source LLMs by providing a more detailed and practical account of the model-building journey. Steel-LLM has demonstrated competitive performance on benchmarks such as CEVAL and CMMLU, outperforming early models from larger institutions. This paper provides a comprehensive summary of the project's key contributions, including data collection, model design, training methodologies, and the challenges encountered along the way, offering a valuable resource for researchers and practitioners looking to develop their own LLMs. The model checkpoints and training script are available at https://github.com/zhanshijinwat/Steel-LLM.
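For readers who want to try the released model, a minimal sketch of loading a Steel-LLM checkpoint with the Hugging Face transformers library is shown below. The Hub repository id used here is an assumption for illustration only; the authoritative checkpoint locations and usage instructions are documented in the GitHub repository linked above.

```python
# Minimal sketch: loading a released Steel-LLM checkpoint with Hugging Face transformers.
# The repository id below is a hypothetical placeholder; see the project's GitHub README
# (https://github.com/zhanshijinwat/Steel-LLM) for the official checkpoint location.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zhanshijin/Steel-LLM"  # assumption: replace with the officially released repo id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Generate a short Chinese completion to sanity-check the checkpoint.
prompt = "请简要介绍一下大语言模型。"  # "Briefly introduce large language models."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```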
Related papers
- Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models [90.46966584238682]
Most open-source vision-language models only publish their final model weights, leaving critical details of data strategies and implementation largely opaque.
In this work, we address VLM post-training from a data-centric perspective, showing the key role of data strategy in developing frontier VLMs.
By studying and building our post-training data strategy from scratch, we share detailed insights into the development processes, aiming to benefit the development of competitive models for the open-source community.
arXiv Detail & Related papers (2025-01-20T18:40:47Z)
- LLMic: Romanian Foundation Language Model [76.09455151754062]
We present LLMic, a foundation language model designed specifically for the Romanian language.
We show that fine-tuning LLMic for language translation after the initial pretraining phase outperforms existing solutions in English-to-Romanian translation tasks.
arXiv Detail & Related papers (2025-01-13T22:14:45Z)
- Dynamic data sampler for cross-language transfer learning in large language models [34.464472766868106]
ChatFlow is a cross-language transfer-based Large Language Model (LLM).
We employ a mix of Chinese, English, and parallel corpus to continuously train the LLaMA2 model.
Experimental results demonstrate that our approach accelerates model convergence and achieves superior performance.
arXiv Detail & Related papers (2024-05-17T08:40:51Z)
- Tele-FLM Technical Report [96.19923831660266]
We introduce Tele-FLM (aka FLM-2), a 52B open-sourced multilingual large language model.
It features a stable, efficient pre-training paradigm and enhanced factual judgment capabilities.
It is comparable to strong open-sourced models that involve larger pre-training FLOPs, such as Llama2-70B and DeepSeek-67B.
arXiv Detail & Related papers (2024-04-25T14:34:47Z)
- GeMQuAD: Generating Multilingual Question Answering Datasets from Large Language Models using Few Shot Learning [4.8838210812204235]
In this paper, we propose GeMQuAD - a semi-supervised learning approach, applied to a dataset generated through ICL with just one example in the target language.
We iteratively identify high-quality data to enhance model performance, especially in low-resource multilingual settings.
Our framework outperforms the machine translation-augmented model by 0.22/1.68 F1/EM points for Hindi and 0.82/1.37 F1/EM points for Spanish on the MLQA dataset.
arXiv Detail & Related papers (2024-04-14T06:55:42Z)
- Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model [36.01840141194335]
We introduce CT-LLM, a 2B-parameter large language model (LLM).
Uniquely initiated from scratch, CT-LLM diverges from the conventional methodology by primarily incorporating Chinese textual data.
CT-LLM excels in Chinese language tasks, and showcases its adeptness in English through SFT.
arXiv Detail & Related papers (2024-04-05T15:20:02Z)
- YAYI 2: Multilingual Open-Source Large Language Models [53.92832054643197]
We propose YAYI 2, including both base and chat models, with 30 billion parameters.
YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline.
The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-12-22T17:34:47Z)
- CLEVA: Chinese Language Models EVAluation Platform [92.42981537317817]
We present CLEVA, a user-friendly platform crafted to holistically evaluate Chinese LLMs.
Our platform employs a standardized workflow to assess LLMs' performance across various dimensions, regularly updating a competitive leaderboard.
To alleviate contamination, CLEVA curates a significant proportion of new data and develops a sampling strategy that guarantees a unique subset for each leaderboard round.
Empowered by an easy-to-use interface that requires just a few mouse clicks and a model API, users can conduct a thorough evaluation with minimal coding.
arXiv Detail & Related papers (2023-08-09T09:11:31Z)