Extending the Pre-Training of BLOOM for Improved Support of Traditional
Chinese: Models, Methods and Results
- URL: http://arxiv.org/abs/2303.04715v2
- Date: Fri, 23 Jun 2023 14:54:03 GMT
- Title: Extending the Pre-Training of BLOOM for Improved Support of Traditional
Chinese: Models, Methods and Results
- Authors: Philipp Ennen, Po-Chun Hsu, Chan-Jan Hsu, Chang-Le Liu, Yen-Chen Wu,
Yin-Hsiang Liao, Chin-Tung Lin, Da-Shan Shiu, Wei-Yun Ma
- Abstract summary: BLOOM-zh has its origins in the open-source BLOOM models presented by BigScience in 2022.
We extended the pre-training of BLOOM by an additional 7.4 billion tokens in Traditional Chinese and English, covering a variety of domains.
BLOOM-zh outperforms its predecessor on most Traditional Chinese benchmarks while maintaining its English capability.
- Score: 12.00277814051069
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper we present the multilingual language model BLOOM-zh that
features enhanced support for Traditional Chinese. BLOOM-zh has its origins in
the open-source BLOOM models presented by BigScience in 2022. Starting from
released models, we extended the pre-training of BLOOM by an additional 7.4
billion tokens in Traditional Chinese and English, covering a variety of domains
such as news articles, books, encyclopedias, and educational materials, as well as
spoken language. To show the properties of BLOOM-zh, we use both existing and
newly created benchmark scenarios to evaluate its performance.
BLOOM-zh outperforms its predecessor on most Traditional Chinese benchmarks
while maintaining its English capability. We release all our models to the
research community.
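As a rough illustration of the continued pre-training described above, the sketch below resumes causal language modeling from a released BLOOM checkpoint on an additional plain-text corpus using Hugging Face Transformers. The checkpoint size, corpus file name, and hyperparameters are illustrative assumptions, not the configuration used for BLOOM-zh.

# Minimal sketch of extended pre-training from a released BLOOM checkpoint.
# The model size, corpus file, and hyperparameters are illustrative assumptions,
# not the actual BLOOM-zh training recipe.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "bigscience/bloom-1b1"  # a small public BLOOM variant, used here for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical plain-text corpus mixing Traditional Chinese and English documents.
corpus = load_dataset("text", data_files={"train": "zh_tw_en_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

train_set = corpus["train"].map(tokenize, batched=True, remove_columns=["text"])

# Causal language modeling: next-token prediction, no masking objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="bloom-zh-continued",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,
    learning_rate=1e-5,
    num_train_epochs=1,
    logging_steps=100,
)

Trainer(model=model, args=args, train_dataset=train_set, data_collator=collator).train()

Starting from a released checkpoint rather than training from scratch is what lets the extended model shift toward the new target language while, as the abstract notes, maintaining the original English capability.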
Related papers
- WenyanGPT: A Large Language Model for Classical Chinese Tasks [36.380841559581945]
Existing natural language processing models are primarily optimized for Modern Chinese, resulting in inadequate performance on Classical Chinese.
By continuing pre-training and instruction fine-tuning on the LLaMA3-8B-Chinese model, we construct a large language model, WenyanGPT, which is specifically designed for Classical Chinese tasks.
arXiv Detail & Related papers (2025-04-29T10:19:05Z)
- MERaLiON-TextLLM: Cross-Lingual Understanding of Large Language Models in Chinese, Indonesian, Malay, and Singlish [17.36441080071885]
This report presents MERaLiON-TextLLM, a series of open-source language models specifically tailored to improve understanding and generation in Chinese, Indonesian, Malay, and Singlish.
Our approach achieves performance improvements across benchmarks in these languages, exceeding the capabilities of the official Llama-3 models.
arXiv Detail & Related papers (2024-12-21T05:50:48Z)
- Tele-FLM Technical Report [96.19923831660266]
We introduce Tele-FLM (aka FLM-2), a 52B open-sourced multilingual large language model.
It features a stable, efficient pre-training paradigm and enhanced factual judgment capabilities.
It is comparable to strong open-sourced models that involve larger pre-training FLOPs, such as Llama2-70B and DeepSeek-67B.
arXiv Detail & Related papers (2024-04-25T14:34:47Z)
- Bailong: Bilingual Transfer Learning based on QLoRA and Zip-tie Embedding [0.0]
Large language models (LLMs) have demonstrated exceptional performance in various NLP applications.
The majority of open-source LLMs are pre-trained primarily on English data, with only a small share of other languages.
We present Bailong 7B, along with a fine-tuned version optimized for multi-turn dialogue scenarios.
arXiv Detail & Related papers (2024-04-01T02:04:44Z)
- Skywork: A More Open Bilingual Foundation Model [55.927396986873816]
We present Skywork-13B, a family of large language models (LLMs) trained on a corpus of over 3.2 trillion tokens drawn from both English and Chinese texts.
We show that our model not only excels on popular benchmarks, but also achieves state-of-the-art performance in Chinese language modeling across diverse domains.
arXiv Detail & Related papers (2023-10-30T08:31:47Z)
- BasahaCorpus: An Expanded Linguistic Resource for Readability Assessment in Central Philippine Languages [8.64545246732563]
We introduce and release BasahaCorpus as part of an initiative aimed at expanding available corpora and baseline models for readability assessment in lower resource languages in the Philippines.
We compiled a corpus of short fictional narratives written in Hiligaynon, Minasbate, Karay-a, and Rinconada languages.
We propose a new hierarchical cross-lingual modeling approach that takes advantage of a language's placement in the family tree to increase the amount of available training data.
arXiv Detail & Related papers (2023-10-17T21:05:20Z)
- Baichuan 2: Open Large-scale Language Models [51.56361715162972]
We present Baichuan 2, a series of large-scale multilingual language models containing 7 billion and 13 billion parameters, trained from scratch on 2.6 trillion tokens.
Baichuan 2 matches or outperforms other open-source models of similar size on public benchmarks like MMLU, CMMLU, GSM8K, and HumanEval.
arXiv Detail & Related papers (2023-09-19T04:13:22Z)
- Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z)
- WYWEB: A NLP Evaluation Benchmark For Classical Chinese [10.138128038929237]
We introduce the WYWEB evaluation benchmark, which consists of nine NLP tasks in classical Chinese.
We evaluate existing pre-trained language models, all of which struggle with this benchmark.
arXiv Detail & Related papers (2023-05-23T15:15:11Z)
- Investigating the Translation Performance of a Large Multilingual Language Model: the Case of BLOOM [8.858671209228536]
We focus on BLOOM's multilingual ability by evaluating its machine translation performance across several datasets.
We study several aspects including prompt design, model sizes, cross-lingual transfer and the use of discursive context.
arXiv Detail & Related papers (2023-03-03T13:23:42Z)
- BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting [50.24676567971536]
The BLOOM model is a large publicly available multilingual language model, but its pretraining was limited to 46 languages.
We apply existing language adaptation strategies to BLOOM and benchmark its zero-shot prompting performance on eight new languages; a minimal prompting sketch follows this entry.
We conclude that, with sufficient training data, language adaptation can generalize well to diverse languages.
arXiv Detail & Related papers (2022-12-19T15:24:45Z)
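Zero-shot prompting, as evaluated in the BLOOM+1 entry above, means the model receives only a task description and an input, with no in-language training examples. The sketch below is a minimal illustration under assumed settings; the checkpoint size, prompt template, and task are illustrative choices, not that paper's benchmark setup.

# Minimal sketch of zero-shot prompting with a public BLOOM checkpoint.
# Checkpoint, prompt, and task are illustrative assumptions only.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-1b1"  # small public BLOOM variant, chosen for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Zero-shot prompt: a task framing and the input, no labeled demonstrations.
prompt = (
    "Review: the food was wonderful and the staff were friendly.\n"
    "Sentiment (positive or negative):"
)
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=3, do_sample=False)
# Print only the newly generated continuation, not the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))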
- BLOOM: A 176B-Parameter Open-Access Multilingual Language Model [264.96498474333697]
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions.
We present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers.
BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages.
arXiv Detail & Related papers (2022-11-09T18:48:09Z)
- Revisiting and Advancing Chinese Natural Language Understanding with Accelerated Heterogeneous Knowledge Pre-training [25.510288465345592]
In contrast to English, the natural language processing (NLP) community lacks high-performing open-source Chinese KEPLMs to support various language understanding applications.
Here, we revisit and advance the development of Chinese natural language understanding with a series of novel Chinese KEPLMs released in various parameter sizes.
Specifically, both relational and linguistic knowledge are effectively injected into CKBERT based on two novel pre-training tasks.
arXiv Detail & Related papers (2022-10-11T09:34:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.