BBT-Fin: Comprehensive Construction of Chinese Financial Domain
Pre-trained Language Model, Corpus and Benchmark
- URL: http://arxiv.org/abs/2302.09432v1
- Date: Sat, 18 Feb 2023 22:20:37 GMT
- Authors: Dakuan Lu, Jiaqing Liang, Yipei Xu, Qianyu He, Yipeng Geng, Mengkun
Han, Yingsi Xin, Hengkui Wu, Yanghua Xiao
- Abstract summary: We introduce BBT-FinT5, a new Chinese financial pre-training language model based on the T5 model.
To support this effort, we have built BBT-FinCorpus, a large-scale financial corpus with approximately 300GB of raw text from four different sources.
- Score: 12.457193087920183
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: To advance Chinese financial natural language processing (NLP), we introduce
BBT-FinT5, a new Chinese financial pre-training language model based on the T5
model. To support this effort, we have built BBT-FinCorpus, a large-scale
financial corpus with approximately 300GB of raw text from four different
sources. In general domain NLP, comprehensive benchmarks like GLUE and
SuperGLUE have driven significant advancements in language model pre-training
by enabling head-to-head comparisons among models. Drawing inspiration from
these benchmarks, we propose BBT-CFLEB, a Chinese Financial Language
understanding and generation Evaluation Benchmark, which includes six datasets
covering both understanding and generation tasks. Our aim is to facilitate
research in the development of NLP within the Chinese financial domain. Our
model, corpus and benchmark are released at
https://github.com/ssymmetry/BBT-FinCUGE-Applications. Our work belongs to the
Big Bang Transformer (BBT), a large-scale pre-trained language model project.
Related papers
- Tele-FLM Technical Report [96.19923831660266]
We introduce Tele-FLM (aka FLM-2), a 52B open-sourced multilingual large language model.
It features a stable, efficient pre-training paradigm and enhanced factual judgment capabilities.
It is comparable to strong open-sourced models that involve larger pre-training FLOPs, such as Llama2-70B and DeepSeek-67B.
arXiv Detail & Related papers (2024-04-25T14:34:47Z)
- FinTral: A Family of GPT-4 Level Multimodal Financial Large Language Models [18.280762424107408]
FinTral is a suite of state-of-the-art multimodal large language models (LLMs) built upon the Mistral-7b model.
We enhance FinTral with domain-specific pretraining, instruction fine-tuning, and RLAIF training.
Our FinTral model trained with direct preference optimization employing advanced Tools and Retrieval methods, dubbed FinTral-DPO-T&R, demonstrates an exceptional zero-shot performance.
arXiv Detail & Related papers (2024-02-16T05:05:12Z)
- CFBenchmark: Chinese Financial Assistant Benchmark for Large Language Model [22.127509074325324]
Large language models (LLMs) have demonstrated great potential in the financial domain.
In this work, we introduce CFBenchmark to evaluate the performance of LLMs as Chinese financial assistants.
arXiv Detail & Related papers (2023-11-10T01:12:03Z)
- FinGPT: Large Generative Models for a Small Language [48.46240937758779]
We create large language models (LLMs) for Finnish, a language spoken by less than 0.1% of the world population.
We train seven monolingual models from scratch (186M to 13B parameters) dubbed FinGPT.
We continue the pretraining of the multilingual BLOOM model on a mix of its original training data and Finnish, resulting in a 176 billion parameter model we call BLUUMI.
arXiv Detail & Related papers (2023-11-03T08:05:04Z)
- Skywork: A More Open Bilingual Foundation Model [55.927396986873816]
We present Skywork-13B, a family of large language models (LLMs) trained on a corpus of over 3.2 trillion tokens drawn from both English and Chinese texts.
We show that our model not only excels on popular benchmarks, but also achieves state-of-the-art performance in Chinese language modeling on diverse domains.
arXiv Detail & Related papers (2023-10-30T08:31:47Z)
- DISC-FinLLM: A Chinese Financial Large Language Model based on Multiple Experts Fine-tuning [74.99318727786337]
We propose a Multiple Experts Fine-tuning Framework to build a financial large language model (LLM).
We build a financial instruction-tuning dataset named DISC-FIN-SFT, including instruction samples of four categories (consulting, NLP tasks, computing and retrieval-augmented generation).
Evaluations conducted on multiple benchmarks demonstrate that our model performs better than baseline models in various financial scenarios.
arXiv Detail & Related papers (2023-10-23T11:33:41Z)
- CFGPT: Chinese Financial Assistant with Large Language Model [21.54229667774752]
We present a Chinese Financial Generative Pre-trained Transformer framework, named CFGPT.
CFData comprises both a pre-training dataset and a supervised fine-tuning dataset.
CFLLM is trained on CFData in two stages: continued pre-training and supervised fine-tuning.
arXiv Detail & Related papers (2023-09-19T14:34:01Z)
- PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance [63.51545277822702]
PIXIU is a comprehensive framework including the first financial large language model (LLM) based on fine-tuning LLaMA with instruction data.
We propose FinMA by fine-tuning LLaMA with the constructed dataset to be able to follow instructions for various financial tasks.
We conduct a detailed analysis of FinMA and several existing LLMs, uncovering their strengths and weaknesses in handling critical financial tasks.
arXiv Detail & Related papers (2023-06-08T14:20:29Z)
- WHEN FLUE MEETS FLANG: Benchmarks and Large Pre-trained Language Model for Financial Domain [42.093876880881886]
We propose a novel domain-specific Financial LANGuage model (FLANG).
It uses financial keywords and phrases for better masking, together with a span boundary objective and an in-filling objective.
Our models, code and benchmark data are publicly available on Github and Huggingface.
arXiv Detail & Related papers (2022-10-31T18:35:18Z)
- Evaluation of Transfer Learning for Polish with a Text-to-Text Model [54.81823151748415]
We introduce a new benchmark for assessing the quality of text-to-text models for Polish.
The benchmark consists of diverse tasks and datasets: KLEJ benchmark adapted for text-to-text, en-pl translation, summarization, and question answering.
We present plT5 - a general-purpose text-to-text model for Polish that can be fine-tuned on various Natural Language Processing (NLP) tasks with a single training objective.
arXiv Detail & Related papers (2022-05-18T09:17:14Z)
- FinBERT: A Pretrained Language Model for Financial Communications [25.900063840368347]
There are no pretrained finance-specific language models available.
We address this need by pretraining a financial domain-specific BERT model, FinBERT, using a large corpus of financial communications.
Experiments on three financial sentiment classification tasks confirm the advantage of FinBERT over the generic-domain BERT model.
arXiv Detail & Related papers (2020-06-15T02:51:06Z)