Related papers: An Effective Data Creation Pipeline to Generate High-quality Financial Instruction Data for Large Language Model

An Effective Data Creation Pipeline to Generate High-quality Financial Instruction Data for Large Language Model

URL: http://arxiv.org/abs/2308.01415v1
Date: Mon, 31 Jul 2023 07:23:11 GMT
Title: An Effective Data Creation Pipeline to Generate High-quality Financial Instruction Data for Large Language Model
Authors: Ziao Wang, Jianning Wang, Junda Wu, Xiaofeng Zhang
Abstract summary: This paper presents a data creation pipeline to fine-tune a large language model for financial related tasks. We initiate a dialogue between an AI investor and financial expert using ChatGPT and incorporate the feedback of human financial experts. This pipeline yielded a robust instruction tuning dataset comprised of 103k multi-turn chats.
Score: 10.589742983893787
License: http://creativecommons.org/licenses/by/4.0/
Abstract: At the beginning era of large language model, it is quite critical to generate a high-quality financial dataset to fine-tune a large language model for financial related tasks. Thus, this paper presents a carefully designed data creation pipeline for this purpose. Particularly, we initiate a dialogue between an AI investor and financial expert using ChatGPT and incorporate the feedback of human financial experts, leading to the refinement of the dataset. This pipeline yielded a robust instruction tuning dataset comprised of 103k multi-turn chats. Extensive experiments have been conducted on this dataset to evaluate the model's performance by adopting an external GPT-4 as the judge. The promising experimental results verify that our approach led to significant advancements in generating accurate, relevant, and financial-style responses from AI models, and thus providing a powerful tool for applications within the financial sector.

Related papers

FinSight: Towards Real-World Financial Deep Research [68.31086471310773]
FinSight is a novel framework for producing high-quality, multimodal financial reports.<n>To ensure professional-grade visualization, we propose an Iterative Vision-Enhanced Mechanism.<n>A two-stage Writing Framework expands concise Chain-of-Analysis segments into coherent, citation-aware, and multimodal reports.
arXiv Detail & Related papers (2025-10-19T14:05:35Z)
FinDER: Financial Dataset for Question Answering and Evaluating Retrieval-Augmented Generation [63.55583665003167]
We present FinDER, an expert-generated dataset tailored for Retrieval-Augmented Generation (RAG) in finance. FinDER focuses on annotating search-relevant evidence by domain experts, offering 5,703 query-evidence-answer triplets. By challenging models to retrieve relevant information from large corpora, FinDER offers a more realistic benchmark for evaluating RAG systems.
arXiv Detail & Related papers (2025-04-22T11:30:13Z)
ZiGong 1.0: A Large Language Model for Financial Credit [8.49779245416985]
Large Language Models (LLMs) have demonstrated strong performance across various general Natural Language Processing (NLP) tasks. However, their effectiveness in financial credit assessment applications remains suboptimal. We propose ZiGong, a Mistral-based model enhanced through multi-task supervised fine-tuning.
arXiv Detail & Related papers (2025-02-22T09:27:56Z)
RAG-IT: Retrieval-Augmented Instruction Tuning for Automated Financial Analysis [1.2891210250935148]
RAG-IT (Retrieval-Augmented Instruction Tuning) is a novel framework designed to automate the generation of earnings report analyses.<n>Our approach integrates retrieval augmentation with instruction-based fine-tuning to enhance factual accuracy, contextual relevance, and domain adaptability.
arXiv Detail & Related papers (2024-12-11T08:09:42Z)
FISHNET: Financial Intelligence from Sub-querying, Harmonizing, Neural-Conditioning, Expert Swarms, and Task Planning [2.616867378362811]
FISHNET is an agentic architecture that accomplishes highly complex analytical tasks for more than 98,000 regulatory filings. FISHNET shows remarkable performance for financial insight generation.
arXiv Detail & Related papers (2024-10-25T17:53:47Z)
A Dutch Financial Large Language Model [7.443474354626664]
FinGEITje is the first Dutch financial Large Language Model (LLM) specifically designed and optimized for various financial tasks. We release a specialized Dutch financial instruction tuning dataset with over 140,000 samples, constructed employing an automated translation and data processing method. The experimental results highlight the superior performance of FinGEITje across five critical Dutch and English financial tasks.
arXiv Detail & Related papers (2024-10-03T08:38:31Z)
SNFinLLM: Systematic and Nuanced Financial Domain Adaptation of Chinese Large Language Models [6.639972934967109]
Large language models (LLMs) have become powerful tools for advancing natural language processing applications in the financial industry. We propose a novel large language model specifically designed for the Chinese financial domain, named SNFinLLM. SNFinLLM excels in domain-specific tasks such as answering questions, summarizing financial research reports, analyzing sentiment, and executing financial calculations.
arXiv Detail & Related papers (2024-08-05T08:24:24Z)
AlphaFin: Benchmarking Financial Analysis with Retrieval-Augmented Stock-Chain Framework [48.3060010653088]
We release AlphaFin datasets, combining traditional research datasets, real-time financial data, and handwritten chain-of-thought (CoT) data. We then use AlphaFin datasets to benchmark a state-of-the-art method, called Stock-Chain, for effectively tackling the financial analysis task.
arXiv Detail & Related papers (2024-03-19T09:45:33Z)
FinBen: A Holistic Financial Benchmark for Large Language Models [75.09474986283394]
FinBen is the first extensive open-source evaluation benchmark, including 36 datasets spanning 24 financial tasks. FinBen offers several key innovations: a broader range of tasks and datasets, the first evaluation of stock trading, novel agent and Retrieval-Augmented Generation (RAG) evaluation, and three novel open-source evaluation datasets for text summarization, question answering, and stock trading.
arXiv Detail & Related papers (2024-02-20T02:16:16Z)
Is ChatGPT a Financial Expert? Evaluating Language Models on Financial Natural Language Processing [22.754757518792395]
FinLMEval is a framework for Financial Language Model Evaluation. This study compares the performance of encoder-only language models and the decoder-only language models.
arXiv Detail & Related papers (2023-10-19T11:43:15Z)
FinGPT: Instruction Tuning Benchmark for Open-Source Large Language Models in Financial Datasets [9.714447724811842]
This paper introduces a distinctive approach anchored in the Instruction Tuning paradigm for open-source large language models. We capitalize on the interoperability of open-source models, ensuring a seamless and transparent integration. The paper presents a benchmarking scheme designed for end-to-end training and testing, employing a cost-effective progression.
arXiv Detail & Related papers (2023-10-07T12:52:58Z)
PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance [63.51545277822702]
PIXIU is a comprehensive framework including the first financial large language model (LLMs) based on fine-tuning LLaMA with instruction data. We propose FinMA by fine-tuning LLaMA with the constructed dataset to be able to follow instructions for various financial tasks. We conduct a detailed analysis of FinMA and several existing LLMs, uncovering their strengths and weaknesses in handling critical financial tasks.
arXiv Detail & Related papers (2023-06-08T14:20:29Z)
Enabling and Analyzing How to Efficiently Extract Information from Hybrid Long Documents with LLMs [48.87627426640621]
This research focuses on harnessing the potential of Large Language Models to comprehend critical information from financial reports. We propose an Automated Financial Information Extraction framework that enhances LLMs' ability to comprehend and extract information from financial reports. Our framework is effectively validated on GPT-3.5 and GPT-4, yielding average accuracy increases of 53.94% and 33.77%, respectively.
arXiv Detail & Related papers (2023-05-24T10:35:58Z)
FinQA: A Dataset of Numerical Reasoning over Financial Data [52.7249610894623]
We focus on answering deep questions over financial data, aiming to automate the analysis of a large corpus of financial documents. We propose a new large-scale dataset, FinQA, with Question-Answering pairs over Financial reports, written by financial experts. The results demonstrate that popular, large, pre-trained models fall far short of expert humans in acquiring finance knowledge.
arXiv Detail & Related papers (2021-09-01T00:08:14Z)
What Makes Good In-Context Examples for GPT-$3$? [101.99751777056314]
GPT-$3$ has attracted lots of attention due to its superior performance across a wide range of NLP tasks. Despite its success, we found that the empirical results of GPT-$3$ depend heavily on the choice of in-context examples. In this work, we investigate whether there are more effective strategies for judiciously selecting in-context examples.
arXiv Detail & Related papers (2021-01-17T23:38:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.