KemenkeuGPT: Leveraging a Large Language Model on Indonesia's Government Financial Data and Regulations to Enhance Decision Making
- URL: http://arxiv.org/abs/2407.21459v1
- Date: Wed, 31 Jul 2024 09:16:33 GMT
- Authors: Gilang Fajar Febrian, Grazziela Figueredo
- Abstract summary: This study investigates the potential of Large Language Models to address Indonesia's financial data and regulations.
This study undertakes an iterative process to develop KemenkeuGPT using LangChain with Retrieval-Augmented Generation (RAG), prompt engineering and fine-tuning.
The model's accuracy improved from 35% to 61%, with correctness increasing from 48% to 64%.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data is crucial for evidence-based policymaking and for enhancing public services, including those at the Ministry of Finance of the Republic of Indonesia. However, the complexity and dynamic nature of governmental financial data and regulations can hinder decision-making. This study investigates the potential of Large Language Models (LLMs) to address these challenges, focusing on Indonesia's financial data and regulations. While LLMs are effective in the financial sector, their use in Indonesia's public sector remains unexplored. This study undertakes an iterative process to develop KemenkeuGPT using LangChain with Retrieval-Augmented Generation (RAG), prompt engineering and fine-tuning. A dataset spanning 2003 to 2023 was collected from the Ministry of Finance, Statistics Indonesia and the International Monetary Fund (IMF). Surveys and interviews with Ministry officials informed the model's enhancement and fine-tuning. We evaluated the model using human feedback, LLM-based evaluation and benchmarking. The model's accuracy improved from 35% to 61%, with correctness increasing from 48% to 64%. The Retrieval-Augmented Generation Assessment (RAGAS) framework showed that KemenkeuGPT achieved 44% correctness with 73% faithfulness, 40% precision and 60% recall, outperforming several other base models. An interview with an expert from the Ministry of Finance indicated that KemenkeuGPT has the potential to become an essential tool for decision-making. These results are expected to improve with continuous human feedback.
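The retrieve-then-generate pattern that KemenkeuGPT builds on can be illustrated with a minimal sketch. This is a hypothetical toy, not the authors' LangChain pipeline: keyword overlap stands in for vector retrieval, a format string stands in for the LLM call, and the corpus strings are invented examples.

```python
def retrieve(query, documents, k=2):
    """Toy retriever: rank documents by word overlap with the query, keep top k.
    A production RAG system would embed texts and search a vector store instead."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def generate(query, context):
    """Stand-in for an LLM call: builds the grounded prompt an LLM would answer."""
    return f"Answer using only this context: {' | '.join(context)}\nQuestion: {query}"


# Invented example corpus (illustrative only).
corpus = [
    "Regulation PMK 2023 governs ministry budget execution.",
    "GDP statistics are published by Statistics Indonesia.",
    "The IMF provides fiscal monitoring data.",
]

question = "Which regulation governs budget execution?"
context = retrieve(question, corpus)
prompt = generate(question, context)
```

Swapping `retrieve` for embedding search over a vector store and `generate` for an actual LLM call yields the standard LangChain RAG loop the paper describes.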
Related papers
- DSBench: How Far Are Data Science Agents to Becoming Data Science Experts?
We introduce DSBench, a benchmark designed to evaluate data science agents with realistic tasks.
This benchmark includes 466 data analysis tasks and 74 data modeling tasks, sourced from Eloquence and Kaggle competitions.
Our evaluation of state-of-the-art LLMs, LVLMs, and agents shows that they struggle with most tasks, with the best agent solving only 34.12% of data analysis tasks and achieving a 34.74% Relative Performance Gap (RPG).
arXiv Detail & Related papers (2024-09-12T02:08:00Z) - Efficacy of Large Language Models in Systematic Reviews
This study investigates the effectiveness of Large Language Models (LLMs) in interpreting existing literature.
We compiled and hand-coded a database of 88 relevant papers published from March 2020 to May 2024.
We evaluated two current state-of-the-art LLMs, Meta AI's Llama 3 8B and OpenAI's GPT-4o, on the accuracy of their interpretations.
arXiv Detail & Related papers (2024-08-03T00:01:13Z) - Fine-Tuning Gemma-7B for Enhanced Sentiment Analysis of Financial News Headlines
We use Natural Language Processing (NLP) and Large Language Models (LLM) to analyze sentiment from the perspective of retail investors.
We fine-tune several models, including distilbert-base-uncased, Llama, and gemma-7b, to evaluate their effectiveness in sentiment classification.
Our experiments demonstrate that the fine-tuned gemma-7b model outperforms others, achieving the highest precision, recall, and F1 score.
arXiv Detail & Related papers (2024-06-19T15:20:19Z) - DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data
We introduce an approach to generate extensive Lean 4 proof data derived from high-school and undergraduate-level mathematical competition problems.
We fine-tune the DeepSeekMath 7B model on this synthetic dataset, which comprises 8 million formal statements with proofs.
Our model successfully proved 5 out of 148 problems in the Lean 4 Formalized International Mathematical Olympiad (FIMO) benchmark, while GPT-4 failed to prove any.
arXiv Detail & Related papers (2024-05-23T09:03:42Z) - ESGReveal: An LLM-based approach for extracting structured data from ESG reports
ESGReveal is an innovative method proposed for efficiently extracting and analyzing Environmental, Social, and Governance (ESG) data from corporate reports.
This approach utilizes Large Language Models (LLM) enhanced with Retrieval Augmented Generation (RAG) techniques.
Its efficacy was appraised using ESG reports from 166 companies across various sectors listed on the Hong Kong Stock Exchange in 2022.
arXiv Detail & Related papers (2023-12-25T06:44:32Z) - CSPRD: A Financial Policy Retrieval Dataset for Chinese Stock Market
We propose a new task, policy retrieval, by introducing the Chinese Stock Policy Retrieval dataset (CSPRD).
CSPRD provides 700+ passages labeled by experienced experts with relevant articles from 10k+ entries in our collected Chinese policy corpus.
Our best performing baseline achieves 56.1% MRR@10, 28.5% NDCG@10, 37.5% Recall@10 and 80.6% Precision@10 on dev set.
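For readers unfamiliar with the ranking metrics quoted above: MRR@10 averages the reciprocal rank of the first relevant passage per query, and Recall@10 averages the fraction of each query's relevant passages found in the top 10. A minimal sketch of both, illustrative rather than the CSPRD evaluation code:

```python
def mrr_at_k(ranked_relevance, k=10):
    """Mean reciprocal rank: 1/rank of the first relevant item in the top k,
    averaged over queries. Each element of ranked_relevance is a per-query
    list of 0/1 labels in ranked order."""
    total = 0.0
    for labels in ranked_relevance:
        for rank, relevant in enumerate(labels[:k], start=1):
            if relevant:
                total += 1.0 / rank
                break  # only the first relevant item counts
    return total / len(ranked_relevance)


def recall_at_k(ranked_relevance, relevant_counts, k=10):
    """Average fraction of each query's relevant items retrieved in the top k.
    relevant_counts gives the total number of relevant items per query."""
    fractions = [
        sum(labels[:k]) / count
        for labels, count in zip(ranked_relevance, relevant_counts)
    ]
    return sum(fractions) / len(fractions)
```

For two queries whose first relevant passages sit at ranks 2 and 1, MRR@10 is (1/2 + 1)/2 = 0.75.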
arXiv Detail & Related papers (2023-09-08T15:40:54Z) - FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for Large Language Models
FinEval is a benchmark for financial domain knowledge in large language models (LLMs).
FinEval employs a range of prompt types, including zero-shot and few-shot prompts, as well as answer-only and chain-of-thought prompts.
The results show that only GPT-4 achieved an accuracy close to 70% across different prompt settings.
arXiv Detail & Related papers (2023-08-19T10:38:00Z) - Biomedical image analysis competitions: The state of current participation practice
We designed a survey to shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis.
The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics.
Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures.
arXiv Detail & Related papers (2022-12-16T16:44:46Z) - Prompting GPT-3 To Be Reliable
This work decomposes reliability into four facets: generalizability, fairness, calibration, and factuality.
We find that GPT-3 outperforms smaller-scale supervised models by large margins on all these facets.
arXiv Detail & Related papers (2022-10-17T14:52:39Z) - Trends in eBusiness and eGovernment
The first chapter is a critical review and a case study in eBusiness, with special attention to the digital currencies resource.
The second chapter attempts to incorporate the UTAUT model with perceived risk theory to explore its impact on the intention to use m-government services.
The third chapter aims to assess the level of gender inclusivity in the municipal e-procurement processes in the City of Johannesburg.
arXiv Detail & Related papers (2021-04-02T17:53:17Z) - Explanations of Machine Learning predictions: a mandatory step for its application to Operational Processes
Credit risk modelling plays a paramount role in operational processes.
Recent machine and deep learning techniques have been applied to the task.
We suggest using the LIME technique to tackle the explainability problem in this field.
arXiv Detail & Related papers (2020-12-30T10:27:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.