Shai: A large language model for asset management
- URL: http://arxiv.org/abs/2312.14203v1
- Date: Thu, 21 Dec 2023 05:08:57 GMT
- Title: Shai: A large language model for asset management
- Authors: Zhongyang Guo, Guanran Jiang, Zhongdan Zhang, Peng Li, Zhefeng Wang,
and Yinchun Wang
- Abstract summary: "Shai" is a 10B-level large language model specifically designed for the asset management industry.
Shai demonstrates enhanced performance in tasks relevant to its domain, outperforming baseline models.
- Score: 8.655934598732973
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces "Shai", a 10B-level large language model specifically
designed for the asset management industry and built upon an open-source
foundational model. Through continued pre-training and fine-tuning on a
targeted corpus, Shai demonstrates enhanced performance on tasks relevant to
its domain, outperforming baseline models. Our research includes the
development of an innovative evaluation framework that integrates
professional qualification exams, tailored tasks, open-ended question
answering, and safety assessments to comprehensively assess Shai's
capabilities. Furthermore, we discuss the challenges and implications of
using large language models like GPT-4 for performance assessment in asset
management, and suggest combining automated evaluation with human judgment.
By showcasing the potential and versatility of 10B-level large language
models in the financial sector, with strong performance and modest
computational requirements, Shai's development aims to provide practical
insights and methodologies to assist industry peers in similar endeavors.
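The suggested combination of automated evaluation and human judgment could look roughly like the sketch below; the judge interface, the 1-5 rubric, and the escalation rule are illustrative assumptions, not Shai's published pipeline.

```python
from typing import Callable

def hybrid_grade(judge: Callable[[str], str], question: str, answer: str,
                 human_review: Callable[[str, str], int]) -> int:
    """Score an open-ended answer 1-5: trust the automated judge when it is
    confident, and fall back to a human grader on hard or low-scoring cases."""
    verdict = judge(
        f"Rate this asset-management answer from 1 to 5, then say "
        f"CONFIDENT or UNSURE.\nQ: {question}\nA: {answer}"
    )
    score = int(next((ch for ch in verdict if ch.isdigit()), "3"))
    if "UNSURE" in verdict.upper() or score <= 2:
        return human_review(question, answer)  # escalate to a human grader
    return score
```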
Related papers
- Design, Results and Industry Implications of the World's First Insurance Large Language Model Evaluation Benchmark [9.636604321949322]
This paper elaborates on the construction methodology, multi-dimensional evaluation system, and underlying design philosophy of CUFEInse v1.0. A comprehensive evaluation was conducted on 11 mainstream large language models.
arXiv Detail & Related papers (2025-11-11T03:19:35Z)
- Automated Capability Evaluation of Foundation Models [0.0]
Active learning for Capability Evaluation (ACE) is a novel framework for scalable, automated, and fine-grained evaluation of foundation models.
To maximize coverage and efficiency, ACE models a subject model's performance as a capability function over a latent semantic space.
This adaptive evaluation strategy enables cost-effective discovery of strengths, weaknesses, and failure modes that static benchmarks may miss.
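A minimal sketch of the adaptive idea, assuming a coverage-driven acquisition rule and a k-nearest-neighbour surrogate over task embeddings (illustrative stand-ins, not ACE's actual capability model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical task pool: each task has a latent embedding, and a hidden
# pass-rate we can only observe by actually running the subject model.
embeddings = rng.normal(size=(200, 8))
true_skill = 1 / (1 + np.exp(-embeddings[:, 0]))  # toy capability function

def run_model_on(task_id: int) -> float:
    """Stand-in for evaluating the subject model on one task."""
    return float(rng.random() < true_skill[task_id])

observed: dict[int, float] = {}

def pick_next_task() -> int:
    """Acquisition: evaluate the task farthest (in the latent space)
    from everything already evaluated, maximizing coverage."""
    if not observed:
        return 0
    seen = embeddings[list(observed)]
    dists = np.linalg.norm(embeddings[:, None, :] - seen[None, :, :], axis=2)
    return int(np.argmax(dists.min(axis=1)))

def predict(task_id: int, k: int = 5) -> float:
    """k-NN surrogate: estimate the capability function from nearby scores."""
    seen = np.array(list(observed))
    dists = np.linalg.norm(embeddings[seen] - embeddings[task_id], axis=1)
    nearest = seen[np.argsort(dists)[:k]]
    return float(np.mean([observed[t] for t in nearest]))

for _ in range(40):  # small evaluation budget versus 200 tasks
    task = pick_next_task()
    observed[task] = run_model_on(task)

est = np.array([predict(t) for t in range(len(embeddings))])
print(f"budget used: {len(observed)}; mean abs error vs true skill: "
      f"{np.abs(est - true_skill).mean():.3f}")
```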
arXiv Detail & Related papers (2025-05-22T19:09:57Z)
- The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources [100.23208165760114]
Foundation model development attracts a rapidly expanding body of contributors, scientists, and applications.
To help shape responsible development practices, we introduce the Foundation Model Development Cheatsheet.
arXiv Detail & Related papers (2024-06-24T15:55:49Z)
- Can I understand what I create? Self-Knowledge Evaluation of Large Language Models [31.85129258347539]
Large language models (LLMs) have achieved remarkable progress in linguistic tasks.
Inspired by Feynman's principle of understanding through creation, we introduce a self-knowledge evaluation framework.
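One way to read "understanding through creation" is a generate-then-answer probe: the model writes its own Q&A pairs, then answers the questions cold. A minimal sketch under that assumption (the prompts and grading rule are hypothetical, not the paper's exact protocol):

```python
from typing import Callable

def self_knowledge_score(ask: Callable[[str], str],
                         topic: str, n: int = 10) -> float:
    """Fraction of self-generated questions the model answers consistently:
    it first creates a Q&A pair, then answers its own question from scratch."""
    correct = 0
    for _ in range(n):
        qa = ask(f"Write one factual question about {topic}, then its answer.\n"
                 f"Format:\nQ: ...\nA: ...")
        question = qa.split("A:")[0].removeprefix("Q:").strip()
        reference = qa.split("A:")[-1].strip()
        answer = ask(f"Answer concisely: {question}")
        verdict = ask(f"Do these two answers agree? Reply yes or no.\n"
                      f"1) {reference}\n2) {answer}")
        correct += verdict.strip().lower().startswith("yes")
    return correct / n
```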
arXiv Detail & Related papers (2024-06-10T09:53:54Z)
- FoundaBench: Evaluating Chinese Fundamental Knowledge Capabilities of Large Language Models [64.11333762954283]
This paper introduces FoundaBench, a pioneering benchmark designed to rigorously evaluate the fundamental knowledge capabilities of Chinese LLMs.
We present an extensive evaluation of 12 state-of-the-art LLMs using FoundaBench, employing both traditional assessment methods and our CircularEval protocol to mitigate potential biases in model responses.
Our results highlight the superior performance of models pre-trained on Chinese corpora, and reveal a significant disparity between models' reasoning and memory recall capabilities.
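CircularEval-style protocols, as used in similar benchmarks, re-ask each multiple-choice question with the answer options rotated and count a pass only when the model is correct under every rotation, cancelling position bias. A minimal sketch, with a stubbed model call:

```python
from typing import Callable

LETTERS = "ABCD"

def circular_eval(ask: Callable[[str], str], question: str,
                  options: list[str], answer_idx: int) -> bool:
    """Pass only if the model picks the correct option under every
    cyclic rotation of the answer choices."""
    n = len(options)
    for shift in range(n):
        rotated = options[shift:] + options[:shift]
        # The correct option moves to position (answer_idx - shift) mod n.
        correct_letter = LETTERS[(answer_idx - shift) % n]
        prompt = question + "\n" + "\n".join(
            f"{LETTERS[i]}. {opt}" for i, opt in enumerate(rotated)
        ) + "\nAnswer with a single letter."
        if ask(prompt).strip().upper()[:1] != correct_letter:
            return False
    return True
```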
arXiv Detail & Related papers (2024-04-29T01:49:07Z)
- Towards Personalized Evaluation of Large Language Models with An Anonymous Crowd-Sourcing Platform [64.76104135495576]
We propose a novel anonymous crowd-sourcing evaluation platform, BingJian, for large language models.
Through this platform, users have the opportunity to submit their questions, testing the models on a personalized and potentially broader range of capabilities.
arXiv Detail & Related papers (2024-03-13T07:31:20Z)
- FinGPT: Instruction Tuning Benchmark for Open-Source Large Language Models in Financial Datasets [9.714447724811842]
This paper introduces a distinctive approach anchored in the Instruction Tuning paradigm for open-source large language models.
We capitalize on the interoperability of open-source models, ensuring seamless and transparent integration.
The paper presents a benchmarking scheme designed for end-to-end training and testing, employing a cost-effective progression.
arXiv Detail & Related papers (2023-10-07T12:52:58Z)
- L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs).
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
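Confidence calibration is commonly summarized by expected calibration error (ECE); a minimal sketch of that standard computation (the equal-width binning here is the usual choice, not necessarily the paper's exact setup):

```python
import numpy as np

def expected_calibration_error(confidences: np.ndarray,
                               correct: np.ndarray, n_bins: int = 10) -> float:
    """ECE: average |accuracy - confidence| per bin, weighted by bin size."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return float(ece)

# Toy usage: model confidences for generated programs vs. pass/fail results.
conf = np.array([0.9, 0.8, 0.6, 0.95, 0.4])
passed = np.array([1, 1, 0, 1, 1])
print(expected_calibration_error(conf, passed))
```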
arXiv Detail & Related papers (2023-09-29T17:57:00Z)
- Evaluating the Generation Capabilities of Large Chinese Language Models [27.598864484231477]
This paper unveils CG-Eval, the first comprehensive and automated evaluation framework for assessing the generative capabilities of large Chinese language models across a spectrum of academic disciplines.
Gscore automates the quality measurement of a model's text generation against reference standards.
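The abstract does not spell out how Gscore works; one plausible minimal sketch of reference-based scoring is a weighted blend of simple text-similarity signals (the metrics and weights below are assumptions, not CG-Eval's definition):

```python
from difflib import SequenceMatcher

def composite_score(candidate: str, reference: str,
                    weights: tuple[float, float] = (0.5, 0.5)) -> float:
    """Blend of character-level similarity and token-overlap recall,
    standing in for whatever component metrics Gscore actually combines."""
    char_sim = SequenceMatcher(None, candidate, reference).ratio()
    ref_tokens = set(reference.split())
    recall = len(ref_tokens & set(candidate.split())) / max(len(ref_tokens), 1)
    return weights[0] * char_sim + weights[1] * recall

print(composite_score("the cat sat on the mat", "a cat sat on a mat"))
```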
arXiv Detail & Related papers (2023-08-09T09:22:56Z)
- INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large Language Models [39.46610170563634]
INSTRUCTEVAL is a more comprehensive evaluation suite designed specifically for instruction-tuned large language models.
We take a holistic approach to analyze various factors affecting model performance, including the pretraining foundation, instruction-tuning data, and training methods.
Our findings reveal that the quality of instruction data is the most crucial factor in scaling model performance.
arXiv Detail & Related papers (2023-06-07T20:12:29Z)
- Towards Better Instruction Following Language Models for Chinese: Investigating the Impact of Training Data and Evaluation [12.86275938443485]
We examine the influence of training data factors, including quantity, quality, and linguistic distribution, on model performance.
We assess various models using an evaluation set of 1,000 samples, encompassing nine real-world scenarios.
We extend the vocabulary of LLaMA, the open-source model whose performance is closest to proprietary language models like GPT-3.
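Extending an open model's vocabulary for a new language typically means adding tokens and resizing the embedding matrix. A minimal sketch with the Hugging Face Transformers API (gpt2 stands in because LLaMA weights are gated, and the added tokens are placeholders for a subword vocabulary learned on the target corpus):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small open stand-in model; the paper extends LLaMA itself.
name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# Illustrative new Chinese tokens; a real extension adds thousands.
added = tokenizer.add_tokens(["你好", "模型", "评测"])
model.resize_token_embeddings(len(tokenizer))  # new rows are freshly initialized

print(f"added {added} tokens; vocab size is now {len(tokenizer)}")
# Continual pre-training is then needed so the new embeddings become useful.
```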
arXiv Detail & Related papers (2023-04-16T18:37:39Z)
- Feeding What You Need by Understanding What You Learned [54.400455868448695]
Machine Reading Comprehension (MRC) is the ability to understand a given text passage and answer questions based on it.
Existing MRC research relies heavily on large models and large corpora to improve performance as evaluated by metrics such as Exact Match.
We argue that a deep understanding of model capabilities and data properties can help us feed a model with appropriate training data.
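One simple instantiation of capability-aware data feeding is to over-sample training examples the current model still finds hard. A toy selection loop under that assumption (the synthetic losses and softmax weighting are illustrative, not the paper's method):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-example losses from evaluating the current model on a
# candidate training pool; higher loss = the model handles it worse.
pool_losses = rng.gamma(shape=2.0, scale=0.5, size=1000)

def select_batch(losses: np.ndarray, batch_size: int = 32,
                 temperature: float = 1.0) -> np.ndarray:
    """Sample examples with probability increasing in current loss,
    so training keeps feeding what the model has not yet learned."""
    weights = np.exp(losses / temperature)
    probs = weights / weights.sum()
    return rng.choice(len(losses), size=batch_size, replace=False, p=probs)

batch = select_batch(pool_losses)
print(f"mean loss of selected batch: {pool_losses[batch].mean():.3f} "
      f"vs pool: {pool_losses.mean():.3f}")
```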
arXiv Detail & Related papers (2022-03-05T14:15:59Z)
- RADDLE: An Evaluation Benchmark and Analysis Platform for Robust Task-oriented Dialog Systems [75.87418236410296]
We introduce the RADDLE benchmark, a collection of corpora and tools for evaluating the performance of models across a diverse set of domains.
RADDLE is designed to favor and encourage models with a strong generalization ability.
We evaluate recent state-of-the-art systems based on pre-training and fine-tuning, and find that grounded pre-training on heterogeneous dialog corpora performs better than training a separate model per domain.
arXiv Detail & Related papers (2020-12-29T08:58:49Z)