InsQABench: Benchmarking Chinese Insurance Domain Question Answering with Large Language Models
- URL: http://arxiv.org/abs/2501.10943v1
- Date: Sun, 19 Jan 2025 04:53:20 GMT
- Title: InsQABench: Benchmarking Chinese Insurance Domain Question Answering with Large Language Models
- Authors: Jing Ding, Kai Feng, Binbin Lin, Jiarui Cai, Qiushi Wang, Yu Xie, Xiaojin Zhang, Zhongyu Wei, Wei Chen
- Abstract summary: InsQABench is a benchmark dataset for the Chinese insurance sector.
It is structured into three categories: Insurance Commonsense Knowledge, Insurance Structured Database, and Insurance Unstructured Documents.
Evaluations show that fine-tuning on InsQABench significantly improves performance.
- Score: 29.948490682244923
- Abstract: The application of large language models (LLMs) has achieved remarkable success in various fields, but their effectiveness in specialized domains like the Chinese insurance industry remains underexplored. The complexity of insurance knowledge, encompassing specialized terminology and diverse data types, poses significant challenges for both models and users. To address this, we introduce InsQABench, a benchmark dataset for the Chinese insurance sector, structured into three categories: Insurance Commonsense Knowledge, Insurance Structured Database, and Insurance Unstructured Documents, reflecting real-world insurance question-answering tasks. We also propose two methods, SQL-ReAct and RAG-ReAct, to tackle challenges in structured and unstructured data tasks. Evaluations show that while LLMs struggle with domain-specific terminology and nuanced clause texts, fine-tuning on InsQABench significantly improves performance. Our benchmark establishes a solid foundation for advancing LLM applications in the insurance domain, with data and code available at https://github.com/HaileyFamo/InsQABench.git.
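The abstract names SQL-ReAct for the structured-database tasks but does not describe its internals; the following is a hypothetical sketch of a generic ReAct-style text-to-SQL loop in that spirit, not the paper's actual implementation. The `generate` callable stands in for any LLM call, and all prompt strings are illustrative.

```python
# Hypothetical ReAct-style SQL agent loop (illustrative only, not the paper's
# SQL-ReAct): the model alternates between proposing a SQL query and observing
# the execution result, feeding errors back so the query can be repaired.
import sqlite3

def sql_react(question, db_path, generate, max_steps=3):
    """`generate` is a placeholder for any LLM call mapping a prompt to text."""
    conn = sqlite3.connect(db_path)
    history = f"Question: {question}\n"
    for _ in range(max_steps):
        sql = generate(history + "Thought: write a SQL query.\nSQL:")
        try:
            rows = conn.execute(sql).fetchall()
            observation = f"Observation: {rows}"
        except sqlite3.Error as e:
            # Feed the database error back so the next step can fix the query.
            observation = f"Observation: error: {e}"
        history += f"SQL: {sql}\n{observation}\n"
        if not observation.startswith("Observation: error"):
            # Successful execution: ask the model to answer from the result.
            return generate(history + "Answer the question from the observation:")
    return None  # gave up after max_steps failed queries
```

The same loop shape plausibly carries over to RAG-ReAct by swapping SQL execution for document retrieval as the observation step.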
Related papers
- LaRA: Benchmarking Retrieval-Augmented Generation and Long-Context LLMs - No Silver Bullet for LC or RAG Routing [70.35888047551643]
We present LaRA, a novel benchmark specifically designed to rigorously compare RAG and LC LLMs.
LaRA encompasses 2,326 test cases across four practical QA task categories and three types of naturally occurring long texts.
We find that the optimal choice between RAG and LC depends on a complex interplay of factors, including the model's parameter size, long-text capabilities, context length, task type, and the characteristics of the retrieved chunks.
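The finding above suggests routing between RAG and long-context prompting based on model and input properties. A toy illustration of such a router (not LaRA's actual logic; the score parameter is a made-up stand-in for a model's measured long-text capability):

```python
# Toy RAG-vs-long-context router (illustrative, not LaRA's method): feed the
# whole document only when it fits the context window and the model handles
# long text well; otherwise fall back to retrieving chunks.

def route(doc_tokens, context_limit, long_text_score, threshold=0.7):
    """Return 'LC' to pass the full document, 'RAG' to retrieve chunks.
    `long_text_score` is a hypothetical 0-1 rating of long-context skill."""
    if doc_tokens <= context_limit and long_text_score >= threshold:
        return "LC"
    return "RAG"
```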
arXiv Detail & Related papers (2025-02-14T08:04:22Z)
- SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity [23.32017147274093]
SecBench is a benchmarking dataset designed to evaluate Large Language Models (LLMs) in the cybersecurity domain.
The dataset was constructed by collecting high-quality data from open sources and organizing a Cybersecurity Question Design Contest.
Benchmarking results on 16 SOTA LLMs demonstrate the usability of SecBench.
arXiv Detail & Related papers (2024-12-30T08:11:54Z)
- Training LayoutLM from Scratch for Efficient Named-Entity Recognition in the Insurance Domain [6.599755599064449]
Generic pre-trained neural networks may struggle to produce good results in specialized domains like finance and insurance.
This is due to a domain mismatch between training data and downstream tasks, as in-domain data are often scarce due to privacy constraints.
We show that using domain-relevant documents improves results on a named-entity recognition problem using an anonymized dataset of insurance-related financial documents.
arXiv Detail & Related papers (2024-12-12T15:09:44Z)
- Combining Domain and Alignment Vectors to Achieve Better Knowledge-Safety Trade-offs in LLMs [64.83462841029089]
We introduce an efficient merging-based alignment method called MergeAlign that interpolates the domain and alignment vectors, creating safer domain-specific models.
We apply MergeAlign on Llama3 variants that are experts in medicine and finance, obtaining substantial alignment improvements with minimal to no degradation on domain-specific benchmarks.
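The interpolation described above can be sketched in a few lines; the weighting scheme and flat-list parameters here are illustrative, not the paper's exact recipe. A "task vector" is a fine-tuned model's weights minus the base model's weights.

```python
# Minimal sketch of merging-based alignment in the spirit of MergeAlign
# (illustrative weights, not the paper's exact method): add an interpolation
# of the domain and alignment task vectors back onto the base weights.

def merge_align(base, domain, aligned, alpha=0.5):
    """Interpolate task vectors; weights are flat lists of floats here."""
    merged = []
    for b, d, a in zip(base, domain, aligned):
        domain_vec = d - b   # domain expert's task vector
        align_vec = a - b    # safety-aligned model's task vector
        merged.append(b + alpha * domain_vec + (1 - alpha) * align_vec)
    return merged
```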
arXiv Detail & Related papers (2024-11-11T09:32:20Z)
- BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data [61.936320820180875]
Large language models (LLMs) have become increasingly pivotal across various domains.
BabelBench is an innovative benchmark framework that evaluates the proficiency of LLMs in managing multimodal multistructured data with code execution.
Our experimental findings on BabelBench indicate that even cutting-edge models like ChatGPT 4 exhibit substantial room for improvement.
arXiv Detail & Related papers (2024-10-01T15:11:24Z)
- R-Eval: A Unified Toolkit for Evaluating Domain Knowledge of Retrieval Augmented Large Language Models [51.468732121824125]
Large language models have achieved remarkable success on general NLP tasks, but they may fall short for domain-specific problems.
Existing evaluation tools only provide a few baselines and evaluate them on various domains without mining the depth of domain knowledge.
In this paper, we address the challenges of evaluating RALLMs by introducing the R-Eval toolkit, a Python toolkit designed to streamline the evaluation of different RAG workflows.
arXiv Detail & Related papers (2024-06-17T15:59:49Z)
- INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance [51.36387171207314]
We propose INS-MMBench, the first comprehensive LVLMs benchmark tailored for the insurance domain.
INS-MMBench comprises a total of 2.2K thoroughly designed multiple-choice questions, covering 12 meta-tasks and 22 fundamental tasks.
This evaluation provides an in-depth performance analysis of current LVLMs on various multimodal tasks in the insurance domain.
arXiv Detail & Related papers (2024-06-13T13:31:49Z)
- CRE-LLM: A Domain-Specific Chinese Relation Extraction Framework with Fine-tuned Large Language Model [1.9950682531209156]
Domain-Specific Chinese Relation Extraction (DSCRE) aims to extract relations between entities from domain-specific Chinese text.
Given the impressive performance of large language models (LLMs) in natural language processing, we propose a new framework called CRE-LLM.
arXiv Detail & Related papers (2024-04-28T06:27:15Z)
- Harnessing GPT-4V(ision) for Insurance: A Preliminary Exploration [51.36387171207314]
Insurance involves a wide variety of data forms in its operational processes, including text, images, and videos.
GPT-4V exhibits remarkable abilities in insurance-related tasks, demonstrating a robust understanding of multimodal content.
However, GPT-4V struggles with detailed risk rating and loss assessment, suffers from hallucination in image understanding, and shows variable support for different languages.
arXiv Detail & Related papers (2024-04-15T11:45:30Z)
- When Giant Language Brains Just Aren't Enough! Domain Pizzazz with Knowledge Sparkle Dust [15.484175299150904]
This paper presents an empirical analysis aimed at bridging the gap in adapting large language models to practical use cases.
We select insurance question answering (QA) as a case study because of the reasoning it demands.
Based on this task, we design a new model that relies on LLMs empowered by additional knowledge extracted from insurance policy rulebooks and DBPedia.
arXiv Detail & Related papers (2023-05-12T03:49:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.