INSEva: A Comprehensive Chinese Benchmark for Large Language Models in Insurance
- URL: http://arxiv.org/abs/2509.04455v1
- Date: Wed, 27 Aug 2025 03:13:40 GMT
- Title: INSEva: A Comprehensive Chinese Benchmark for Large Language Models in Insurance
- Authors: Shisong Chen, Qian Zhu, Wenyan Yang, Chengyi Yang, Zhong Wang, Ping Wang, Xuan Lin, Bo Xu, Daqian Li, Chao Yuan, Licai Qi, Wanqing Xu, Zhenxing Sun, Xin Lu, Shiqiang Xiong, Chao Chen, Haixiang Hu, Yanghua Xiao
- Abstract summary: INSEva is a Chinese benchmark specifically designed for evaluating AI systems' knowledge and capabilities in insurance. INSEva features a multi-dimensional evaluation taxonomy covering business areas, task formats, difficulty levels, and cognitive-knowledge dimensions. Our benchmark implements tailored evaluation methods for assessing both faithfulness and completeness in open-ended responses.
- Score: 48.22571187209047
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Insurance, as a critical component of the global financial system, demands high standards of accuracy and reliability in AI applications. While existing benchmarks evaluate AI capabilities across various domains, they often fail to capture the unique characteristics and requirements of the insurance domain. To address this gap, we present INSEva, a comprehensive Chinese benchmark specifically designed for evaluating AI systems' knowledge and capabilities in insurance. INSEva features a multi-dimensional evaluation taxonomy covering business areas, task formats, difficulty levels, and cognitive-knowledge dimensions, comprising 38,704 high-quality evaluation examples sourced from authoritative materials. Our benchmark implements tailored evaluation methods for assessing both faithfulness and completeness in open-ended responses. Through extensive evaluation of 8 state-of-the-art Large Language Models (LLMs), we identify significant performance variations across different dimensions. While general LLMs demonstrate basic insurance domain competency with average scores above 80, substantial gaps remain in handling complex, real-world insurance scenarios. The benchmark will be made public soon.
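The abstract does not spell out how faithfulness and completeness are scored, so the snippet below is only a minimal illustrative sketch: it splits an open-ended answer and a reference answer into sentence-level claims and uses token-overlap matching as a stand-in for whatever judge model or rubric INSEva actually uses. The claim splitter, the Jaccard threshold, and all function names are hypothetical.

```python
# Illustrative-only sketch of scoring faithfulness and completeness for an
# open-ended answer against a reference answer. INSEva's actual rubric and
# judge are not described in this listing, so the claim splitting, Jaccard
# matching, and threshold below are assumptions made for illustration.
import re


def split_claims(text: str) -> list[str]:
    """Naively split text into sentence-level 'claims' (Chinese or English)."""
    return [s.strip() for s in re.split(r"[。.!?！？]", text) if s.strip()]


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the whitespace-token sets of two claims."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)


def evaluate(response: str, reference: str, threshold: float = 0.5) -> dict:
    resp_claims = split_claims(response)
    ref_claims = split_claims(reference)
    # Faithfulness: share of response claims supported by some reference claim.
    supported = sum(
        any(token_overlap(rc, ref) >= threshold for ref in ref_claims)
        for rc in resp_claims
    )
    # Completeness: share of reference claims covered by some response claim.
    covered = sum(
        any(token_overlap(ref, rc) >= threshold for rc in resp_claims)
        for ref in ref_claims
    )
    return {
        "faithfulness": supported / max(len(resp_claims), 1),
        "completeness": covered / max(len(ref_claims), 1),
    }


print(evaluate(
    "Term life insurance pays a death benefit. It lasts forever.",
    "Term life insurance pays a death benefit if the insured dies within a fixed term.",
))
```

A real grader would likely swap token_overlap for an LLM judge or entailment model; the directional structure (response-to-reference for faithfulness, reference-to-response for completeness) is the point being illustrated.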
Related papers
- Decision Quality Evaluation Framework at Pinterest [0.36944296923226316]
The framework is centered on a high-trust Golden Data Set (GDS) curated by subject matter experts (SMEs). We introduce an automated intelligent sampling pipeline that uses propensity scores to efficiently expand dataset coverage. The framework enables a shift from subjective assessments to a data-driven and quantitative practice for managing content safety systems.
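This summary only names propensity scores as the sampling mechanism. As a rough sketch under that assumption, one might model the probability that an item would have been selected into the golden set and then sample new items inversely to that probability, so under-covered regions gain coverage. Everything below (logistic regression, inverse-propensity weights, the synthetic data) is hypothetical and is not Pinterest's actual pipeline.

```python
# Hypothetical sketch of propensity-score-guided sampling to expand an
# SME-curated "golden" set. Model choice and weighting are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic feature pool plus a flag for items already in the golden set.
pool = rng.normal(size=(1000, 5))
in_golden = (pool[:, 0] + rng.normal(scale=0.5, size=1000) > 1.0).astype(int)

# Propensity: estimated probability an item would enter the golden set.
propensity = LogisticRegression().fit(pool, in_golden).predict_proba(pool)[:, 1]

# Sample inversely to propensity so under-represented regions are favored.
weights = 1.0 - propensity
weights /= weights.sum()
new_idx = rng.choice(len(pool), size=50, replace=False, p=weights)
print(f"sampled {len(new_idx)} items; mean propensity {propensity[new_idx].mean():.3f}")
```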
arXiv Detail & Related papers (2026-02-17T18:45:55Z)
- OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models [54.80460603255789]
We introduce OutSafe-Bench, the first comprehensive content safety evaluation test suite designed for the multimodal era. OutSafe-Bench includes a large-scale dataset that spans four modalities, featuring over 18,000 bilingual (Chinese and English) text prompts, 4,500 images, 450 audio clips and 450 videos, all systematically annotated across nine critical content risk categories. In addition to the dataset, we introduce a Multidimensional Cross Risk Score (MCRS), a novel metric designed to model and assess overlapping and correlated content risks across different categories.
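No formula for MCRS is given in this summary, so the following is a hypothetical illustration of one way overlapping, correlated category risks could be aggregated: a quadratic form over a category-correlation matrix, normalized to [0,1]. The matrix values and the function name are invented for the example and are not the paper's actual metric.

```python
# Hypothetical "cross risk" aggregation; MCRS itself is not specified here,
# so this quadratic-form design is purely an illustrative assumption.
import numpy as np


def cross_risk_score(risk: np.ndarray, corr: np.ndarray) -> float:
    """Aggregate per-category risks in [0,1] using a category correlation matrix.

    The quadratic form r^T C r up-weights co-occurring (correlated) risks;
    the result is normalized by the worst case in which every risk equals 1.
    """
    worst = np.ones_like(risk)
    return float(risk @ corr @ risk) / float(worst @ corr @ worst)


# Toy example with 3 of the 9 risk categories and mild positive correlation.
corr = np.array([
    [1.0, 0.4, 0.1],
    [0.4, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])
print(cross_risk_score(np.array([0.9, 0.8, 0.1]), corr))  # correlated high risks
print(cross_risk_score(np.array([0.6, 0.6, 0.6]), corr))  # uniform moderate risk
```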
arXiv Detail & Related papers (2025-11-13T13:18:27Z)
- Design, Results and Industry Implications of the World's First Insurance Large Language Model Evaluation Benchmark [9.636604321949322]
This paper elaborates on the construction methodology, multi-dimensional evaluation system, and underlying design philosophy of CUFEInse v1.0. A comprehensive evaluation was conducted on 11 mainstream large language models.
arXiv Detail & Related papers (2025-11-11T03:19:35Z)
- USB: A Comprehensive and Unified Safety Evaluation Benchmark for Multimodal Large Language Models [31.412080488801507]
The Unified Safety Benchmark (USB) is among the most comprehensive evaluation benchmarks for MLLM safety. Our benchmark features high-quality queries, extensive risk categories, comprehensive modal combinations, and encompasses both vulnerability and oversensitivity evaluations.
arXiv Detail & Related papers (2025-05-26T08:39:14Z)
- AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons [62.374792825813394]
This paper introduces AILuminate v1.0, the first comprehensive industry-standard benchmark for assessing AI-product risk and reliability. The benchmark evaluates an AI system's resistance to prompts designed to elicit dangerous, illegal, or undesirable behavior in 12 hazard categories.
arXiv Detail & Related papers (2025-02-19T05:58:52Z)
- MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z)
- CHiSafetyBench: A Chinese Hierarchical Safety Benchmark for Large Language Models [7.054112690519648]
CHiSafetyBench is a safety benchmark for evaluating large language models' capabilities in identifying risky content and refusing to answer risky questions in Chinese contexts.
The dataset comprises two types of tasks, multiple-choice questions and question answering, evaluating LLMs on risk content identification and refusal to answer risky questions, respectively.
Our experiments reveal the varying performance of different models across various safety domains, indicating that all models possess considerable potential for improvement in Chinese safety capabilities.
arXiv Detail & Related papers (2024-06-14T06:47:40Z)
- INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance [51.36387171207314]
Large Vision-Language Models (LVLMs) and Multimodal Large Language Models (MLLMs) have shown increasing promise in specialized domains. This study systematically reviews and categorizes multimodal tasks for 4 representative types of insurance: auto, property, health, and agricultural. We benchmark 11 leading LVLMs, including closed-source models such as GPT-4o and open-source models like LLaVA.
arXiv Detail & Related papers (2024-06-13T13:31:49Z)
- CValues: Measuring the Values of Chinese Large Language Models from Safety to Responsibility [62.74405775089802]
We present CValues, the first Chinese human values evaluation benchmark to measure the alignment ability of LLMs.
As a result, we have manually collected adversarial safety prompts across 10 scenarios and induced responsibility prompts from 8 domains.
Our findings suggest that while most Chinese LLMs perform well in terms of safety, there is considerable room for improvement in terms of responsibility.
arXiv Detail & Related papers (2023-07-19T01:22:40Z)