INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance
- URL: http://arxiv.org/abs/2406.09105v1
- Date: Thu, 13 Jun 2024 13:31:49 GMT
- Title: INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance
- Authors: Chenwei Lin, Hanjia Lyu, Xian Xu, Jiebo Luo
- Abstract summary: We propose INS-MMBench, the first comprehensive LVLMs benchmark tailored for the insurance domain.
INS-MMBench comprises a total of 2.2K thoroughly designed multiple-choice questions, covering 12 meta-tasks and 22 fundamental tasks.
This evaluation provides an in-depth performance analysis of current LVLMs on various multimodal tasks in the insurance domain.
- Score: 51.36387171207314
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Vision-Language Models (LVLMs) have demonstrated outstanding performance in various general multimodal applications such as image recognition and visual reasoning, and have also shown promising potential in specialized domains. However, the application potential of LVLMs in the insurance domain, which is characterized by rich application scenarios and abundant multimodal data, has not been effectively explored. There is no systematic review of multimodal tasks in the insurance domain, nor a benchmark specifically designed to evaluate the capabilities of LVLMs in insurance. This gap hinders the development of LVLMs within the insurance domain. In this paper, we systematically review and distill multimodal tasks for four representative types of insurance: auto insurance, property insurance, health insurance, and agricultural insurance. We propose INS-MMBench, the first comprehensive LVLMs benchmark tailored for the insurance domain. INS-MMBench comprises a total of 2.2K thoroughly designed multiple-choice questions, covering 12 meta-tasks and 22 fundamental tasks. Furthermore, we evaluate multiple representative LVLMs, including closed-source models such as GPT-4o and open-source models like BLIP-2. This evaluation not only validates the effectiveness of our benchmark but also provides an in-depth performance analysis of current LVLMs on various multimodal tasks in the insurance domain. We hope that INS-MMBench will facilitate the further application of LVLMs in the insurance domain and inspire interdisciplinary development. Our dataset and evaluation code are available at https://github.com/FDU-INS/INS-MMBench.
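As a rough illustration of how a multiple-choice benchmark of this kind is typically scored, the sketch below computes accuracy for a vision-language model over a set of questions. It is not the official INS-MMBench evaluation code: the sample file name, JSON field names, and the predict() stub are assumptions and would need to be adapted to the released dataset at the repository above.

    # Hypothetical sketch of multiple-choice accuracy scoring; the file name,
    # field names, and predict() are placeholders, not the official INS-MMBench API.
    import json

    def predict(image_path: str, question: str, options: dict) -> str:
        """Stand-in for an LVLM call (e.g. GPT-4o or BLIP-2); returns an option letter."""
        return "A"  # replace with a real model query

    def evaluate(path: str) -> float:
        with open(path, encoding="utf-8") as f:
            items = json.load(f)  # assumed: list of {image, question, options, answer}
        correct = sum(
            predict(it["image"], it["question"], it["options"]) == it["answer"]
            for it in items
        )
        return correct / len(items)

    if __name__ == "__main__":
        print(f"Accuracy: {evaluate('ins_mmbench_sample.json'):.3f}")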
Related papers
- SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types [21.683010095703832]
We develop a novel benchmark to assess the generalization of large language model (LLM) safety across various tasks and prompt types.
This benchmark integrates both generative and discriminative evaluation tasks and includes extended data to examine the impact of prompt engineering and jailbreak on LLM safety.
Our assessment reveals that most LLMs perform worse on discriminative tasks than generative ones and are highly sensitive to prompt variations, indicating poor generalization in safety alignment.
arXiv Detail & Related papers (2024-10-29T11:47:01Z)
- SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models [75.67623347512368]
We propose SafeBench, a comprehensive framework designed for conducting safety evaluations of MLLMs.
Our framework consists of a comprehensive harmful query dataset and an automated evaluation protocol.
Based on our framework, we conducted large-scale experiments on 15 widely-used open-source MLLMs and 6 commercial MLLMs.
arXiv Detail & Related papers (2024-10-24T17:14:40Z)
- Understanding the Role of LLMs in Multimodal Evaluation Benchmarks [77.59035801244278]
This paper investigates the role of the Large Language Model (LLM) backbone in Multimodal Large Language Models (MLLMs) evaluation.
Our study encompasses four diverse MLLM benchmarks and eight state-of-the-art MLLMs.
Key findings reveal that some benchmarks allow high performance even without visual inputs, and that up to 50% of errors can be attributed to insufficient world knowledge in the LLM backbone.
arXiv Detail & Related papers (2024-10-16T07:49:13Z)
- Needle In A Multimodal Haystack [79.81804334634408]
We present the first benchmark specifically designed to evaluate the capability of existing MLLMs to comprehend long multimodal documents.
Our benchmark includes three types of evaluation tasks: multimodal retrieval, counting, and reasoning.
We observe that existing models still have significant room for improvement on these tasks, especially on vision-centric evaluation.
arXiv Detail & Related papers (2024-06-11T13:09:16Z)
- MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI [71.53579367538725]
MMT-Bench is a benchmark designed to assess Large Vision-Language Models (LVLMs) across massive multimodal tasks.
MMT-Bench comprises 31,325 meticulously curated multiple-choice visual questions from various multimodal scenarios.
arXiv Detail & Related papers (2024-04-24T17:37:05Z)
- Harnessing GPT-4V(ision) for Insurance: A Preliminary Exploration [51.36387171207314]
Insurance involves a wide variety of data forms in its operational processes, including text, images, and videos.
GPT-4V exhibits remarkable abilities in insurance-related tasks, demonstrating a robust understanding of multimodal content.
However, GPT-4V struggles with detailed risk rating and loss assessment, suffers from hallucination in image understanding, and shows variable support for different languages.
arXiv Detail & Related papers (2024-04-15T11:45:30Z)
- Through the Lens of Core Competency: Survey on Evaluation of Large Language Models [27.271533306818732]
Large language models (LLMs) have excellent performance and wide practical uses.
Existing evaluation tasks struggle to keep up with the wide range of applications in real-world scenarios.
We summarize four core competencies of LLMs: reasoning, knowledge, reliability, and safety.
Under this competency architecture, similar tasks are combined to reflect the corresponding ability, while new tasks can easily be added to the system.
arXiv Detail & Related papers (2023-08-15T17:40:34Z)