Related papers: Domain Mastery Benchmark: An Ever-Updating Benchmark for Evaluating Holistic Domain Knowledge of Large Language Model--A Preliminary Release

Domain Mastery Benchmark: An Ever-Updating Benchmark for Evaluating Holistic Domain Knowledge of Large Language Model--A Preliminary Release

URL: http://arxiv.org/abs/2304.11679v2
Date: Thu, 10 Aug 2023 05:27:58 GMT
Title: Domain Mastery Benchmark: An Ever-Updating Benchmark for Evaluating Holistic Domain Knowledge of Large Language Model--A Preliminary Release
Authors: Zhouhong Gu, Xiaoxuan Zhu, Haoning Ye, Lin Zhang, Zhuozhi Xiong, Zihan Li, Qianyu He, Sihang Jiang, Hongwei Feng, Yanghua Xiao
Abstract summary: DomMa targets at testing Large Language Models (LLMs) on their domain knowledge understanding. It features extensive domain coverage, large data volume, and a continually updated data set based on Chinese 112 first-level subject classifications.
Score: 13.603414598813938
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Domain knowledge refers to the in-depth understanding, expertise, and familiarity with a specific subject, industry, field, or area of special interest. The existing benchmarks are all lack of an overall design for domain knowledge evaluation. Holding the belief that the real ability of domain language understanding can only be fairly evaluated by an comprehensive and in-depth benchmark, we introduces the Domma, a Domain Mastery Benchmark. DomMa targets at testing Large Language Models (LLMs) on their domain knowledge understanding, it features extensive domain coverage, large data volume, and a continually updated data set based on Chinese 112 first-level subject classifications. DomMa consist of 100,000 questions in both Chinese and English sourced from graduate entrance examinations and undergraduate exams in Chinese college. We have also propose designs to make benchmark and evaluation process more suitable to LLMs.

Related papers

MME-Industry: A Cross-Industry Multimodal Evaluation Benchmark [20.642661835794975]
We introduce MME-Industry, a novel benchmark designed specifically for evaluating MLLMs in industrial settings. The benchmark encompasses 21 distinct domain, comprising 1050 question-answer pairs with 50 questions per domain. We provide both Chinese and English versions of the benchmark, enabling comparative analysis of MLLMs' capabilities across these languages.
arXiv Detail & Related papers (2025-01-28T03:56:17Z)
Top General Performance = Top Domain Performance? DomainCodeBench: A Multi-domain Code Generation Benchmark [38.14474956762422]
We introduce DomainCodeBench, a benchmark designed to evaluate large language models (LLMs) across 12 software application domains and 15 programming languages. We find that top general-domain models do not consistently excel in specific application domains. We show that augmenting prompts with domain-specific knowledge improves performance by around 38.17%.
arXiv Detail & Related papers (2024-12-24T17:56:08Z)
Large Language Model for Multi-Domain Translation: Benchmarking and Domain CoT Fine-tuning [55.107329995417786]
Large language models (LLMs) have demonstrated impressive general understanding and generation abilities. We establish a benchmark for multi-domain translation, featuring 25 German$Leftrightarrow$English and 22 Chinese$Leftrightarrow$English test sets. We propose a domain Chain of Thought (CoT) fine-tuning technique that utilizes the intrinsic multi-domain intelligence of LLMs to improve translation performance.
arXiv Detail & Related papers (2024-10-03T16:15:04Z)
Pretraining and Updates of Domain-Specific LLM: A Case Study in the Japanese Business Domain [4.133477882188227]
This paper presents our findings from training and evaluating a Japanese business domain-specific LLM. Our pretrained model and business domain benchmark are publicly available to support further studies.
arXiv Detail & Related papers (2024-04-12T06:21:48Z)
Systematic Assessment of Factual Knowledge in Large Language Models [48.75961313441549]
This paper proposes a framework to assess the factual knowledge of large language models (LLMs) by leveraging knowledge graphs (KGs) Our framework automatically generates a set of questions and expected answers from the facts stored in a given KG, and then evaluates the accuracy of LLMs in answering these questions.
arXiv Detail & Related papers (2023-10-18T00:20:50Z)
NuclearQA: A Human-Made Benchmark for Language Models for the Nuclear Domain [0.0]
NuclearQA is a human-made benchmark of 100 questions to evaluate language models in the nuclear domain. We show how the mix of several types of questions makes our benchmark uniquely capable of evaluating models in the nuclear domain.
arXiv Detail & Related papers (2023-10-17T01:27:20Z)
Xiezhi: An Ever-Updating Benchmark for Holistic Domain Knowledge Evaluation [61.56563631219381]
We present Xiezhi, the most comprehensive evaluation suite designed to assess holistic domain knowledge. Xiezhi comprises multiple-choice questions across 516 diverse disciplines ranging from 13 different subjects with 249,587 questions and accompanied by Xiezhi- Specialty and Xiezhi-Interdiscipline, both with 15k questions.
arXiv Detail & Related papers (2023-06-09T09:52:05Z)
Domain Specialization as the Key to Make Large Language Models Disruptive: A Comprehensive Survey [100.24095818099522]
Large language models (LLMs) have significantly advanced the field of natural language processing (NLP) They provide a highly useful, task-agnostic foundation for a wide range of applications. However, directly applying LLMs to solve sophisticated problems in specific domains meets many hurdles.
arXiv Detail & Related papers (2023-05-30T03:00:30Z)
PMC-LLaMA: Towards Building Open-source Language Models for Medicine [62.39105735933138]
Large Language Models (LLMs) have showcased remarkable capabilities in natural language understanding. LLMs struggle in domains that require precision, such as medical applications, due to their lack of domain-specific knowledge. We describe the procedure for building a powerful, open-source language model specifically designed for medicine applications, termed as PMC-LLaMA.
arXiv Detail & Related papers (2023-04-27T18:29:05Z)
DomBERT: Domain-oriented Language Model for Aspect-based Sentiment Analysis [71.40586258509394]
We propose DomBERT, an extension of BERT to learn from both in-domain corpus and relevant domain corpora. Experiments are conducted on an assortment of tasks in aspect-based sentiment analysis, demonstrating promising results.
arXiv Detail & Related papers (2020-04-28T21:07:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.