IPEval: A Bilingual Intellectual Property Agency Consultation Evaluation Benchmark for Large Language Models
- URL: http://arxiv.org/abs/2406.12386v1
- Date: Tue, 18 Jun 2024 08:18:18 GMT
- Title: IPEval: A Bilingual Intellectual Property Agency Consultation Evaluation Benchmark for Large Language Models
- Authors: Qiyao Wang, Jianguo Huang, Shule Lu, Yuan Lin, Kan Xu, Liang Yang, Hongfei Lin
- Abstract summary: IPEval comprises 2657 multiple-choice questions across four major dimensions: creation, application, protection, and management of IP.
Evaluation methods include zero-shot, five-shot, and Chain of Thought (CoT) prompting for seven LLM types, predominantly in English or Chinese.
Results show superior English performance by models like GPT series and Qwen series, while Chinese-centric LLMs excel in Chinese tests.
- Score: 13.103862590594705
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the rapid development of Large Language Models (LLMs) in vertical domains, including intellectual property (IP), there is no specific evaluation benchmark for assessing their understanding, application, and reasoning abilities. To fill this gap, we introduce IPEval, the first evaluation benchmark tailored for IP agency and consulting tasks. IPEval comprises 2657 multiple-choice questions across four major dimensions: creation, application, protection, and management of IP. These questions span patent rights (inventions, utility models, designs), trademarks, copyrights, trade secrets, and other related laws. Evaluation methods include zero-shot, five-shot, and Chain of Thought (CoT) prompting for seven LLM types, predominantly in English or Chinese. Results show superior English performance by models like the GPT series and Qwen series, while Chinese-centric LLMs excel in Chinese tests, although specialized IP LLMs lag behind general-purpose ones. Regional and temporal aspects of IP underscore the need for LLMs to grasp legal nuances and evolving laws. IPEval aims to accurately gauge LLM capabilities in IP and spur the development of specialized models. Website: \url{https://ipeval.github.io/}
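The abstract describes a multiple-choice evaluation run in zero-shot and five-shot settings. A minimal sketch of how such an evaluation loop might look is shown below; the question format, prompt wording, and the `ask_model` callback are illustrative assumptions, not IPEval's actual code.

```python
# Hypothetical sketch of a multiple-choice benchmark loop: build a zero-shot
# or k-shot prompt per question, then score the model's answers by accuracy.

def format_question(q):
    """Render one multiple-choice item as prompt text."""
    options = "\n".join(f"{label}. {text}" for label, text in q["options"].items())
    return f"Question: {q['question']}\n{options}\nAnswer:"

def build_prompt(question, exemplars=()):
    """Zero-shot when `exemplars` is empty; k-shot otherwise."""
    shots = [format_question(ex) + " " + ex["answer"] for ex in exemplars]
    return "\n\n".join(shots + [format_question(question)])

def accuracy(questions, exemplars, ask_model):
    """Score a model; `ask_model(prompt)` should return a choice letter."""
    correct = sum(
        ask_model(build_prompt(q, exemplars)).strip().upper().startswith(q["answer"])
        for q in questions
    )
    return correct / len(questions)
```

With five exemplar questions passed as `exemplars`, this reproduces the five-shot setting; an empty tuple gives zero-shot. CoT prompting would additionally ask the model to reason before answering, which is omitted here.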
Related papers
- LexGenius: An Expert-Level Benchmark for Large Language Models in Legal General Intelligence [74.05988707492058]
Legal general intelligence (GI) refers to artificial intelligence (AI) that encompasses legal understanding, reasoning, and decision-making.
Existing benchmarks are result-oriented and fail to systematically evaluate the legal intelligence of large language models (LLMs).
We propose LexGenius, an expert-level Chinese legal benchmark for evaluating legal GI in LLMs.
arXiv Detail & Related papers (2025-12-04T08:48:02Z)
- Benchmarking Chinese Commonsense Reasoning with a Multi-hop Reasoning Perspective [53.594353527056775]
We propose Chinese Commonsense Multi-hop Reasoning (CCMOR) to evaluate Large Language Models (LLMs).
CCMOR is designed to evaluate LLMs' ability to integrate Chinese-specific factual knowledge with multi-step logical reasoning.
We implement a human-in-the-loop verification system, where domain experts systematically validate and refine the generated questions.
arXiv Detail & Related papers (2025-10-09T20:29:00Z)
- Evaluating Large Language Model with Knowledge Oriented Language Specific Simple Question Answering [73.73820209993515]
We introduce KoLasSimpleQA, the first benchmark evaluating the multilingual factual ability of Large Language Models (LLMs).
Inspired by existing research, we created the question set with features such as single knowledge point coverage, absolute objectivity, unique answers, and temporal stability.
Results show significant performance differences between the two domains.
arXiv Detail & Related papers (2025-05-22T12:27:02Z)
- PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts [79.84059473102778]
PolyMath is a multilingual mathematical reasoning benchmark covering 18 languages and 4 easy-to-hard difficulty levels.
Our benchmark ensures difficulty comprehensiveness, language diversity, and high-quality translation.
arXiv Detail & Related papers (2025-04-25T15:39:04Z)
- IberBench: LLM Evaluation on Iberian Languages [2.3034630097498883]
Large Language Models (LLMs) are difficult to evaluate comprehensively, particularly for languages other than English.
We present IberBench, a benchmark designed to assess LLM performance on both fundamental and industry-relevant NLP tasks.
We evaluate 23 LLMs ranging from 100 million to 14 billion parameters and provide empirical insights into their strengths and limitations.
arXiv Detail & Related papers (2025-04-23T17:48:25Z)
- IPBench: Benchmarking the Knowledge of Large Language Models in Intellectual Property [30.847397185230538]
We introduce the first comprehensive IP task taxonomy and a large, diverse bilingual benchmark, IPBench.
This benchmark is designed to evaluate LLMs in real-world intellectual property applications, encompassing both understanding and generation.
We benchmark 16 LLMs, ranging from general-purpose to domain-specific models, and find that even the best-performing model achieves only 75.8% accuracy.
arXiv Detail & Related papers (2025-04-22T02:00:41Z)
- ChatGPT or A Silent Everywhere Helper: A Survey of Large Language Models [0.0]
Chat Generative Pre-trained Transformer (ChatGPT) stands out as a notable example due to its advanced capabilities and widespread applications.
This survey provides a comprehensive analysis of ChatGPT, exploring its architecture, training processes, and functionalities.
arXiv Detail & Related papers (2025-03-19T22:55:08Z)
- InternLM-Law: An Open Source Chinese Legal Large Language Model [72.2589401309848]
InternLM-Law is a specialized LLM tailored for addressing diverse legal queries related to Chinese laws.
We meticulously construct a dataset in the Chinese legal domain, encompassing over 1 million queries.
InternLM-Law achieves the highest average performance on LawBench, outperforming state-of-the-art models, including GPT-4, on 13 out of 20 subtasks.
arXiv Detail & Related papers (2024-06-21T06:19:03Z)
- Unveiling the Competitive Dynamics: A Comparative Evaluation of American and Chinese LLMs [13.112787482203286]
This study provides a comparative evaluation of American and Chinese Large Language Models (LLMs) in both English and Chinese contexts.
We proposed a comprehensive evaluation framework that encompasses natural language proficiency, disciplinary expertise, and safety and responsibility.
Our key findings show that GPT-4 Turbo is at the forefront in English contexts, whereas Ernie-Bot 4 stands out in Chinese contexts.
arXiv Detail & Related papers (2024-05-09T15:39:19Z)
- PatentGPT: A Large Language Model for Intellectual Property [26.31216865513109]
Large language models (LLMs) have attracted significant attention due to their exceptional performance across a multitude of natural language processing tasks.
However, the application of large language models in the Intellectual Property (IP) domain is challenging due to the strong need for specialized knowledge.
We present for the first time a low-cost, standardized procedure for training IP-oriented LLMs, meeting the unique requirements of the IP domain.
arXiv Detail & Related papers (2024-04-28T17:36:43Z)
- Benchmarking Large Language Models for Persian: A Preliminary Study Focusing on ChatGPT [4.574416868427695]
This paper explores the efficacy of large language models (LLMs) for Persian.
We present the first comprehensive benchmarking study of LLMs across diverse Persian language tasks.
arXiv Detail & Related papers (2024-04-03T02:12:29Z)
- MoZIP: A Multilingual Benchmark to Evaluate Large Language Models in Intellectual Property [51.43412400869531]
Large language models (LLMs) have demonstrated impressive performance in various natural language processing (NLP) tasks.
We contribute a new benchmark, the first Multilingual-oriented quiZ on Intellectual Property (MoZIP), for the evaluation of LLMs in the IP domain.
We also develop a new IP-oriented multilingual large language model (called MoZi), which is a BLOOMZ-based model that has been supervised fine-tuned with multilingual IP-related text data.
arXiv Detail & Related papers (2024-02-26T08:27:50Z)
- OMGEval: An Open Multilingual Generative Evaluation Benchmark for Large Language Models [59.54423478596468]
We introduce OMGEval, the first Open-source Multilingual Generative test set that can assess the capability of LLMs in different languages.
For each language, OMGEval provides 804 open-ended questions, covering a wide range of important capabilities of LLMs.
Specifically, the current version of OMGEval includes 5 languages (i.e., Zh, Ru, Fr, Es, Ar).
arXiv Detail & Related papers (2024-02-21T04:42:41Z)
- Large Language Models: A Survey [69.72787936480394]
Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks.
LLMs' general-purpose language understanding and generation abilities are acquired by training billions of model parameters on massive amounts of text data.
arXiv Detail & Related papers (2024-02-09T05:37:09Z)
- SEED-Bench-2: Benchmarking Multimodal Large Language Models [67.28089415198338]
Multimodal large language models (MLLMs) have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs.
SEED-Bench-2 comprises 24K multiple-choice questions with accurate human annotations, which spans 27 dimensions.
We evaluate the performance of 23 prominent open-source MLLMs and summarize valuable observations.
arXiv Detail & Related papers (2023-11-28T05:53:55Z)
- Precedent-Enhanced Legal Judgment Prediction with LLM and Domain-Model Collaboration [52.57055162778548]
Legal Judgment Prediction (LJP) has become an increasingly crucial task in Legal AI.
Precedents are the previous legal cases with similar facts, which are the basis for the judgment of the subsequent case in national legal systems.
Recent advances in deep learning have enabled a variety of techniques to be used to solve the LJP task.
arXiv Detail & Related papers (2023-10-13T16:47:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.