Related papers: IPBench: Benchmarking the Knowledge of Large Language Models in Intellectual Property

IPBench: Benchmarking the Knowledge of Large Language Models in Intellectual Property

URL: http://arxiv.org/abs/2504.15524v2
Date: Mon, 29 Sep 2025 00:57:32 GMT
Title: IPBench: Benchmarking the Knowledge of Large Language Models in Intellectual Property
Authors: Qiyao Wang, Guhong Chen, Hongbo Wang, Huaren Liu, Minghui Zhu, Zhifei Qin, Linwei Li, Yilin Yue, Shiqiang Wang, Jiayan Li, Yihang Wu, Ziqiang Liu, Longze Chen, Run Luo, Liyang Fan, Jiaming Li, Lei Zhang, Kan Xu, Chengming Li, Hamid Alinejad-Rokny, Shiwen Ni, Yuan Lin, Min Yang,
Abstract summary: IPBench is the first comprehensive IP task taxonomy and a large-scale benchmark encompassing 8 IP mechanisms and 20 distinct tasks.<n>We benchmark 17 main LLMs, ranging from general purpose to domain-specific, including chat-oriented and reasoning-focused models.<n>Our results show that even the top-performing model, DeepSeek-V3, achieves only 75.8% accuracy, indicating significant room for improvement.
Score: 53.2129505804405
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Intellectual Property (IP) is a highly specialized domain that integrates technical and legal knowledge, making it inherently complex and knowledge-intensive. Recent advancements in LLMs have demonstrated their potential to handle IP-related tasks, enabling more efficient analysis, understanding, and generation of IP-related content. However, existing datasets and benchmarks focus narrowly on patents or cover limited aspects of the IP field, lacking alignment with real-world scenarios. To bridge this gap, we introduce IPBench, the first comprehensive IP task taxonomy and a large-scale bilingual benchmark encompassing 8 IP mechanisms and 20 distinct tasks, designed to evaluate LLMs in real-world IP scenarios. We benchmark 17 main LLMs, ranging from general purpose to domain-specific, including chat-oriented and reasoning-focused models, under zero-shot, few-shot, and chain-of-thought settings. Our results show that even the top-performing model, DeepSeek-V3, achieves only 75.8% accuracy, indicating significant room for improvement. Notably, open-source IP and law-oriented models lag behind closed-source general-purpose models. To foster future research, we publicly release IPBench, and will expand it with additional tasks to better reflect real-world complexities and support model advancements in the IP domain. We provide the data and code in the supplementary URLs.

Related papers

BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities [61.173773299032746]
Embodied capabilities refer to a suite of fundamental abilities for an agent to perceive, comprehend, and interact with the physical world.<n>We introduce BEAR, a comprehensive and fine-grained benchmark that evaluates MLLMs on atomic embodied capabilities.<n> BEAR comprises 4,469 interleaved image-video-text entries across 14 domains in 6 categories, including tasks from low-level pointing, trajectory understanding, spatial reasoning, to high-level planning.<n>We propose BEAR-Agent, a multimodal conversable agent that integrates pretrained vision models to strengthen MLLM perception, 3D understanding, and planning capabilities.
arXiv Detail & Related papers (2025-10-09T19:18:36Z)
MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers [86.00932417210477]
We introduce MCP-Universe, the first comprehensive benchmark specifically designed to evaluate LLMs in realistic and hard tasks through interaction with real-world MCP servers.<n>Our benchmark encompasses 6 core domains spanning 11 different MCP servers: Location Navigation, Repository Management, Financial Analysis, 3D Design, Browser Automation, and Web Searching.<n>We find that even SOTA models such as GPT-5 (43.72%), Grok-4 (33.33%) and Claude-4.0-Sonnet (29.44%) exhibit significant performance limitations.
arXiv Detail & Related papers (2025-08-20T13:28:58Z)
RealBench: Benchmarking Verilog Generation Models with Real-World IP Designs [26.993718615404926]
We present RealBench, the first benchmark aiming at real-world IP-level Verilog generation tasks.<n>RealBench features complex, structured, real-world open-source IP designs, multi-modal and formatted design specifications, and rigorous verification environments.<n> Evaluations on various LLMs and agents reveal that even one of the best-performing LLMs, o1-preview, achieves only a 13.3% pass@1 on module-level tasks and 0% on system-level tasks.
arXiv Detail & Related papers (2025-07-22T03:29:23Z)
SOP-Bench: Complex Industrial SOPs for Evaluating LLM Agents [16.08820954102608]
Large Language Models (LLMs) demonstrate impressive general-purpose reasoning and problem-solving abilities.<n>LLMs struggle with executing complex, long-horizon that demand strict adherence to Standard Operating Procedures.<n>We develop SOP-Bench, a benchmark of over 1,800 tasks across 10 industrial domains.
arXiv Detail & Related papers (2025-06-09T18:20:12Z)
Evaluating Large Language Model with Knowledge Oriented Language Specific Simple Question Answering [73.73820209993515]
We introduce KoLasSimpleQA, the first benchmark evaluating the multilingual factual ability of Large Language Models (LLMs)<n>Inspired by existing research, we created the question set with features such as single knowledge point coverage, absolute objectivity, unique answers, and temporal stability.<n>Results show significant performance differences between the two domains.
arXiv Detail & Related papers (2025-05-22T12:27:02Z)
Computational Reasoning of Large Language Models [51.629694188014064]
We introduce textbfTuring Machine Bench, a benchmark to assess the ability of Large Language Models (LLMs) to execute reasoning processes.<n> TMBench incorporates four key features: self-contained and knowledge-agnostic reasoning, a minimalistic multi-step structure, controllable difficulty, and a theoretical foundation based on Turing machine.
arXiv Detail & Related papers (2025-04-29T13:52:47Z)
Enhancing Large Language Models (LLMs) for Telecommunications using Knowledge Graphs and Retrieval-Augmented Generation [52.8352968531863]
Large language models (LLMs) have made significant progress in general-purpose natural language processing tasks.<n>This paper presents a novel framework that combines knowledge graph (KG) and retrieval-augmented generation (RAG) techniques to enhance LLM performance in the telecom domain.
arXiv Detail & Related papers (2025-03-31T15:58:08Z)
Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM [58.42678619252968]
Creation-MMBench is a benchmark designed to evaluate the creative capabilities of Multimodal Large Language Models.<n>The benchmark comprises 765 test cases spanning 51 fine-grained tasks.<n> Experimental results reveal that open-source MLLMs significantly underperform compared to proprietary models in creative tasks.
arXiv Detail & Related papers (2025-03-18T17:51:34Z)
Vision-Language Model IP Protection via Prompt-based Learning [52.783709712318405]
We introduce IP-CLIP, a lightweight IP protection strategy tailored to vision-language models (VLMs)<n>By leveraging the frozen visual backbone of CLIP, we extract both image style and content information, incorporating them into the learning of IP prompt.<n>This strategy acts as a robust barrier, effectively preventing the unauthorized transfer of features from authorized domains to unauthorized ones.
arXiv Detail & Related papers (2025-03-04T08:31:12Z)
LegalAgentBench: Evaluating LLM Agents in Legal Domain [53.70993264644004]
LegalAgentBench is a benchmark specifically designed to evaluate LLM Agents in the Chinese legal domain.<n>LegalAgentBench includes 17 corpora from real-world legal scenarios and provides 37 tools for interacting with external knowledge.
arXiv Detail & Related papers (2024-12-23T04:02:46Z)
Intellectual Property Protection for Deep Learning Model and Dataset Intelligence [21.757997058357]
This work systematically summarizes the general and scheme-specific performance evaluation metrics. From proactive IP infringement prevention and reactive IP ownership verification perspectives, it comprehensively investigates and analyzes the existing IPP methods. Finally, we outline prospects for promising future directions that may act as a guide for innovative research.
arXiv Detail & Related papers (2024-11-07T09:02:41Z)
IPEval: A Bilingual Intellectual Property Agency Consultation Evaluation Benchmark for Large Language Models [13.103862590594705]
IPEval comprises 2657 multiple-choice questions across four major dimensions: creation, application, protection, and management of IP. Evaluation methods include zero-shot, 5-few-shot, and Chain of Thought (CoT) for seven LLM types, predominantly in English or Chinese. Results show superior English performance by models like GPT series and Qwen series, while Chinese-centric LLMs excel in Chinese tests.
arXiv Detail & Related papers (2024-06-18T08:18:18Z)
PatentGPT: A Large Language Model for Intellectual Property [26.31216865513109]
Large language models (LLMs) have attracted significant attention due to their exceptional performance across a multitude of natural language process tasks. However, the application of large language models in the Intellectual Property (IP) domain is challenging due to the strong need for specialized knowledge. We present for the first time a low-cost, standardized procedure for training IP-oriented LLMs, meeting the unique requirements of the IP domain.
arXiv Detail & Related papers (2024-04-28T17:36:43Z)
ProgGen: Generating Named Entity Recognition Datasets Step-by-step with Self-Reflexive Large Language Models [25.68491572293656]
Large Language Models fall short in structured knowledge extraction tasks such as named entity recognition. This paper explores an innovative, cost-efficient strategy to harness LLMs with modest NER capabilities for producing superior NER datasets.
arXiv Detail & Related papers (2024-03-17T06:12:43Z)
MoZIP: A Multilingual Benchmark to Evaluate Large Language Models in Intellectual Property [51.43412400869531]
Large language models (LLMs) have demonstrated impressive performance in various natural language processing (NLP) tasks. We contribute a new benchmark, the first Multilingual-oriented quiZ on Intellectual Property (MoZIP), for the evaluation of LLMs in the IP domain. We also develop a new IP-oriented multilingual large language model (called MoZi), which is a BLOOMZ-based model that has been supervised fine-tuned with multilingual IP-related text data.
arXiv Detail & Related papers (2024-02-26T08:27:50Z)
Towards Verifiable Generation: A Benchmark for Knowledge-aware Language Model Attribution [48.86322922826514]
This paper defines a new task of Knowledge-aware Language Model Attribution (KaLMA) First, we extend attribution source from unstructured texts to Knowledge Graph (KG), whose rich structures benefit both the attribution performance and working scenarios. Second, we propose a new Conscious Incompetence" setting considering the incomplete knowledge repository. Third, we propose a comprehensive automatic evaluation metric encompassing text quality, citation quality, and text citation alignment.
arXiv Detail & Related papers (2023-10-09T11:45:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.