AI-Based Measurement of Innovation: Mapping Expert Insight into Large Language Model Applications
- URL: http://arxiv.org/abs/2508.02430v1
- Date: Mon, 04 Aug 2025 13:49:30 GMT
- Title: AI-Based Measurement of Innovation: Mapping Expert Insight into Large Language Model Applications
- Authors: Robin Nowak, Patrick Figge, Carolin Haeussler,
- Abstract summary: Large language models (LLMs) can be leveraged to overcome the constraints of manual expert evaluations.<n>We design an LLM framework that reliably approximates domain experts' assessment of innovation from unstructured text data.<n>This article equips R&D personnel in firms, as well as researchers, reviewers, and editors, with the knowledge and tools to effectively use LLMs for measuring innovation.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Measuring innovation often relies on context-specific proxies and on expert evaluation. Hence, empirical innovation research is often limited to settings where such data is available. We investigate how large language models (LLMs) can be leveraged to overcome the constraints of manual expert evaluations and assist researchers in measuring innovation. We design an LLM framework that reliably approximates domain experts' assessment of innovation from unstructured text data. We demonstrate the performance and broad applicability of this framework through two studies in different contexts: (1) the innovativeness of software application updates and (2) the originality of user-generated feedback and improvement ideas in product reviews. We compared the performance (F1-score) and reliability (consistency rate) of our LLM framework against alternative measures used in prior innovation studies, and to state-of-the-art machine learning- and deep learning-based models. The LLM framework achieved higher F1-scores than the other approaches, and its results are highly consistent (i.e., results do not change across runs). This article equips R&D personnel in firms, as well as researchers, reviewers, and editors, with the knowledge and tools to effectively use LLMs for measuring innovation and evaluating the performance of LLM-based innovation measures. In doing so, we discuss, the impact of important design decisions-including model selection, prompt engineering, training data size, training data distribution, and parameter settings-on performance and reliability. Given the challenges inherent in using human expert evaluation and existing text-based measures, our framework has important implications for harnessing LLMs as reliable, increasingly accessible, and broadly applicable research tools for measuring innovation.
Related papers
- Dynamic Knowledge Exchange and Dual-diversity Review: Concisely Unleashing the Potential of a Multi-Agent Research Team [53.38438460574943]
IDVSCI is a multi-agent framework built on large language models (LLMs)<n>It incorporates two key innovations: a Dynamic Knowledge Exchange mechanism and a Dual-Diversity Review paradigm.<n>Results show that IDVSCI consistently achieves the best performance across two datasets.
arXiv Detail & Related papers (2025-06-23T07:12:08Z) - OpenUnlearning: Accelerating LLM Unlearning via Unified Benchmarking of Methods and Metrics [101.78963920333342]
We introduce OpenUnlearning, a standardized framework for benchmarking large language models (LLMs) unlearning methods and metrics.<n>OpenUnlearning integrates 9 unlearning algorithms and 16 diverse evaluations across 3 leading benchmarks.<n>We also benchmark diverse unlearning methods and provide a comparative analysis against an extensive evaluation suite.
arXiv Detail & Related papers (2025-06-14T20:16:37Z) - MCP-RADAR: A Multi-Dimensional Benchmark for Evaluating Tool Use Capabilities in Large Language Models [11.809732662992982]
This paper introduces MCP-RADAR, the first comprehensive benchmark specifically designed to evaluate Large Language Models (LLMs) performance in the Model Context Protocol (MCP) framework.<n>Unlike conventional benchmarks that rely on subjective human evaluations or binary success metrics, MCP-RADAR employs objective, quantifiable measurements across multiple task domains.
arXiv Detail & Related papers (2025-05-22T14:02:37Z) - Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric [99.56567010306807]
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications.<n>One core challenge of evaluation in the large language model (LLM) era is the generalization issue.<n>We propose Model Utilization Index (MUI), a mechanism interpretability enhanced metric that complements traditional performance scores.
arXiv Detail & Related papers (2025-04-10T04:09:47Z) - IdeaBench: Benchmarking Large Language Models for Research Idea Generation [19.66218274796796]
Large Language Models (LLMs) have transformed how people interact with artificial intelligence (AI) systems.
We propose IdeaBench, a benchmark system that includes a comprehensive dataset and an evaluation framework.
Our dataset comprises titles and abstracts from a diverse range of influential papers, along with their referenced works.
Our evaluation framework is a two-stage process: first, using GPT-4o to rank ideas based on user-specified quality indicators such as novelty and feasibility, enabling scalable personalization.
arXiv Detail & Related papers (2024-10-31T17:04:59Z) - Evaluating the Impact of Advanced LLM Techniques on AI-Lecture Tutors for a Robotics Course [0.35132421583441026]
This study evaluates the performance of Large Language Models (LLMs) as an Artificial Intelligence-based tutor for a university course.
In particular, different advanced techniques are utilized, such as prompt engineering, Retrieval-Augmented-Generation (RAG), and fine-tuning.
Our findings indicate that RAG combined with prompt engineering significantly enhances model responses and produces better factual answers.
arXiv Detail & Related papers (2024-08-02T19:49:19Z) - Systematic Task Exploration with LLMs: A Study in Citation Text Generation [63.50597360948099]
Large language models (LLMs) bring unprecedented flexibility in defining and executing complex, creative natural language generation (NLG) tasks.
We propose a three-component research framework that consists of systematic input manipulation, reference data, and output measurement.
We use this framework to explore citation text generation -- a popular scholarly NLP task that lacks consensus on the task definition and evaluation metric.
arXiv Detail & Related papers (2024-07-04T16:41:08Z) - Disce aut Deficere: Evaluating LLMs Proficiency on the INVALSI Italian Benchmark [12.729687989535359]
evaluating Large Language Models (LLMs) in languages other than English is crucial for ensuring their linguistic versatility, cultural relevance, and applicability in diverse global contexts.
We tackle this challenge by introducing a structured benchmark using the INVALSI tests, a set of well-established assessments designed to measure educational competencies across Italy.
arXiv Detail & Related papers (2024-06-25T13:20:08Z) - Towards Effective Evaluations and Comparisons for LLM Unlearning Methods [97.2995389188179]
This paper seeks to refine the evaluation of machine unlearning for large language models.<n>It addresses two key challenges -- the robustness of evaluation metrics and the trade-offs between competing goals.
arXiv Detail & Related papers (2024-06-13T14:41:00Z) - Rethinking Machine Unlearning for Large Language Models [85.92660644100582]
We explore machine unlearning in the domain of large language models (LLMs)<n>This initiative aims to eliminate undesirable data influence (e.g., sensitive or illegal information) and the associated model capabilities.
arXiv Detail & Related papers (2024-02-13T20:51:58Z) - Human-AI Collaborative Essay Scoring: A Dual-Process Framework with LLMs [13.262711792955377]
This study explores the effectiveness of Large Language Models (LLMs) for automated essay scoring.
We propose an open-source LLM-based AES system, inspired by the dual-process theory.
We find that our system not only automates the grading process but also enhances the performance and efficiency of human graders.
arXiv Detail & Related papers (2024-01-12T07:50:10Z) - Improving Open Information Extraction with Large Language Models: A
Study on Demonstration Uncertainty [52.72790059506241]
Open Information Extraction (OIE) task aims at extracting structured facts from unstructured text.
Despite the potential of large language models (LLMs) like ChatGPT as a general task solver, they lag behind state-of-the-art (supervised) methods in OIE tasks.
arXiv Detail & Related papers (2023-09-07T01:35:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.