The ELEVATE-AI LLMs Framework: An Evaluation Framework for Use of Large Language Models in HEOR: an ISPOR Working Group Report
- URL: http://arxiv.org/abs/2501.12394v1
- Date: Mon, 23 Dec 2024 14:09:10 GMT
- Title: The ELEVATE-AI LLMs Framework: An Evaluation Framework for Use of Large Language Models in HEOR: an ISPOR Working Group Report
- Authors: Rachael L. Fleurence, Dalia Dawoud, Jiang Bian, Mitchell K. Higashi, Xiaoyan Wang, Hua Xu, Jagpreet Chhatwal, Turgay Ayer,
- Abstract summary: This article introduces the ELEVATE AI LLMs framework and checklist. The framework comprises ten evaluation domains, including model characteristics, accuracy, comprehensiveness, and fairness. Validation of the framework and checklist on studies of systematic literature reviews and health economic modeling highlighted their ability to identify strengths and gaps in reporting.
- Score: 12.204470166456561
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Introduction. Generative Artificial Intelligence, particularly large language models (LLMs), offers transformative potential for Health Economics and Outcomes Research (HEOR). However, evaluating the quality, transparency, and rigor of LLM-assisted research lacks standardized guidance. This article introduces the ELEVATE AI LLMs framework and checklist, designed to support researchers and reviewers in assessing LLM use in HEOR. Methods. The ELEVATE AI LLMs framework was developed through a targeted review of existing guidelines and evaluation frameworks. The framework comprises ten evaluation domains, including model characteristics, accuracy, comprehensiveness, and fairness. The accompanying checklist operationalizes the framework. To validate the framework, we applied it to two published studies, demonstrating its usability across different HEOR tasks. Results. The ELEVATE AI LLMs framework provides a comprehensive structure for evaluating LLM-assisted research, while the checklist facilitates practical application. Validation of the framework and checklist on studies of systematic literature reviews and health economic modeling highlighted their ability to identify strengths and gaps in reporting. Limitations. While the ELEVATE AI LLMs framework provides robust guidance, its broader generalizability and applicability to diverse HEOR tasks require further empirical testing. Additionally, several metrics adapted from computer science need further validation in HEOR contexts. Conclusion. The ELEVATE AI LLMs framework and checklist fill a critical gap in HEOR by offering structured guidance for evaluating LLM-assisted research. By promoting transparency, accuracy, and reproducibility, they aim to standardize and improve the integration of LLMs into HEOR, ensuring their outputs meet the field's rigorous standards.
Related papers
- IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis [60.32962597618861]
IDA-Bench is a novel benchmark evaluating large language models in multi-round interactive scenarios. Agent performance is judged by comparing its final numerical output to the human-derived baseline. Even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on only 50% of the tasks, highlighting limitations not evident in single-turn tests.
arXiv Detail & Related papers (2025-05-23T09:37:52Z) - Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey [49.1574468325115]
We conduct a comprehensive survey and propose a systematic taxonomy for LALM evaluations. We provide detailed overviews within each category and highlight challenges in this field. We will release the collection of the surveyed papers and actively maintain it to support ongoing advancements in the field.
arXiv Detail & Related papers (2025-05-21T19:17:29Z) - TaMPERing with Large Language Models: A Field Guide for using Generative AI in Public Administration Research [0.0]
The integration of Large Language Models (LLMs) into social science research presents transformative opportunities for advancing scientific inquiry. This manuscript introduces the TaMPER framework, a structured methodology organized around five critical decision points: Task, Model, Prompt, Evaluation, and Reporting.
arXiv Detail & Related papers (2025-03-30T21:38:11Z) - Enhanced Bloom's Educational Taxonomy for Fostering Information Literacy in the Era of Large Language Models [16.31527042425208]
This paper proposes an LLM-driven Bloom's Educational Taxonomy that aims to recognize and evaluate students' information literacy (IL) with Large Language Models (LLMs).
The framework delineates the IL corresponding to the cognitive abilities required to use LLM into two distinct stages: Exploration & Action and Creation & Metacognition.
arXiv Detail & Related papers (2025-03-25T08:23:49Z) - Towards Efficient Educational Chatbots: Benchmarking RAG Frameworks [2.362412515574206]
Large Language Models (LLMs) have proven immensely beneficial in education by capturing vast amounts of literature-based information. We propose a generative AI-powered GATE question-answering framework that leverages LLMs to explain GATE solutions and support students in their exam preparation.
arXiv Detail & Related papers (2025-03-02T08:11:07Z) - CodEv: An Automated Grading Framework Leveraging Large Language Models for Consistent and Constructive Feedback [0.0]
This study presents an automated grading framework, CodEv, which leverages Large Language Models (LLMs) to provide consistent and constructive feedback. Our framework also integrates LLM ensembles to improve the accuracy and consistency of scores, along with agreement tests to deliver reliable feedback and code review comments.
arXiv Detail & Related papers (2025-01-10T03:09:46Z) - Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-Oasis [78.07225438556203]
We introduce LLM-Oasis, the largest resource for training end-to-end factuality evaluators. It is constructed by extracting claims from Wikipedia, falsifying a subset of these claims, and generating pairs of factual and unfactual texts. We then rely on human annotators to both validate the quality of our dataset and to create a gold standard test set for factuality evaluation systems.
arXiv Detail & Related papers (2024-11-29T12:21:15Z) - A Framework for Using LLMs for Repository Mining Studies in Empirical Software Engineering [12.504438766461027]
Large Language Models (LLMs) have transformed Software Engineering (SE) by providing innovative methods for analyzing software repositories. Our research introduces a framework, coined Prompt Refinement and Insights for Mining Empirical Software repositories (PRIMES). Our findings indicate that standardizing prompt engineering and using PRIMES can enhance the reliability and accuracy of studies utilizing LLMs.
arXiv Detail & Related papers (2024-11-15T06:08:57Z) - Evaluating Large Language Models on Financial Report Summarization: An Empirical Study [9.28042182186057]
We conduct a comparative study on three state-of-the-art Large Language Models (LLMs).
Our primary motivation is to explore how these models can be harnessed within finance, a field demanding precision, contextual relevance, and robustness against erroneous or misleading information.
We introduce an innovative evaluation framework that integrates both quantitative metrics (e.g., precision, recall) and qualitative analyses (e.g., contextual fit, consistency) to provide a holistic view of each model's output quality.
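An evaluation framework of this kind rests on standard quantitative metrics. As a minimal illustration of the quantitative side (our own sketch with hypothetical facts; the paper's actual metric definitions and data may differ), precision and recall over facts extracted from a model-generated summary can be computed as:

```python
def precision_recall(extracted: set[str], reference: set[str]) -> tuple[float, float]:
    """Precision/recall of facts in a model summary vs. a gold reference set."""
    true_pos = len(extracted & reference)          # facts the model got right
    precision = true_pos / len(extracted) if extracted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    return precision, recall

# Hypothetical facts from a model's financial-report summary vs. a gold set
model_facts = {"revenue up 8%", "EPS $1.20", "guidance raised"}
gold_facts = {"revenue up 8%", "EPS $1.20", "dividend unchanged", "guidance raised"}
p, r = precision_recall(model_facts, gold_facts)
print(p, r)  # 1.0 0.75
```

The qualitative analyses (contextual fit, consistency) would complement these counts with human or rubric-based judgments.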
arXiv Detail & Related papers (2024-11-11T10:36:04Z) - TRACE: TRansformer-based Attribution using Contrastive Embeddings in LLMs [50.259001311894295]
We propose a novel TRansformer-based Attribution framework using Contrastive Embeddings called TRACE.
We show that TRACE significantly improves the ability to attribute sources accurately, making it a valuable tool for enhancing the reliability and trustworthiness of large language models.
arXiv Detail & Related papers (2024-07-06T07:19:30Z) - Systematic Task Exploration with LLMs: A Study in Citation Text Generation [63.50597360948099]
Large language models (LLMs) bring unprecedented flexibility in defining and executing complex, creative natural language generation (NLG) tasks.
We propose a three-component research framework that consists of systematic input manipulation, reference data, and output measurement.
We use this framework to explore citation text generation -- a popular scholarly NLP task that lacks consensus on the task definition and evaluation metric.
arXiv Detail & Related papers (2024-07-04T16:41:08Z) - A Survey of Useful LLM Evaluation [20.048914787813263]
Two-stage framework: from "core ability" to "agent".
In the "core ability" stage, we discussed the reasoning ability, societal impact, and domain knowledge of LLMs.
In the "agent" stage, we demonstrated embodied action, planning, and tool learning of LLM agent applications.
arXiv Detail & Related papers (2024-06-03T02:20:03Z) - Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing [56.75702900542643]
We introduce AlphaLLM for the self-improvements of Large Language Models.
It integrates Monte Carlo Tree Search (MCTS) with LLMs to establish a self-improving loop.
Our experimental results show that AlphaLLM significantly enhances the performance of LLMs without additional annotations.
arXiv Detail & Related papers (2024-04-18T15:21:34Z) - LLM Inference Unveiled: Survey and Roofline Model Insights [62.92811060490876]
Large Language Model (LLM) inference is rapidly evolving, presenting a unique blend of opportunities and challenges.
Our survey stands out from traditional literature reviews by not only summarizing the current state of research but also by introducing a framework based on roofline model.
This framework identifies the bottlenecks when deploying LLMs on hardware devices and provides a clear understanding of practical problems.
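The roofline model bounds attainable throughput by the lesser of peak compute and memory traffic. A minimal sketch of that idea, using illustrative hardware numbers of our own choosing (not figures from the survey):

```python
def roofline_attainable_flops(peak_flops: float, bandwidth_bytes_per_s: float,
                              arithmetic_intensity: float) -> float:
    """Attainable FLOP/s for a kernel with the given FLOPs-per-byte ratio:
    min(peak compute, memory bandwidth * arithmetic intensity)."""
    return min(peak_flops, bandwidth_bytes_per_s * arithmetic_intensity)

# Hypothetical accelerator: 100 TFLOP/s peak, 2 TB/s memory bandwidth
peak, bw = 100e12, 2e12
# Low arithmetic intensity (e.g. autoregressive decode): memory-bound
print(roofline_attainable_flops(peak, bw, 1.0))    # bandwidth-limited
# High arithmetic intensity (e.g. large batched matmuls): compute-bound
print(roofline_attainable_flops(peak, bw, 200.0))  # peak-limited
```

In this picture, LLM inference phases with few FLOPs per byte moved sit under the bandwidth roof, which is why the model is useful for spotting deployment bottlenecks.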
arXiv Detail & Related papers (2024-02-26T07:33:05Z) - PRE: A Peer Review Based Large Language Model Evaluator [14.585292530642603]
Existing paradigms rely on either human annotators or model-based evaluators to evaluate the performance of LLMs.
We propose a novel framework that can automatically evaluate LLMs through a peer-review process.
arXiv Detail & Related papers (2024-01-28T12:33:14Z) - FAIR Enough: How Can We Develop and Assess a FAIR-Compliant Dataset for Large Language Models' Training? [3.0406004578714008]
The rapid evolution of Large Language Models highlights the necessity for ethical considerations and data integrity in AI development.
While FAIR principles are crucial for ethical data stewardship, their specific application in the context of LLM training data remains an under-explored area.
We propose a novel framework designed to integrate FAIR principles into the LLM development lifecycle.
arXiv Detail & Related papers (2024-01-19T21:21:02Z) - Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks.
How do we evaluate the capabilities of LLMs to consistently produce factually correct answers?
We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z) - Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity [61.54815512469125]
This survey addresses the crucial issue of factuality in Large Language Models (LLMs).
As LLMs find applications across diverse domains, the reliability and accuracy of their outputs become vital.
arXiv Detail & Related papers (2023-10-11T14:18:03Z) - Improving Open Information Extraction with Large Language Models: A Study on Demonstration Uncertainty [52.72790059506241]
The Open Information Extraction (OIE) task aims at extracting structured facts from unstructured text.
Despite the potential of large language models (LLMs) like ChatGPT as a general task solver, they lag behind state-of-the-art (supervised) methods in OIE tasks.
arXiv Detail & Related papers (2023-09-07T01:35:24Z) - LLMRec: Benchmarking Large Language Models on Recommendation Task [54.48899723591296]
The application of Large Language Models (LLMs) in the recommendation domain has not been thoroughly investigated.
We benchmark several popular off-the-shelf LLMs on five recommendation tasks, including rating prediction, sequential recommendation, direct recommendation, explanation generation, and review summarization.
The benchmark results indicate that LLMs displayed only moderate proficiency in accuracy-based tasks such as sequential and direct recommendation.
arXiv Detail & Related papers (2023-08-23T16:32:54Z) - Through the Lens of Core Competency: Survey on Evaluation of Large Language Models [27.271533306818732]
Large language models (LLMs) have excellent performance and wide practical uses.
Existing evaluation tasks struggle to keep up with the wide range of applications in real-world scenarios.
We summarize 4 core competencies of LLM, including reasoning, knowledge, reliability, and safety.
Under this competency architecture, similar tasks are combined to reflect corresponding ability, while new tasks can also be easily added into the system.
arXiv Detail & Related papers (2023-08-15T17:40:34Z) - Development of the ChatGPT, Generative Artificial Intelligence and Natural Large Language Models for Accountable Reporting and Use (CANGARU) Guidelines [0.33249867230903685]
CANGARU aims to foster a cross-disciplinary global consensus on the ethical use, disclosure, and proper reporting of GAI/GPT/LLM technologies in academia.
The present protocol consists of an ongoing systematic review of GAI/GPT/LLM applications to understand the linked ideas, findings, and reporting standards in scholarly research, and to formulate guidelines for its use and disclosure.
arXiv Detail & Related papers (2023-07-18T05:12:52Z) - A Survey on Evaluation of Large Language Models [87.60417393701331]
Large language models (LLMs) are gaining increasing popularity in both academia and industry.
This paper focuses on three key dimensions: what to evaluate, where to evaluate, and how to evaluate.
arXiv Detail & Related papers (2023-07-06T16:28:35Z) - Investigating Fairness Disparities in Peer Review: A Language Model Enhanced Approach [77.61131357420201]
We conduct a thorough and rigorous study on fairness disparities in peer review with the help of large language models (LMs).
We collect, assemble, and maintain a comprehensive relational database for the International Conference on Learning Representations (ICLR) conference from 2017 to date.
We postulate and study fairness disparities on multiple protected attributes of interest, including author gender, geography, and author and institutional prestige.
arXiv Detail & Related papers (2022-11-07T16:19:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.