R&D evaluation methodology based on group-AHP with uncertainty
- URL: http://arxiv.org/abs/2108.02595v2
- Date: Mon, 22 Nov 2021 16:31:23 GMT
- Title: R&D evaluation methodology based on group-AHP with uncertainty
- Authors: Alberto Garinei, Emanuele Piccioni, Massimiliano Proietti, Andrea
Marini, Stefano Speziali, Marcello Marconi, Raffaella Di Sante, Sara
Casaccia, Paolo Castellini, Milena Martarelli, Nicola Paone, Gian Marco
Revel, Lorenzo Scalise, Marco Arnesano, Paolo Chiariotti, Roberto Montanini,
Antonino Quattrocchi, Sergio Silvestri, Giorgio Ficco, Emanuele Rizzuto,
Andrea Scorza, Matteo Lancini, Gianluca Rossi, Roberto Marsili, Emanuele
Zappa, Salvatore Sciuto, Gaetano Vacca, Laura Fabbiano
- Abstract summary: We present an approach to evaluate Research & Development (R&D) performance based on the Analytic Hierarchy Process (AHP) method.
We single out a set of indicators needed for R&D performance evaluation.
The numerical values associated with all the indicators are then used to assign a score to a given R&D project.
- Score: 0.17689918341582753
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present an approach to evaluate Research & Development
(R&D) performance based on the Analytic Hierarchy Process (AHP) method.
Through a set of questionnaires submitted to a team of experts, we single out a
set of indicators needed for R&D performance evaluation. The indicators,
together with the corresponding criteria, form the basic hierarchical structure
of the AHP method. The numerical values associated with all the indicators are
then used to assign a score to a given R&D project. To aggregate the values
taken on by the different indicators consistently, we map each of them to a
dimensionless quantity lying in the unit interval. This is achieved by
employing the empirical Cumulative Distribution Function (CDF) of each
indicator. We give a thorough discussion of how to assign a score to an R&D
project, along with the corresponding uncertainty due to possible
inconsistencies in the decision process. A particular example of R&D
performance evaluation is finally considered.
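The scoring pipeline the abstract outlines (priority weights derived from expert pairwise comparisons, a consistency check that feeds the uncertainty estimate, empirical-CDF normalisation of heterogeneous indicators, and a weighted aggregate score) can be illustrated with a short numerical sketch. The snippet below is not the authors' implementation: the indicator names, the pairwise comparison matrix, and the historical indicator values are hypothetical, the weights come from the standard principal-eigenvector method, and the consistency ratio is Saaty's conventional measure rather than the paper's specific uncertainty treatment.

```python
# Minimal sketch of an AHP-based R&D scoring pipeline (illustrative, not the
# authors' code): eigenvector priority weights + consistency ratio,
# empirical-CDF normalisation of indicators to [0, 1], and a weighted
# aggregate project score.
import numpy as np

def ahp_weights(pairwise: np.ndarray):
    """Principal-eigenvector priority weights and Saaty's consistency ratio."""
    eigvals, eigvecs = np.linalg.eig(pairwise)
    k = int(np.argmax(eigvals.real))
    w = np.abs(eigvecs[:, k].real)
    w = w / w.sum()
    n = pairwise.shape[0]
    ci = (eigvals[k].real - n) / (n - 1)            # consistency index
    ri = {3: 0.58, 4: 0.90, 5: 1.12}.get(n, 1.24)   # random index (Saaty)
    return w, ci / ri                               # (weights, consistency ratio)

def empirical_cdf(sample: np.ndarray):
    """Return F(x): fraction of the reference sample <= x, a map into [0, 1]."""
    s = np.sort(sample)
    return lambda x: np.searchsorted(s, x, side="right") / s.size

# Hypothetical pairwise comparisons of three indicators on Saaty's 1-9 scale;
# rows/columns follow the order of the `history` dict below.
A = np.array([[1.0, 3.0, 5.0],
              [1/3, 1.0, 2.0],
              [1/5, 1/2, 1.0]])
weights, cr = ahp_weights(A)

# Hypothetical historical values used to build each empirical CDF, and the
# values observed for the project under evaluation.
history = {"publications": np.array([2, 5, 7, 9, 12]),
           "patents":      np.array([0, 1, 1, 3, 4]),
           "funding_keur": np.array([50, 120, 200, 350, 800])}
project = {"publications": 8, "patents": 2, "funding_keur": 300}

normalised = [empirical_cdf(history[k])(project[k]) for k in history]
score = float(np.dot(weights, normalised))

print(f"weights = {np.round(weights, 3)}, consistency ratio = {cr:.3f}")
print(f"project score in [0, 1]: {score:.3f}")
```

Conventionally, a consistency ratio above roughly 0.1 would prompt the experts to revisit their pairwise judgements; the paper instead discusses how such inconsistencies translate into an uncertainty on the final score.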
Related papers
- Can LLMs Be Trusted for Evaluating RAG Systems? A Survey of Methods and Datasets [0.0]
Retrieval-Augmented Generation (RAG) has advanced significantly in recent years.
RAG complexity poses substantial challenges for systematic evaluation and quality enhancement.
This study systematically reviews 63 academic articles to provide a comprehensive overview of state-of-the-art RAG evaluation methodologies.
arXiv Detail & Related papers (2025-04-28T08:22:19Z)
- SEOE: A Scalable and Reliable Semantic Evaluation Framework for Open Domain Event Detection [70.23196257213829]
We propose a scalable and reliable Semantic-level Evaluation framework for Open domain Event detection.
Our proposed framework first constructs a scalable evaluation benchmark that currently includes 564 event types covering 7 major domains.
We then leverage large language models (LLMs) as automatic evaluation agents to compute a semantic F1-score, incorporating fine-grained definitions of semantically similar labels.
arXiv Detail & Related papers (2025-03-05T09:37:05Z)
- SedarEval: Automated Evaluation using Self-Adaptive Rubrics [4.97150240417381]
We propose a new evaluation paradigm based on self-adaptive rubrics.
SedarEval consists of 1,000 meticulously crafted questions, each with its own self-adaptive rubric.
We train a specialized evaluator language model (evaluator LM) to supplant human graders.
arXiv Detail & Related papers (2025-01-26T16:45:09Z)
- Dissecting Out-of-Distribution Detection and Open-Set Recognition: A Critical Analysis of Methods and Benchmarks [17.520137576423593]
We aim to provide a consolidated view of the two largest sub-fields within the community: out-of-distribution (OOD) detection and open-set recognition (OSR).
We perform rigorous cross-evaluation between state-of-the-art methods in the OOD detection and OSR settings and identify a strong correlation between the performances of methods for them.
We propose a new, large-scale benchmark setting which we suggest better disentangles the problem tackled by OOD detection and OSR.
arXiv Detail & Related papers (2024-08-29T17:55:07Z)
- Top-K Pairwise Ranking: Bridging the Gap Among Ranking-Based Measures for Multi-Label Classification [120.37051160567277]
This paper proposes a novel measure named Top-K Pairwise Ranking (TKPR).
A series of analyses show that TKPR is compatible with existing ranking-based measures.
On the other hand, we establish a sharp generalization bound for the proposed framework based on a novel technique named data-dependent contraction.
arXiv Detail & Related papers (2024-07-09T09:36:37Z)
- Backdoor-based Explainable AI Benchmark for High Fidelity Evaluation of Attribution Methods [49.62131719441252]
Attribution methods compute importance scores for input features to explain the output predictions of deep models.
In this work, we first identify a set of fidelity criteria that reliable benchmarks for attribution methods are expected to fulfill.
We then introduce a Backdoor-based eXplainable AI benchmark (BackX) that adheres to the desired fidelity criteria.
arXiv Detail & Related papers (2024-05-02T13:48:37Z)
- A structured regression approach for evaluating model performance across intersectional subgroups [53.91682617836498]
Disaggregated evaluation is a central task in AI fairness assessment, where the goal is to measure an AI system's performance across different subgroups.
We introduce a structured regression approach to disaggregated evaluation that we demonstrate can yield reliable system performance estimates even for very small subgroups.
arXiv Detail & Related papers (2024-01-26T14:21:45Z)
- DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and Improvement of Large Language Models [4.953092503184905]
This work proposes DCR, an automated framework for evaluating and improving the consistency of Large Language Models (LLMs) generated texts.
We introduce an automatic metric converter (AMC) that translates the output from DCE into an interpretable numeric score.
Our approach also substantially reduces nearly 90% of output inconsistencies, showing promise for effective hallucination mitigation.
arXiv Detail & Related papers (2024-01-04T08:34:16Z)
- A Framework for Auditing Multilevel Models using Explainability Methods [2.578242050187029]
An audit framework for technical assessment of regressions is proposed.
The focus is on three aspects, model, discrimination, and transparency and explainability.
It is demonstrated that popular explainability methods, such as SHAP and LIME, underperform in accuracy when interpreting these models.
arXiv Detail & Related papers (2022-07-04T17:53:21Z)
- Multiple-criteria Heuristic Rating Estimation [0.0]
The Heuristic Rating Estimation (HRE) method, proposed in 2014, attempted to answer this question.
We analyze how HRE can be used as part of the Analytic Hierarchy Process hierarchical framework.
arXiv Detail & Related papers (2022-05-20T20:12:04Z)
- Proximal Reinforcement Learning: Efficient Off-Policy Evaluation in Partially Observed Markov Decision Processes [65.91730154730905]
In applications of offline reinforcement learning to observational data, such as in healthcare or education, a general concern is that observed actions might be affected by unobserved factors.
Here we tackle this by considering off-policy evaluation in a partially observed Markov decision process (POMDP).
We extend the framework of proximal causal inference to our POMDP setting, providing a variety of settings where identification is made possible.
arXiv Detail & Related papers (2021-10-28T17:46:14Z)
- Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary [65.37544133256499]
We propose a metric to evaluate the content quality of a summary using question-answering (QA).
We demonstrate the experimental benefits of QA-based metrics through an analysis of our proposed metric, QAEval.
arXiv Detail & Related papers (2020-10-01T15:33:09Z)
- Uncertainty-aware Score Distribution Learning for Action Quality Assessment [91.05846506274881]
We propose an uncertainty-aware score distribution learning (USDL) approach for action quality assessment (AQA).
Specifically, we regard an action as an instance associated with a score distribution, which describes the probability of different evaluated scores.
Under the circumstance where fine-grained score labels are available, we devise a multi-path uncertainty-aware score distributions learning (MUSDL) method to explore the disentangled components of a score.
arXiv Detail & Related papers (2020-06-13T15:41:29Z)
- On the Ambiguity of Rank-Based Evaluation of Entity Alignment or Link Prediction Methods [27.27230441498167]
We take a closer look at the evaluation of two families of methods for enriching information from knowledge graphs: Link Prediction and Entity Alignment.
In particular, we demonstrate that all existing scores can hardly be used to compare results across different datasets.
We show that this leads to various problems in the interpretation of results, which may support misleading conclusions.
arXiv Detail & Related papers (2020-02-17T12:26:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.