Toward Trustworthy Evaluation of Sustainability Rating Methodologies: A Human-AI Collaborative Framework for Benchmark Dataset Construction
- URL: http://arxiv.org/abs/2602.17106v1
- Date: Thu, 19 Feb 2026 06:04:28 GMT
- Title: Toward Trustworthy Evaluation of Sustainability Rating Methodologies: A Human-AI Collaborative Framework for Benchmark Dataset Construction
- Authors: Xiaoran Cai, Wang Yang, Xiyu Ren, Chekun Law, Rohit Sharma, Peng Qi
- Abstract summary: We propose adopting a universal human-AI collaboration framework to generate trustworthy benchmark datasets for evaluating sustainability rating methodologies. We provide principled criteria and a scoring system that guide the construction of firm-level benchmark datasets. We call on the broader AI community to adopt AI-powered approaches to strengthen and advance sustainability rating methodologies.
- Score: 4.26631223053023
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sustainability or ESG rating agencies use company disclosures and external data to produce scores or ratings that assess the environmental, social, and governance performance of a company. However, sustainability ratings across agencies for a single company vary widely, limiting their comparability, credibility, and relevance to decision-making. To harmonize the rating results, we propose adopting a universal human-AI collaboration framework to generate trustworthy benchmark datasets for evaluating sustainability rating methodologies. The framework comprises two complementary parts: STRIDE (Sustainability Trust Rating & Integrity Data Equation) provides principled criteria and a scoring system that guide the construction of firm-level benchmark datasets using large language models (LLMs), and SR-Delta, a discrepancy-analysis procedural framework that surfaces insights for potential adjustments. The framework enables scalable and comparable assessment of sustainability rating methodologies. We call on the broader AI community to adopt AI-powered approaches to strengthen and advance sustainability rating methodologies that support and enforce urgent sustainability agendas.
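The abstract names two components, STRIDE (criteria and a scoring system for LLM-assisted construction of firm-level benchmark datasets) and SR-Delta (a discrepancy-analysis procedure), but does not spell out their internals. The sketch below is only a minimal illustration of how such a pipeline might be wired together; the criterion names, weights, the `trust_score` aggregation, and the example numbers are hypothetical placeholders, not the paper's actual definitions.
```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical STRIDE-style criteria and weights; the paper's actual
# criteria and scoring system are not specified in the abstract.
CRITERIA_WEIGHTS = {
    "evidence_traceability": 0.4,   # claims backed by cited disclosures
    "data_completeness": 0.3,       # coverage of E, S, and G dimensions
    "temporal_consistency": 0.3,    # figures consistent across reporting years
}

@dataclass
class BenchmarkEntry:
    firm: str
    criterion_scores: dict  # criterion name -> score in [0, 1]

    def trust_score(self) -> float:
        """Weighted aggregate of per-criterion scores (illustrative only)."""
        return sum(CRITERIA_WEIGHTS[c] * s for c, s in self.criterion_scores.items())

def sr_delta(benchmark: dict, agency_ratings: dict) -> dict:
    """SR-Delta-style discrepancy report (sketch): compare each agency's
    normalized rating for a firm against the benchmark score and record
    the per-agency deviation plus the cross-agency spread."""
    report = {}
    for firm, bench_score in benchmark.items():
        ratings = agency_ratings.get(firm, {})
        deltas = {agency: r - bench_score for agency, r in ratings.items()}
        spread = max(ratings.values()) - min(ratings.values()) if ratings else 0.0
        report[firm] = {
            "deltas": deltas,
            "cross_agency_spread": spread,
            "mean_abs_delta": mean(abs(d) for d in deltas.values()) if deltas else 0.0,
        }
    return report

# Example: benchmark scores would come from LLM-assisted, human-verified
# review of disclosures; agency ratings are assumed normalized to [0, 1].
entry = BenchmarkEntry("ExampleCo", {"evidence_traceability": 0.8,
                                     "data_completeness": 0.7,
                                     "temporal_consistency": 0.9})
benchmark = {"ExampleCo": entry.trust_score()}
agencies = {"ExampleCo": {"AgencyA": 0.55, "AgencyB": 0.85}}
print(sr_delta(benchmark, agencies))
```
In the paper's framing, the per-criterion scores would be produced with LLM assistance under human review, and SR-Delta's output would surface insights for adjusting rating methodologies; the actual procedure may differ from this simple delta-and-spread report.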
Related papers
- A Theoretical Framework for Adaptive Utility-Weighted Benchmarking [0.0]
This paper introduces a theoretical framework that reconceptualizes benchmarking as a multilayer, adaptive network linking evaluation metrics, model components, and stakeholder groups through weighted interactions. Using conjoint-derived utilities and a human-in-the-loop update rule, we formalize how human tradeoffs can be embedded into benchmark structure and how benchmarks can evolve dynamically while preserving stability and interpretability.
arXiv Detail & Related papers (2026-02-12T19:33:47Z)
- A Framework for Data Valuation and Monetisation [0.0]
This paper introduces a unified valuation framework that integrates economic, governance, and strategic perspectives into a coherent decision-support model. The model combines qualitative scoring, cost- and utility-based estimation, relevance/quality indexing, and multi-criteria weighting to define data value transparently and systematically.
arXiv Detail & Related papers (2025-12-08T15:57:26Z)
- CCE: Confidence-Consistency Evaluation for Time Series Anomaly Detection [56.302586730134806]
We introduce Confidence-Consistency Evaluation (CCE), a novel evaluation metric. CCE simultaneously measures prediction confidence and uncertainty consistency. We also establish RankEval, a benchmark for comparing the ranking capabilities of various metrics.
arXiv Detail & Related papers (2025-09-01T03:38:38Z)
- CPO: Addressing Reward Ambiguity in Role-playing Dialogue via Comparative Policy Optimization [53.79487826635141]
Reinforcement Learning Fine-Tuning (RLFT) has achieved notable success in tasks with objectively verifiable answers, but it struggles with open-ended subjective tasks such as role-playing dialogue. Traditional reward modeling approaches, which rely on independent sample-wise scoring, face dual challenges: subjective evaluation criteria and unstable reward signals. Motivated by the insight that human evaluation inherently combines explicit criteria with implicit comparative judgments, we propose Comparative Policy Optimization.
arXiv Detail & Related papers (2025-08-12T16:49:18Z)
- Evaluating DAO Sustainability and Longevity Through On-Chain Governance Metrics [2.114921680609289]
Decentralised Autonomous Organisations (DAOs) automate governance and resource allocation through smart contracts, aiming to shift decision-making to distributed token holders. This paper addresses these issues by identifying research gaps in financial evaluation and introducing a framework of Key Performance Indicators. We apply the framework to a custom-built dataset of real-world DAOs constructed from on-chain data and analysed using non-parametric methods. The results reveal recurring governance patterns, including low participation rates and high proposer concentration, which may undermine long-term viability.
arXiv Detail & Related papers (2025-04-15T16:13:20Z)
- SEOE: A Scalable and Reliable Semantic Evaluation Framework for Open Domain Event Detection [70.23196257213829]
We propose a scalable and reliable Semantic-level Evaluation framework for Open domain Event detection. Our proposed framework first constructs a scalable evaluation benchmark that currently includes 564 event types covering 7 major domains. We then leverage large language models (LLMs) as automatic evaluation agents to compute a semantic F1-score, incorporating fine-grained definitions of semantically similar labels.
arXiv Detail & Related papers (2025-03-05T09:37:05Z)
- On the Trustworthiness of Generative Foundation Models: Guideline, Assessment, and Perspective [377.2483044466149]
Generative Foundation Models (GenFMs) have emerged as transformative tools. Their widespread adoption raises critical concerns regarding trustworthiness across dimensions. This paper presents a comprehensive framework to address these challenges through three key contributions.
arXiv Detail & Related papers (2025-02-20T06:20:36Z)
- Improving the Validity and Practical Usefulness of AI/ML Evaluations Using an Estimands Framework [2.4861619769660637]
We propose an estimands framework adapted from international clinical trials guidelines.
This framework provides a systematic structure for inference and reporting in evaluations.
We demonstrate how the framework can help uncover underlying issues, their causes, and potential solutions.
arXiv Detail & Related papers (2024-06-14T18:47:37Z)
- Are we making progress in unlearning? Findings from the first NeurIPS unlearning competition [70.60872754129832]
The first NeurIPS competition on unlearning sought to stimulate the development of novel algorithms.
Nearly 1,200 teams from across the world participated.
We analyze top solutions and delve into discussions on benchmarking unlearning.
arXiv Detail & Related papers (2024-06-13T12:58:00Z)
- A Backdoor-based Explainable AI Benchmark for High Fidelity Evaluation of Attributions [60.06461883533697]
We first identify a set of fidelity criteria that reliable benchmarks for attribution methods are expected to fulfill. We then introduce a Backdoor-based eXplainable AI benchmark (BackX) that adheres to the desired fidelity criteria. Our analysis also offers insights into defending against neural Trojans by utilizing the attributions.
arXiv Detail & Related papers (2024-05-02T13:48:37Z)
- Literature Review of Current Sustainability Assessment Frameworks and Approaches for Organizations [10.045497511868172]
This systematic literature review explores sustainability assessment frameworks (SAFs) across diverse industries.
The review focuses on SAF design approaches including the methods used for Sustainability Indicator (SI) selection, relative importance assessment, and interdependency analysis.
arXiv Detail & Related papers (2024-03-07T18:14:52Z)
- Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z)
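As a rough illustration of the ACU idea in the last entry above: a summary's salience score can be computed as the fraction of reference Atomic Content Units it supports. In RoSE that support judgment comes from human annotators; the token-overlap matcher below is only a naive stand-in for that judgment, and the example ACUs are invented.
```python
def acu_matched(acu: str, summary: str, threshold: float = 0.6) -> bool:
    """Naive stand-in for a human (or entailment-based) judgment of whether
    a summary supports an atomic content unit: token-overlap ratio."""
    acu_tokens = set(acu.lower().split())
    summary_tokens = set(summary.lower().split())
    return len(acu_tokens & summary_tokens) / max(len(acu_tokens), 1) >= threshold

def acu_score(acus: list[str], summary: str) -> float:
    """Fraction of reference ACUs supported by the summary (recall-style score)."""
    if not acus:
        return 0.0
    return sum(acu_matched(a, summary) for a in acus) / len(acus)

acus = ["the company cut emissions by 12%", "the report covers fiscal year 2023"]
print(acu_score(acus, "In fiscal year 2023 the company cut emissions by 12 percent."))
```
Replacing `acu_matched` with human annotation or an entailment model would move this sketch closer to the protocol the paper actually describes.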
This list is automatically generated from the titles and abstracts of the papers on this site.