CIRCLE: A Framework for Evaluating AI from a Real-World Lens
- URL: http://arxiv.org/abs/2602.24055v2
- Date: Tue, 03 Mar 2026 18:25:54 GMT
- Title: CIRCLE: A Framework for Evaluating AI from a Real-World Lens
- Authors: Reva Schwartz, Carina Westling, Morgan Briggs, Marzieh Fadaee, Isar Nejadgholi, Matthew Holmes, Fariza Rashid, Maya Carlyle, Afaf Taïk, Kyra Wilson, Peter Douglas, Theodora Skeadas, Gabriella Waters, Rumman Chowdhury, Thiago Lacerda
- Abstract summary: CIRCLE aims to bridge the gap between model-centric performance metrics and AI's materialized outcomes in deployment. CIRCLE provides a structured, prospective protocol for linking context-sensitive qualitative insights to scalable quantitative metrics.
- Score: 10.028017198571833
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper proposes CIRCLE, a six-stage, lifecycle-based framework to bridge the reality gap between model-centric performance metrics and AI's materialized outcomes in deployment. While existing frameworks like MLOps focus on system stability and benchmarks measure abstract capabilities, decision-makers outside the AI stack lack systematic evidence about the behavior of AI technologies under real-world user variability and constraints. CIRCLE operationalizes the Validation phase of TEVV (Test, Evaluation, Verification, and Validation) by formalizing the translation of stakeholder concerns outside the stack into measurable signals. Unlike participatory design, which often remains localized, or algorithmic audits, which are often retrospective, CIRCLE provides a structured, prospective protocol for linking context-sensitive qualitative insights to scalable quantitative metrics. By integrating methods such as field testing, red teaming, and longitudinal studies into a coordinated pipeline, CIRCLE produces systematic knowledge: evidence that is comparable across sites yet sensitive to local context. This can enable governance based on materialized downstream effects rather than theoretical capabilities.
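The abstract's central move, formalizing the translation of stakeholder concerns outside the stack into measurable signals, can be pictured as a simple record type. A minimal sketch, assuming a flat concern/observable/metric/method schema; the field names and example values are illustrative, not CIRCLE's actual protocol:

```python
from dataclasses import dataclass

@dataclass
class ConcernSignal:
    # All fields are hypothetical; CIRCLE's own schema may differ.
    concern: str     # qualitative stakeholder concern, elicited outside the AI stack
    observable: str  # behavior observable under real-world use
    metric: str      # scalable quantitative proxy for the observable
    method: str      # collection method: field test, red teaming, longitudinal study

# One record in a coordinated evaluation pipeline (example values invented).
pipeline = [
    ConcernSignal(
        concern="users over-rely on confident but incorrect answers",
        observable="acceptance of wrong responses without verification",
        metric="rate of unverified acceptance of incorrect outputs",
        method="longitudinal study",
    ),
]
```

Keeping the qualitative concern and its quantitative proxy in one record is what would make the resulting evidence comparable across sites while staying traceable to local context.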
Related papers
- Leveraging LLM Parametric Knowledge for Fact Checking without Retrieval [60.25608870901428]
Trustworthiness is a core research challenge for agentic AI systems built on Large Language Models (LLMs). We propose the task of fact-checking without retrieval, focusing on the verification of arbitrary natural language claims, independent of their source robustness.
arXiv Detail & Related papers (2026-03-05T18:42:51Z)
- Towards Worst-Case Guarantees with Scale-Aware Interpretability [58.519943565092724]
Neural networks organize information according to the hierarchical, multi-scale structure of natural data. We propose a unifying research agenda -- *scale-aware interpretability* -- to develop formal machinery and interpretability tools.
arXiv Detail & Related papers (2026-02-05T01:22:31Z)
- Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models [122.58252919699122]
Mechanistic Interpretability (MI) has emerged as a vital approach to demystifying the decision-making of Large Language Models (LLMs). We present a practical survey structured around this pipeline, accompanied by the resource list "Awesome-interventionable-MI-Survey".
arXiv Detail & Related papers (2026-01-20T14:23:23Z)
- Explainable Neural Inverse Kinematics for Obstacle-Aware Robotic Manipulation: A Comparative Analysis of IKNet Variants [0.28544513613730205]
Deep neural networks have accelerated inverse-kinematics (IK) inference to the point where low-cost manipulators can execute complex trajectories in real time. This study proposes an explainability-centered workflow that integrates Shapley-value attribution with physics-based obstacle-avoidance evaluation.
arXiv Detail & Related papers (2025-12-29T09:02:02Z)
- Variance-Bounded Evaluation of Entity-Centric AI Systems Without Ground Truth: Theory and Measurement [0.0]
We introduce VB-Score, a variance-bounded evaluation framework for entity-centric AI systems. VB-Score enumerates plausible interpretations through constraint relaxation and Monte Carlo sampling. It then evaluates system outputs by their expected success across interpretations, penalized by variance to assess the robustness of the system.
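The summary suggests a score of the form "expected success minus a variance penalty". A minimal sketch, assuming equiprobable interpretations, a binary success function, and a penalty weight `lam` (all assumptions; the paper's exact formulation may differ):

```python
import random

def vb_score(interpretations, output, success_fn, lam=1.0, n_samples=1000, seed=0):
    """Monte Carlo estimate of expected success across plausible
    interpretations, penalized by its variance (penalty form assumed)."""
    rng = random.Random(seed)
    successes = [success_fn(rng.choice(interpretations), output)
                 for _ in range(n_samples)]
    mean = sum(successes) / n_samples
    var = sum((s - mean) ** 2 for s in successes) / n_samples
    return mean - lam * var

# Ambiguous entity query: "Paris" could denote either city.
interps = ["Paris, France", "Paris, Texas"]
score = vb_score(interps, "Paris, France",
                 lambda interp, out: 1.0 if interp == out else 0.0)
```

Under this sketch an output that satisfies every interpretation scores higher than one that gambles on a single reading, since its success has zero variance; that is the robustness the variance penalty rewards.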
arXiv Detail & Related papers (2025-09-26T07:54:38Z)
- Technical Report: Facilitating the Adoption of Causal Inference Methods Through LLM-Empowered Co-Pilot [44.336297829718795]
We introduce CATE-B, an open-source co-pilot system that uses large language models (LLMs) within an agentic framework to guide users through treatment effect estimation. CATE-B assists in (i) constructing a structural causal model via causal discovery and LLM-based edge orientation, (ii) identifying robust adjustment sets through a novel Minimal Uncertainty Adjustment Set criterion, and (iii) selecting appropriate regression methods tailored to the causal structure and dataset characteristics.
arXiv Detail & Related papers (2025-08-14T12:20:51Z)
- SOI is the Root of All Evil: Quantifying and Breaking Similar Object Interference in Single Object Tracking [25.076012214989433]
We present the first systematic investigation and quantification of Similar Object Interference (SOI). Eliminating interference sources leads to substantial performance improvements (AUC gains of up to 4.35) across all SOTA trackers. We construct SOIBench, the first semantic cognitive guidance benchmark specifically targeting SOI challenges.
arXiv Detail & Related papers (2025-08-13T06:12:43Z)
- Evaluations at Work: Measuring the Capabilities of GenAI in Use [28.124088786766965]
Current AI benchmarks miss the messy, multi-turn nature of human-AI collaboration. We present an evaluation framework that decomposes real-world tasks into interdependent subtasks.
arXiv Detail & Related papers (2025-05-15T23:06:23Z)
- Cooperative Resilience in Artificial Intelligence Multiagent Systems [2.0608564715600273]
This paper proposes a clear definition of 'cooperative resilience' and a methodology for its quantitative measurement.
The results highlight the crucial role of resilience metrics in analyzing how the collective system prepares for, resists, recovers from, sustains well-being, and transforms in the face of disruptions.
arXiv Detail & Related papers (2024-09-20T03:28:48Z)
- CELA: Cost-Efficient Language Model Alignment for CTR Prediction [70.65910069412944]
Click-Through Rate (CTR) prediction holds a paramount position in recommender systems. Recent efforts have sought to mitigate these challenges by integrating Pre-trained Language Models (PLMs). We propose Cost-Efficient Language Model Alignment (CELA) for CTR prediction.
arXiv Detail & Related papers (2024-05-17T07:43:25Z) - Metrics reloaded: Recommendations for image analysis validation [59.60445111432934]
Metrics Reloaded is a comprehensive framework guiding researchers in the problem-aware selection of metrics.
The framework was developed in a multi-stage Delphi process and is based on the novel concept of a problem fingerprint.
Based on the problem fingerprint, users are guided through the process of choosing and applying appropriate validation metrics.
arXiv Detail & Related papers (2022-06-03T15:56:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all listed papers) and is not responsible for any consequences of its use.