Related papers: SAGE: Scalable AI Governance & Evaluation

URL: http://arxiv.org/abs/2602.07840v2
Date: Tue, 10 Feb 2026 03:26:08 GMT
Title: SAGE: Scalable AI Governance & Evaluation
Authors: Benjamin Le, Xueying Lu, Nick Stern, Wenqiong Liu, Igor Lapchuk, Xiang Li, Baofen Zheng, Kevin Rosenberg, Jiewen Huang, Zhe Zhang, Abraham Cabangbang, Satej Milind Wagle, Jianqiang Shen, Raghavan Muthuregunathan, Abhinav Gupta, Mathew Teoh, Andrew Kirk, Thomas Kwan, Jingwei Wu, Wenjing Zhang,
Abstract summary: SAGE is a framework that operationalizes high-quality human product judgment as a scalable evaluation signal. SAGE was deployed within LinkedIn Search ecosystems and powered policy oversight that measured ramped model variants and detected regressions invisible to engagement metrics.
Score: 10.238041570564395
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Evaluating relevance in large-scale search systems is fundamentally constrained by the governance gap between nuanced, resource-constrained human oversight and the high-throughput requirements of production systems. While traditional approaches rely on engagement proxies or sparse manual review, these methods often fail to capture the full scope of high-impact relevance failures. We present SAGE (Scalable AI Governance & Evaluation), a framework that operationalizes high-quality human product judgment as a scalable evaluation signal. At the core of SAGE is a bidirectional calibration loop where natural-language Policy, curated Precedent, and an LLM Surrogate Judge co-evolve. SAGE systematically resolves semantic ambiguities and misalignments, transforming subjective relevance judgment into an executable, multi-dimensional rubric with near human-level agreement. To bridge the gap between frontier model reasoning and industrial-scale inference, we apply teacher-student distillation to transfer high-fidelity judgments into compact student surrogates at 92× lower cost. Deployed within LinkedIn Search ecosystems, SAGE guided model iteration through simulation-driven development, distilling policy-aligned models for online serving and enabling rapid offline evaluation. In production, it powered policy oversight that measured ramped model variants and detected regressions invisible to engagement metrics. Collectively, these drove a 0.25% lift in LinkedIn daily active users.

Related papers

Evaluating Zero-Shot and One-Shot Adaptation of Small Language Models in Leader-Follower Interaction [1.3511057160494195]
Leader-follower interaction is an important paradigm in human-robot interaction (HRI). Small language models (SLMs) offer a potential alternative, but their effectiveness for role classification in HRI has not been systematically evaluated.
arXiv Detail & Related papers (2026-02-26T18:20:26Z)
Pailitao-VL: Unified Embedding and Reranker for Real-Time Multi-Modal Industrial Search [31.824675235862532]
Pailitao-VL is a comprehensive multi-modal retrieval system engineered for real-time industrial search. We address three critical challenges in the current SOTA solution: insufficient retrieval granularity, vulnerability to environmental noise, and prohibitive efficiency-performance gap. This work demonstrates a robust and scalable path for deploying advanced MLLM-based retrieval architectures in demanding, large-scale production environments.
arXiv Detail & Related papers (2026-02-14T10:13:48Z)
ProAct: Agentic Lookahead in Interactive Environments [56.50613398808361]
ProAct is a framework that enables agents to internalize accurate lookahead reasoning through a two-stage training paradigm. We introduce Grounded LookAhead Distillation (GLAD), where the agent undergoes supervised fine-tuning on trajectories derived from environment-based search. We also propose the Monte-Carlo Critic (MC-Critic), a plug-and-play auxiliary value estimator designed to enhance policy-gradient algorithms.
arXiv Detail & Related papers (2026-02-05T05:45:16Z)
The Shadow Self: Intrinsic Value Misalignment in Large Language Model Agents [37.75212140218036]
We formalize the Loss-of-Control risk and identify the previously underexamined Intrinsic Value Misalignment (Intrinsic VM). We then introduce IMPRESS, a scenario-driven framework for systematically assessing this risk. We evaluate Intrinsic VM on 21 state-of-the-art LLM agents and find that it is a common and broadly observed safety risk across models.
arXiv Detail & Related papers (2026-01-24T07:09:50Z)
Benchmarking Contextual Understanding for In-Car Conversational Systems [0.9437812993238097]
In-Car Conversational Question Answering (ConvQA) systems significantly enhance user experience by enabling seamless voice interactions. This paper explores the use of Large Language Models (LLMs) alongside advanced prompting techniques and agent-based methods to evaluate the extent to which ConvQA system responses adhere to user utterances.
arXiv Detail & Related papers (2025-12-12T21:15:49Z)
SPACeR: Self-Play Anchoring with Centralized Reference Models [50.55045557371374]
Sim agent policies are realistic, human-like, fast, and scalable in multi-agent settings. Recent progress in imitation learning with large diffusion-based or tokenized models has shown that behaviors can be captured directly from human driving data. We propose SPACeR, a framework that leverages a pretrained tokenized autoregressive motion model as a central reference policy.
arXiv Detail & Related papers (2025-10-20T19:53:02Z)
Steerable Adversarial Scenario Generation through Test-Time Preference Alignment [58.37104890690234]
Adversarial scenario generation is a cost-effective approach for safety assessment of autonomous driving systems. We introduce a new framework named Steerable Adversarial scenario GEnerator (SAGE). SAGE enables fine-grained test-time control over the trade-off between adversariality and realism without any retraining.
arXiv Detail & Related papers (2025-09-24T13:27:35Z)
LLM-as-a-Judge: Rapid Evaluation of Legal Document Recommendation for Retrieval-Augmented Generation [40.06592175227558]
This paper investigates a principled approach to evaluating Retrieval-Augmented Generation systems in legal contexts. We find that traditional agreement metrics like Krippendorff's alpha can be misleading in the skewed distributions typical of AI system evaluations. Our findings suggest a path toward scalable, cost-effective evaluation that maintains the precision demanded by legal applications.
arXiv Detail & Related papers (2025-09-15T19:20:21Z)
STARec: An Efficient Agent Framework for Recommender Systems via Autonomous Deliberate Reasoning [54.28691219536054]
We introduce STARec, a slow-thinking augmented agent framework that endows recommender systems with autonomous deliberative reasoning capabilities. We develop anchored reinforcement training, a two-stage paradigm combining structured knowledge distillation from advanced reasoning models with preference-aligned reward shaping. Experiments on MovieLens 1M and Amazon CDs benchmarks demonstrate that STARec achieves substantial performance gains compared with state-of-the-art baselines.
arXiv Detail & Related papers (2025-08-26T08:47:58Z)
Generalising from Self-Produced Data: Model Training Beyond Human Constraints [0.0]
This paper introduces a novel framework in which AI models autonomously generate and validate new knowledge. Central to this approach is an unbounded, ungamable numeric reward that guides learning without requiring human benchmarks.
arXiv Detail & Related papers (2025-04-07T03:48:02Z)
QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement. QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights. We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15% points relative.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
Bring Your Own Data! Self-Supervised Evaluation for Large Language Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs). We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence. We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z)

This list is automatically generated from the titles and abstracts of the papers on this site.