Decision Quality Evaluation Framework at Pinterest
- URL: http://arxiv.org/abs/2602.15809v1
- Date: Tue, 17 Feb 2026 18:45:55 GMT
- Title: Decision Quality Evaluation Framework at Pinterest
- Authors: Yuqi Tian, Robert Paine, Attila Dobi, Kevin O'Sullivan, Aravindh Manickavasagam, Faisal Farooq
- Abstract summary: The framework is centered on a high-trust Golden Set (GDS) curated by subject matter experts (SMEs). We introduce an automated intelligent sampling pipeline that uses propensity scores to efficiently expand dataset coverage. The framework enables a shift from subjective assessments to a data-driven and quantitative practice for managing content safety systems.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Online platforms require robust systems to enforce content safety policies at scale. A critical component of these systems is the ability to evaluate the quality of moderation decisions made by both human agents and Large Language Models (LLMs). However, this evaluation is challenging due to the inherent trade-offs between cost, scale, and trustworthiness, along with the complexity of evolving policies. To address this, we present a comprehensive Decision Quality Evaluation Framework developed and deployed at Pinterest. The framework is centered on a high-trust Golden Set (GDS) curated by subject matter experts (SMEs), which serves as a ground truth benchmark. We introduce an automated intelligent sampling pipeline that uses propensity scores to efficiently expand dataset coverage. We demonstrate the framework's practical application in several key areas: benchmarking the cost-performance trade-offs of various LLM agents, establishing a rigorous methodology for data-driven prompt optimization, managing complex policy evolution, and ensuring the integrity of policy content prevalence metrics via continuous validation. The framework enables a shift from subjective assessments to a data-driven and quantitative practice for managing content safety systems.
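The abstract names the ingredients of the sampling pipeline without spelling out the mechanics, so the sketch below illustrates one way propensity scores could drive golden-set expansion and how agent decisions might then be scored against the SME labels. The `Decision` fields, the inverse-propensity weighting, and the `agreement_rate` helper are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (assumed, not Pinterest's code) of propensity-weighted
# sampling to expand a golden set, plus a simple agreement metric.
from dataclasses import dataclass
import random


@dataclass
class Decision:
    item_id: str
    policy: str        # e.g. "spam", "self_harm"
    agent_label: str   # label from a human agent or LLM
    propensity: float  # estimated probability of this decision pattern, in (0, 1]


def sample_for_sme_review(pool: list[Decision], budget: int, seed: int = 0) -> list[Decision]:
    """Weighted sampling without replacement (Efraimidis-Spirakis keys) that
    favors low-propensity decisions, so SME labeling effort concentrates on
    under-covered regions of the decision space."""
    rng = random.Random(seed)
    keyed = []
    for d in pool:
        weight = 1.0 / max(d.propensity, 1e-3)   # inverse-propensity weight
        keyed.append((rng.random() ** (1.0 / weight), d))
    keyed.sort(key=lambda kd: kd[0], reverse=True)
    return [d for _, d in keyed[:budget]]


def agreement_rate(agent_labels: dict[str, str], golden_labels: dict[str, str]) -> float:
    """Share of overlapping items where the agent matches the SME golden label."""
    shared = agent_labels.keys() & golden_labels.keys()
    if not shared:
        return 0.0
    return sum(agent_labels[i] == golden_labels[i] for i in shared) / len(shared)
```

Under these assumptions, the same agreement metric could be computed per LLM agent to compare cost-performance trade-offs, as the abstract describes.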
Related papers
- Reliable LLM-Based Edge-Cloud-Expert Cascades for Telecom Knowledge Systems [54.916243942641444]
Large language models (LLMs) are emerging as key enablers of automation in domains such as telecommunications.
We study an edge-cloud-expert cascaded LLM-based knowledge system that supports decision-making through a question-and-answer pipeline.
arXiv Detail & Related papers (2025-12-23T03:10:09Z)
- A Structured Evaluation Framework for Low-Code Platform Selection: A Multi-Criteria Decision Model for Enterprise Digital Transformation [0.0]
This paper presents a comprehensive evaluation framework based on five key criteria.
We propose a weighted scoring model that allows organizations to quantitatively assess and compare different low-code platforms.
arXiv Detail & Related papers (2025-10-21T12:42:11Z)
- Transparent, Evaluable, and Accessible Data Agents: A Proof-of-Concept Framework [0.0]
This article presents a modular, component-based architecture for developing and evaluating AI agents.
The system addresses core challenges in data accessibility by enabling non-technical users to interact with complex data warehouses.
A cornerstone of the design is its commitment to transparent decision-making, achieved through a multi-layered reasoning framework.
arXiv Detail & Related papers (2025-09-28T23:54:41Z)
- Learning Robust Penetration-Testing Policies under Partial Observability: A systematic evaluation [0.28675177318965045]
Penetration testing, the simulation of cyberattacks to identify security vulnerabilities, presents a sequential decision-making problem.
Partial observability invalidates the Markov property present in Markov Decision Processes.
We investigate partially observable penetration testing scenarios over host networks of varying size, aiming to better reflect real-world complexity.
arXiv Detail & Related papers (2025-09-24T11:27:54Z)
- INSEva: A Comprehensive Chinese Benchmark for Large Language Models in Insurance [48.22571187209047]
INSEva is a Chinese benchmark specifically designed for evaluating AI systems' knowledge and capabilities in insurance.
INSEva features a multi-dimensional evaluation taxonomy covering business areas, task formats, difficulty levels, and cognitive-knowledge dimensions.
Our benchmark implements tailored evaluation methods for assessing both faithfulness and completeness in open-ended responses.
arXiv Detail & Related papers (2025-08-27T03:13:40Z)
- Diverse And Private Synthetic Datasets Generation for RAG evaluation: A multi-agent framework [2.102846336724103]
Retrieval-augmented generation (RAG) systems improve large language model outputs by incorporating external knowledge, enabling more informed and context-aware responses.
This work introduces a novel multi-agent framework for generating synthetic QA datasets for RAG evaluation that prioritize semantic diversity and privacy preservation.
arXiv Detail & Related papers (2025-08-26T11:16:14Z)
- Structured Relevance Assessment for Robust Retrieval-Augmented Language Models [0.0]
We introduce a framework for structured relevance assessment that enhances RALM robustness.
Our approach employs a multi-dimensional scoring system that considers both semantic matching and source reliability.
Preliminary evaluations demonstrate significant reductions in hallucination rates and improved transparency in reasoning processes.
arXiv Detail & Related papers (2025-07-28T19:20:04Z)
- On the Trustworthiness of Generative Foundation Models: Guideline, Assessment, and Perspective [377.2483044466149]
Generative Foundation Models (GenFMs) have emerged as transformative tools.
Their widespread adoption raises critical concerns regarding trustworthiness across dimensions.
This paper presents a comprehensive framework to address these challenges through three key contributions.
arXiv Detail & Related papers (2025-02-20T06:20:36Z)
- AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons [62.374792825813394]
This paper introduces AILuminate v1.0, the first comprehensive industry-standard benchmark for assessing AI-product risk and reliability.
The benchmark evaluates an AI system's resistance to prompts designed to elicit dangerous, illegal, or undesirable behavior in 12 hazard categories.
arXiv Detail & Related papers (2025-02-19T05:58:52Z)
- SeCodePLT: A Unified Platform for Evaluating the Security of Code GenAI [58.29510889419971]
Existing benchmarks for evaluating the security risks and capabilities of code-generating large language models (LLMs) face several key limitations.
We introduce a general and scalable benchmark construction framework that begins with manually validated, high-quality seed examples and expands them via targeted mutations.
Applying this framework to Python, C/C++, and Java, we build SeCodePLT, a dataset of more than 5.9k samples spanning 44 CWE-based risk categories and three security capabilities.
arXiv Detail & Related papers (2024-10-14T21:17:22Z)
- RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [66.93260816493553]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios.
With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance.
Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z)
- Benchmarks for Deep Off-Policy Evaluation [152.28569758144022]
We present a collection of policies that can be used for benchmarking off-policy evaluation.
The goal of our benchmark is to provide a standardized measure of progress that is motivated from a set of principles.
We provide open-source access to our data and code to foster future research in this area.
arXiv Detail & Related papers (2021-03-30T18:09:33Z)