Composing Efficient, Robust Tests for Policy Selection
- URL: http://arxiv.org/abs/2306.07372v1
- Date: Mon, 12 Jun 2023 18:55:56 GMT
- Title: Composing Efficient, Robust Tests for Policy Selection
- Authors: Dustin Morrill, Thomas J. Walsh, Daniel Hernandez, Peter R. Wurman,
Peter Stone
- Abstract summary: We introduce RPOSST, an algorithm to select a small set of test cases from a larger pool.
RPOSST treats the test case selection problem as a two-player game, and prioritizes a solution with provable $k$-of-$N$ robustness.
Empirical results demonstrate that RPOSST finds a small set of test cases that identify high quality policies in a toy one-shot game, poker datasets, and a high-fidelity racing simulator.
- Score: 32.68102141512562
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern reinforcement learning systems produce many high-quality policies
throughout the learning process. However, to choose which policy to actually
deploy in the real world, they must be tested under an intractable number of
environmental conditions. We introduce RPOSST, an algorithm to select a small
set of test cases from a larger pool based on a relatively small number of
sample evaluations. RPOSST treats the test case selection problem as a
two-player game and optimizes a solution with provable $k$-of-$N$ robustness,
bounding the error relative to a test that used all the test cases in the pool.
Empirical results demonstrate that RPOSST finds a small set of test cases that
identify high quality policies in a toy one-shot game, poker datasets, and a
high-fidelity racing simulator.
Related papers
- How Many Code and Test Cases Are Enough? Evaluating Test Cases Generation from a Binary-Matrix Perspective [51.30005925128432]
evaluating test cases automatically generated by Large Language Models (LLMs) is a critical yet challenging task.<n>Existing benchmarks suffer from high computational costs, score inflation, and a bias towards trivial bugs over rare, critical faults.<n>We introduce a framework that formalizes benchmark construction as finding an optimal diagnostic basis in a binary code-test matrix.
arXiv Detail & Related papers (2025-10-09T18:29:24Z) - Reusable Test Suites for Reinforcement Learning [1.5826476446078004]
This work presents Multi-Policy Test Case Selection (MPTCS), a novel automated test suite selection method for RL environments.<n>MPTCS uses a set of policies to select a diverse collection of reusable policy-agnostic test cases that reveal typical flaws in the agents' behavior.<n>We assess the effectiveness of the difficulty score and how the method's effectiveness and cost depend on the number of policies in the set.
arXiv Detail & Related papers (2025-08-29T12:10:05Z) - CodeContests+: High-Quality Test Case Generation for Competitive Programming [14.602111331209203]
We introduce an agent system that creates high-quality test cases for competitive programming problems.<n>We apply this system to the CodeContests dataset and propose a new version with improved test cases, named CodeContests+.<n>The results indicate that CodeContests+ achieves significantly higher accuracy than CodeContests, particularly with a notably higher True Positive Rate (TPR)
arXiv Detail & Related papers (2025-06-06T07:29:01Z) - Sample, Don't Search: Rethinking Test-Time Alignment for Language Models [55.2480439325792]
We introduce QAlign, a new test-time alignment approach.
As we scale test-time compute, QAlign converges to sampling from the optimal aligned distribution for each individual prompt.
By adopting recent advances in Markov chain Monte Carlo for text generation, our method enables better-aligned outputs without modifying the underlying model or even requiring logit access.
arXiv Detail & Related papers (2025-04-04T00:41:40Z) - Realistic Evaluation of Test-Time Adaptation Algorithms: Unsupervised Hyperparameter Selection [1.4530711901349282]
Test-Time Adaptation (TTA) has emerged as a promising strategy for tackling the problem of machine learning model robustness under distribution shifts.
We evaluate existing TTA methods using surrogate-based hp-selection strategies to obtain a more realistic evaluation of their performance.
arXiv Detail & Related papers (2024-07-19T11:58:30Z) - Deep anytime-valid hypothesis testing [29.273915933729057]
We propose a general framework for constructing powerful, sequential hypothesis tests for nonparametric testing problems.
We develop a principled approach of leveraging the representation capability of machine learning models within the testing-by-betting framework.
Empirical results on synthetic and real-world datasets demonstrate that tests instantiated using our general framework are competitive against specialized baselines.
arXiv Detail & Related papers (2023-10-30T09:46:19Z) - Pareto Optimization for Active Learning under Out-of-Distribution Data
Scenarios [79.02009938011447]
We propose a sampling scheme, which selects optimal subsets of unlabeled samples with fixed batch size from the unlabeled data pool.
Experimental results show its effectiveness on both classical Machine Learning (ML) and Deep Learning (DL) tasks.
arXiv Detail & Related papers (2022-07-04T04:11:44Z) - Supervised Learning for Coverage-Directed Test Selection in
Simulation-Based Verification [0.0]
We introduce a novel method for automatic constraint extraction and test selection.
Coverage-directed test selection is based on supervised learning from coverage feedback.
We show how coverage-directed test selection can reduce manual constraint writing, prioritise effective tests, reduce verification resource consumption, and accelerate coverage closure on a large, real-life industrial hardware design.
arXiv Detail & Related papers (2022-05-17T17:49:30Z) - Machine Learning Testing in an ADAS Case Study Using
Simulation-Integrated Bio-Inspired Search-Based Testing [7.5828169434922]
Deeper generates failure-revealing test scenarios for testing a deep neural network-based lane-keeping system.
In the newly proposed version, we utilize a new set of bio-inspired search algorithms, genetic algorithm (GA), $(mu+lambda)$ and $(mu,lambda)$ evolution strategies (ES), and particle swarm optimization (PSO)
Our evaluation shows the newly proposed test generators in Deeper represent a considerable improvement on the previous version.
arXiv Detail & Related papers (2022-03-22T20:27:40Z) - Local policy search with Bayesian optimization [73.0364959221845]
Reinforcement learning aims to find an optimal policy by interaction with an environment.
Policy gradients for local search are often obtained from random perturbations.
We develop an algorithm utilizing a probabilistic model of the objective function and its gradient.
arXiv Detail & Related papers (2021-06-22T16:07:02Z) - TestRank: Bringing Order into Unlabeled Test Instances for Deep Learning
Tasks [14.547623982073475]
Deep learning systems are notoriously difficult to test and debug.
It is essential to conduct test selection and label only those selected "high quality" bug-revealing test inputs for test cost reduction.
We propose a novel test prioritization technique that brings order into the unlabeled test instances according to their bug-revealing capabilities, namely TestRank.
arXiv Detail & Related papers (2021-05-21T03:41:10Z) - Adaptive Sampling for Best Policy Identification in Markov Decision
Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z) - Bloom Origami Assays: Practical Group Testing [90.2899558237778]
Group testing is a well-studied problem with several appealing solutions.
Recent biological studies impose practical constraints for COVID-19 that are incompatible with traditional methods.
We develop a new method combining Bloom filters with belief propagation to scale to larger values of n (more than 100) with good empirical results.
arXiv Detail & Related papers (2020-07-21T19:31:41Z) - Noisy Adaptive Group Testing using Bayesian Sequential Experimental
Design [63.48989885374238]
When the infection prevalence of a disease is low, Dorfman showed 80 years ago that testing groups of people can prove more efficient than testing people individually.
Our goal in this paper is to propose new group testing algorithms that can operate in a noisy setting.
arXiv Detail & Related papers (2020-04-26T23:41:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.