Evaluating the Robustness of Collaborative Agents
- URL: http://arxiv.org/abs/2101.05507v1
- Date: Thu, 14 Jan 2021 09:02:45 GMT
- Title: Evaluating the Robustness of Collaborative Agents
- Authors: Paul Knott, Micah Carroll, Sam Devlin, Kamil Ciosek, Katja Hofmann, A. D. Dragan and Rohin Shah
- Abstract summary: We take inspiration from the practice of unit testing in software engineering.
We apply this methodology to build a suite of unit tests for the Overcooked-AI environment.
- Score: 25.578427956101603
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In order for agents trained by deep reinforcement learning to work alongside
humans in realistic settings, we will need to ensure that the agents are
\emph{robust}. Since the real world is very diverse, and human behavior often
changes in response to agent deployment, the agent will likely encounter novel
situations that have never been seen during training. This results in an
evaluation challenge: if we cannot rely on the average training or validation
reward as a metric, then how can we effectively evaluate robustness? We take
inspiration from the practice of \emph{unit testing} in software engineering.
Specifically, we suggest that when designing AI agents that collaborate with
humans, designers should search for potential edge cases in \emph{possible
partner behavior} and \emph{possible states encountered}, and write tests which
check that the behavior of the agent in these edge cases is reasonable. We
apply this methodology to build a suite of unit tests for the Overcooked-AI
environment, and use this test suite to evaluate three proposals for improving
robustness. We find that the test suite provides significant insight into the
effects of these proposals that were generally not revealed by looking solely
at the average validation reward.
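Below is a minimal sketch of what such a behavioural unit test could look like, assuming the trained agent is exposed as a black-box policy mapping states to actions. All names here (EdgeCaseTest, run_suite, the toy kitchen state and action strings) are illustrative placeholders, not the actual Overcooked-AI API or the paper's test suite.

```python
# Hypothetical sketch of an edge-case "unit test" for a collaborative agent.
# The paper's idea: hand-design edge cases in partner behaviour and in
# encountered states, then check the agent's response is reasonable.

from dataclasses import dataclass
from typing import Callable, Sequence

# Treat the trained agent as a black-box policy: state -> action.
Policy = Callable[[dict], str]


@dataclass
class EdgeCaseTest:
    """One hand-designed edge case: a starting state (with the partner's
    unusual behaviour encoded in it) and the set of agent actions that
    count as 'reasonable' in that situation."""
    name: str
    state: dict
    reasonable_actions: Sequence[str]

    def run(self, policy: Policy) -> bool:
        return policy(self.state) in self.reasonable_actions


def run_suite(policy: Policy, tests: Sequence[EdgeCaseTest]) -> float:
    """Fraction of edge-case tests passed: a robustness signal that
    complements average training or validation reward."""
    passed = sum(test.run(policy) for test in tests)
    return passed / len(tests)


if __name__ == "__main__":
    # Toy example: the partner is blocking the only pot, so a robust agent
    # should do something useful instead of walking into the partner.
    blocked_pot = EdgeCaseTest(
        name="partner_blocks_pot",
        state={"partner_pos": "pot", "agent_holding": None},
        reasonable_actions=("go_to_onion_dispenser", "wait"),
    )

    def naive_policy(state: dict) -> str:
        # Stand-in for a trained policy; always heads for the pot.
        return "go_to_pot"

    print(f"edge-case pass rate: {run_suite(naive_policy, [blocked_pot]):.2f}")
```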
Related papers
- Training on the Test Task Confounds Evaluation and Emergence [16.32378359459614]
We show that training on the test task confounds both relative model evaluations and claims about emergent capabilities.
We propose an effective method to adjust for training on the test task by fine-tuning each model under comparison on the same task-relevant data before evaluation.
arXiv Detail & Related papers (2024-07-10T17:57:58Z)
- ProAgent: Building Proactive Cooperative Agents with Large Language Models [89.53040828210945]
ProAgent is a novel framework that harnesses large language models to create proactive agents.
ProAgent can analyze the present state, and infer the intentions of teammates from observations.
ProAgent exhibits a high degree of modularity and interpretability, making it easily integrated into various coordination scenarios.
arXiv Detail & Related papers (2023-08-22T10:36:56Z)
- From Adversarial Arms Race to Model-centric Evaluation: Motivating a Unified Automatic Robustness Evaluation Framework [91.94389491920309]
Textual adversarial attacks can discover models' weaknesses by adding semantic-preserved but misleading perturbations to the inputs.
The existing practice of robustness evaluation may exhibit issues of incomprehensive evaluation, impractical evaluation protocol, and invalid adversarial samples.
We set up a unified automatic robustness evaluation framework, shifting towards model-centric evaluation to exploit the advantages of adversarial attacks.
arXiv Detail & Related papers (2023-05-29T14:55:20Z)
- Can Agents Run Relay Race with Strangers? Generalization of RL to Out-of-Distribution Trajectories [88.08381083207449]
We show the prevalence of generalization failure on controllable states from stranger agents.
We propose a novel method called Self-Trajectory Augmentation (STA), which resets the environment to the agent's old states according to the Q function during training.
arXiv Detail & Related papers (2023-04-26T10:12:12Z)
- Automatic Evaluation of Excavator Operators using Learned Reward Functions [5.372817906484557]
We propose a novel strategy for the automatic evaluation of excavator operators.
We take into account the internal dynamics of the excavator and the safety criterion at every time step to evaluate the performance.
Our results demonstrate that a policy learned using these external reward prediction models yields safer solutions.
arXiv Detail & Related papers (2022-11-15T06:58:00Z)
- Scalable Evaluation of Multi-Agent Reinforcement Learning with Melting Pot [71.28884625011987]
Melting Pot is a MARL evaluation suite that uses reinforcement learning to reduce the human labor required to create novel test scenarios.
We have created over 80 unique test scenarios covering a broad range of research topics.
We apply these test scenarios to standard MARL training algorithms, and demonstrate how Melting Pot reveals weaknesses not apparent from training performance alone.
arXiv Detail & Related papers (2021-07-14T17:22:14Z)
- Catch Me if I Can: Detecting Strategic Behaviour in Peer Assessment [61.24399136715106]
We consider the issue of strategic behaviour in various peer-assessment tasks, including peer grading of exams or homeworks and peer review in hiring or promotions.
Our focus is on designing methods for detection of such manipulations.
Specifically, we consider a setting in which agents evaluate a subset of their peers and output rankings that are later aggregated to form a final ordering.
arXiv Detail & Related papers (2020-10-08T15:08:40Z)
- Robust Deep Reinforcement Learning through Adversarial Loss [74.20501663956604]
Recent studies have shown that deep reinforcement learning agents are vulnerable to small adversarial perturbations on the agent's inputs.
We propose RADIAL-RL, a principled framework to train reinforcement learning agents with improved robustness against adversarial attacks.
arXiv Detail & Related papers (2020-08-05T07:49:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.