A Review for Deep Reinforcement Learning in Atari: Benchmarks, Challenges, and Solutions
- URL: http://arxiv.org/abs/2112.04145v2
- Date: Fri, 10 Dec 2021 14:48:34 GMT
- Title: A Review for Deep Reinforcement Learning in Atari: Benchmarks, Challenges, and Solutions
- Authors: Jiajun Fan
- Abstract summary: The Arcade Learning Environment (ALE) was proposed as an evaluation platform for empirically assessing the generality of agents across Atari 2600 games.
From Deep Q-Networks (DQN) to Agent57, RL agents seem to achieve superhuman performance in ALE.
We propose a novel Atari benchmark based on human world records (HWR), which sets a higher bar for RL agents in both final performance and learning efficiency.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Arcade Learning Environment (ALE) was proposed as an evaluation platform
for empirically assessing the generality of agents across dozens of Atari 2600
games. ALE offers various challenging problems and has drawn significant
attention from the deep reinforcement learning (RL) community. From Deep
Q-Networks (DQN) to Agent57, RL agents seem to achieve superhuman performance
in ALE. However, is this the case? In this paper, to explore this problem, we
first review the current evaluation metrics in the Atari benchmarks and then
reveal that the current criteria for claiming superhuman performance are
inappropriate, as they underestimate human performance relative to what is
possible. To address those problems and promote the development of RL
research, we propose a novel Atari benchmark based on human world records
(HWR), which sets a higher bar for RL agents in both final performance and
learning efficiency. Furthermore, we summarize the
state-of-the-art (SOTA) methods in Atari benchmarks and provide benchmark
results over the new evaluation metrics based on human world records. From
those benchmark results, we conclude that at least four open challenges hinder
RL agents from achieving superhuman performance. Finally, we also discuss some
promising ways to handle those problems.
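To make the proposed metric concrete, here is a minimal sketch, in Python, contrasting the widely used human-normalized score (HNS) with a score normalized by the human world record; all numbers are illustrative, not values from the paper.

```python
# Minimal sketch contrasting the usual human-normalized score (HNS) with a
# score normalized by the human world record (HWR). All numbers below are
# illustrative, not values from the paper.

def normalized_score(agent: float, random: float, reference: float) -> float:
    """1.0 means the agent exactly matches the reference player."""
    return (agent - random) / (reference - random)

# Hypothetical per-game scores.
random_score = 200.0        # random policy
avg_human_score = 7_000.0   # average human tester (DQN-era baseline)
world_record = 1_000_000.0  # human world record for the same game
agent_score = 50_000.0

hns = normalized_score(agent_score, random_score, avg_human_score)
hwr = normalized_score(agent_score, random_score, world_record)
print(f"HNS: {hns:.2f} -> 'superhuman' under the old criterion: {hns > 1.0}")
print(f"HWR-normalized: {hwr:.3f} -> superhuman under HWR: {hwr > 1.0}")
```

With these illustrative numbers the agent clears the average-human bar roughly seven times over while reaching under 5% of the world record, which is precisely the gap the HWR benchmark is designed to expose.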
Related papers
- SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories [55.161075901665946]
SUPER aims to capture the realistic challenges faced by researchers working with Machine Learning (ML) and Natural Language Processing (NLP) research repositories.
Our benchmark comprises three distinct problem sets: 45 end-to-end problems with annotated expert solutions, 152 sub-problems derived from the expert set that focus on specific challenges, and 602 automatically generated problems for larger-scale development.
We show that state-of-the-art approaches struggle to solve these problems, with the best model (GPT-4o) solving only 16.3% of the end-to-end problems and 46.1% of the scenarios.
arXiv Detail & Related papers (2024-09-11T17:37:48Z) - H-ARC: A Robust Estimate of Human Performance on the Abstraction and Reasoning Corpus Benchmark [7.840781070208872]
Since 2019, existing artificial intelligence methods have made only limited progress on the challenge.
Previous work explored how well humans can solve tasks from the ARC benchmark.
We obtain a more robust estimate of human performance by evaluating 1729 humans on the full set of 400 training and 400 evaluation tasks.
arXiv Detail & Related papers (2024-09-02T17:11:32Z) - Leveraging Reward Consistency for Interpretable Feature Discovery in Reinforcement Learning [69.19840497497503]
It is argued that the commonly used action-matching principle amounts to explaining the underlying deep neural networks (DNNs) rather than interpreting the behavior of RL agents.
We propose to take rewards, the essential objective of RL agents, as the objective of interpretation as well.
We verify and evaluate our method on the Atari 2600 games as well as Duckietown, a challenging self-driving car simulator environment.
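As a rough illustration of reward-oriented interpretation (a hedged sketch, not the authors' actual method), one can attribute an agent's behavior to input pixels through the gradient of its predicted return rather than its action logits; the QNetwork below and all shapes are assumptions.

```python
# Hypothetical sketch: attribute an agent's behavior to input pixels through
# the gradient of its predicted return (a Q-value) rather than its action
# logits. The QNetwork architecture and shapes are illustrative assumptions,
# not the paper's actual model.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, n_actions: int = 6):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.LazyLinear(n_actions)  # infers input width on first call

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.body(x))

q_net = QNetwork()
frames = torch.rand(1, 4, 84, 84, requires_grad=True)  # stacked Atari frames

q_values = q_net(frames)
greedy = q_values.argmax(dim=1).item()
# Reward-oriented attribution: backpropagate the chosen action's return
# estimate (not the policy logits) to the input pixels.
q_values[0, greedy].backward()
saliency = frames.grad.abs().amax(dim=1)  # per-pixel influence on the Q-value
print(saliency.shape)  # torch.Size([1, 84, 84])
```

Anchoring the attribution in the return estimate keeps the explanation tied to rewards, in the spirit of the summary above, whereas action matching compares against the network's outputs alone.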
arXiv Detail & Related papers (2023-09-04T09:09:54Z) - ARB: Advanced Reasoning Benchmark for Large Language Models [94.37521840642141]
We introduce ARB, a novel benchmark composed of advanced reasoning problems in multiple fields.
As a subset of ARB, we introduce a challenging set of math and physics problems which require advanced symbolic reasoning and domain knowledge.
We evaluate recent models such as GPT-4 and Claude on ARB and demonstrate that current models score well below 50% on more demanding tasks.
arXiv Detail & Related papers (2023-07-25T17:55:19Z) - Int-HRL: Towards Intention-based Hierarchical Reinforcement Learning [23.062590084580542]
Int-HRL is a hierarchical RL method with intention-based sub-goals inferred from human eye gaze.
Our evaluations show that replacing hand-crafted sub-goals with automatically extracted intentions yields an HRL agent that is significantly more sample-efficient than previous methods.
arXiv Detail & Related papers (2023-06-20T12:12:16Z) - Mastering the Unsupervised Reinforcement Learning Benchmark from Pixels [112.63440666617494]
Reinforcement learning algorithms can succeed but require a large number of interactions between the agent and the environment.
We propose a new method that uses unsupervised model-based RL to pre-train the agent.
We show robust performance on the Real-World RL benchmark, hinting at resiliency to environment perturbations during adaptation.
arXiv Detail & Related papers (2022-09-24T14:22:29Z) - Mask Atari for Deep Reinforcement Learning as POMDP Benchmarks [3.549772411359722]
Mask Atari is a new benchmark to help solve partially observable Markov decision process (POMDP) problems.
It is constructed based on Atari 2600 games with controllable, moveable, and learnable masks as the observation area.
We describe the challenges and features of our benchmark and evaluate several baselines with Mask Atari.
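A toy sketch of the masking idea, assuming a fixed square window for simplicity (in Mask Atari the mask is additionally controllable, moveable, and learnable):

```python
# Toy sketch of the masking idea: the agent observes only a square window of
# the full frame, turning the task into a POMDP. Window size, position, and
# frame shape are illustrative assumptions.
import numpy as np

def mask_observation(frame: np.ndarray, center: tuple, size: int = 32) -> np.ndarray:
    """Zero out everything outside a size x size window around `center`."""
    h, w = frame.shape[:2]
    cy, cx = center
    y0, y1 = max(0, cy - size // 2), min(h, cy + size // 2)
    x0, x1 = max(0, cx - size // 2), min(w, cx + size // 2)
    masked = np.zeros_like(frame)
    masked[y0:y1, x0:x1] = frame[y0:y1, x0:x1]
    return masked

full_frame = np.random.randint(0, 256, size=(210, 160), dtype=np.uint8)  # Atari frame
partial = mask_observation(full_frame, center=(105, 80))
assert partial.shape == full_frame.shape  # same shape, but mostly zeroed out
```

Making the window position part of the agent's action or learning problem is what gives the benchmark its controllable observation area.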
arXiv Detail & Related papers (2022-03-31T03:34:02Z) - Mastering Atari Games with Limited Data [73.6189496825209]
We propose a sample efficient model-based visual RL algorithm built on MuZero, which we name EfficientZero.
Our method achieves 190.4% mean human performance on the Atari 100k benchmark with only two hours of real-time game experience.
This is the first time an algorithm has achieved superhuman performance on Atari games with so little data.
arXiv Detail & Related papers (2021-10-30T09:13:39Z) - Robust Deep Reinforcement Learning through Adversarial Loss [74.20501663956604]
Recent studies have shown that deep reinforcement learning agents are vulnerable to small adversarial perturbations on the agent's inputs.
We propose RADIAL-RL, a principled framework to train reinforcement learning agents with improved robustness against adversarial attacks.
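For intuition, here is a minimal FGSM-style sketch of the vulnerability itself; the stand-in network and the epsilon budget are illustrative assumptions, and RADIAL-RL's actual contribution is a training loss that defends against such attacks, not shown here.

```python
# Minimal FGSM-style sketch of the vulnerability: a tiny, bounded pixel
# perturbation aimed at flipping a Q-network's greedy action. The stand-in
# network and epsilon are illustrative assumptions; RADIAL-RL's defense
# (an adversarial training loss) is not shown.
import torch
import torch.nn as nn

q_net = nn.Sequential(  # stand-in Q-network over flattened frame stacks
    nn.Flatten(), nn.Linear(4 * 84 * 84, 256), nn.ReLU(), nn.Linear(256, 6)
)

obs = torch.rand(1, 4, 84, 84, requires_grad=True)
q_values = q_net(obs)
greedy = q_values.argmax(dim=1).item()

# FGSM: step the input so as to decrease the greedy action's Q-value, bounded
# by an L-infinity budget epsilon (one grey level here).
q_values[0, greedy].backward()
epsilon = 1.0 / 255.0
adv_obs = (obs - epsilon * obs.grad.sign()).clamp(0.0, 1.0)

# With a trained agent, even this small perturbation can change the action.
print("action before:", greedy, "after:", q_net(adv_obs).argmax(dim=1).item())
```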
arXiv Detail & Related papers (2020-08-05T07:49:42Z) - Agent57: Outperforming the Atari Human Benchmark [15.75730239983062]
Atari games have been a long-standing benchmark in reinforcement learning.
We propose Agent57, the first deep RL agent that outperforms the standard human benchmark on all 57 Atari games.
arXiv Detail & Related papers (2020-03-30T11:33:16Z)