Swarm Behavior Cloning
- URL: http://arxiv.org/abs/2412.07617v1
- Date: Tue, 10 Dec 2024 15:54:57 GMT
- Title: Swarm Behavior Cloning
- Authors: Jonas Nüßlein, Maximilian Zorn, Philipp Altmann, Claudia Linnhoff-Popien
- Abstract summary: In sequential decision-making environments, the primary approaches for training agents are Reinforcement Learning (RL) and Imitation Learning (IL). This paper addresses the issue of increasing action differences -- the observation that discrepancies between the $N$ predicted actions grow in states that are underrepresented in the training data. We propose a method that fosters greater alignment among the policies while preserving the diversity of their computations.
- Score: 4.9854403800887415
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In sequential decision-making environments, the primary approaches for training agents are Reinforcement Learning (RL) and Imitation Learning (IL). Unlike RL, which relies on modeling a reward function, IL leverages expert demonstrations, where an expert policy $\pi_e$ (e.g., a human) provides the desired behavior. Formally, a dataset $D$ of state-action pairs is provided: $D = \{(s, a = \pi_e(s))\}$. A common technique within IL is Behavior Cloning (BC), where a policy $\pi(s) = a$ is learned through supervised learning on $D$. Further improvements can be achieved by using an ensemble of $N$ individually trained BC policies, denoted as $E = \{\pi_i(s)\}_{1 \leq i \leq N}$. The ensemble's action $a$ for a given state $s$ is the aggregated output of the $N$ actions: $a = \frac{1}{N} \sum_{i} \pi_i(s)$. This paper addresses the issue of increasing action differences -- the observation that discrepancies between the $N$ predicted actions grow in states that are underrepresented in the training data. Large action differences can result in suboptimal aggregated actions. To address this, we propose a method that fosters greater alignment among the policies while preserving the diversity of their computations. This approach reduces action differences and ensures that the ensemble retains its inherent strengths, such as robustness and varied decision-making. We evaluate our approach across eight diverse environments, demonstrating a notable decrease in action differences and significant improvements in overall performance, as measured by mean episode returns.
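The sketch below illustrates the setup the abstract defines: $N$ behavior-cloning policies trained independently on the same demonstration set $D$, mean-aggregation of their actions, and a per-state measure of how much the $N$ predictions disagree. The MLP architecture, training loop, and the mean pairwise L2 distance used as the disagreement metric are illustrative assumptions rather than details from the paper; the paper's proposed alignment method is not specified in this abstract and is not reproduced here.

```python
# Minimal sketch of ensemble Behavior Cloning: N policies trained on the
# same demonstrations D = {(s, a = pi_e(s))}, mean-aggregated at inference.
# Architecture, optimizer, and disagreement metric are illustrative
# assumptions, not details taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_policy(state_dim: int, action_dim: int) -> nn.Module:
    return nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                         nn.Linear(64, action_dim))

def train_bc(policy: nn.Module, states, actions, epochs=200, lr=1e-3):
    """Plain Behavior Cloning: supervised regression of a = pi(s) on D."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        F.mse_loss(policy(states), actions).backward()
        opt.step()
    return policy

@torch.no_grad()
def ensemble_action(policies, state):
    """Aggregated action a = (1/N) * sum_i pi_i(s)."""
    preds = torch.stack([pi(state) for pi in policies])  # (N, action_dim)
    return preds.mean(dim=0)

@torch.no_grad()
def action_difference(policies, state):
    """One way to quantify ensemble disagreement in a state: the mean
    pairwise L2 distance between the N predicted actions. The paper
    observes such disagreement growing in states underrepresented in D."""
    preds = torch.stack([pi(state) for pi in policies])  # (N, action_dim)
    return F.pdist(preds).mean()

# Toy usage with random stand-in demonstrations.
N, state_dim, action_dim = 5, 8, 2
states, actions = torch.randn(256, state_dim), torch.randn(256, action_dim)
ensemble = [train_bc(make_policy(state_dim, action_dim), states, actions)
            for _ in range(N)]
s = torch.randn(state_dim)  # a single query state
print("aggregated action:", ensemble_action(ensemble, s))
print("action difference:", action_difference(ensemble, s).item())
```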
Related papers
- Mean-Field Sampling for Cooperative Multi-Agent Reinforcement Learning [4.899818550820576]
We propose a new $\texttt{SUBSAMPLE-MFQ}$ ($\textbf{Subsample}$-$\textbf{M}$ean-$\textbf{F}$ield-$\textbf{Q}$-learning) algorithm and a decentralized randomized policy for a system with $n$ agents. We prove that this learned policy converges to the optimal policy on the order of $\tilde{O}(1/\sqrt{k})$ as the number of subsampled agents $k$ increases.
arXiv Detail & Related papers (2024-12-01T03:45:17Z)
- Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit Feedback and Unknown Transition [71.33787410075577]
We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses.
We propose a new algorithm that attains an $\widetilde{O}(d\sqrt{HS^3K} + \sqrt{HSAK})$ regret with high probability.
arXiv Detail & Related papers (2024-03-07T15:03:50Z)
- Reinforcement Learning in a Birth and Death Process: Breaking the Dependence on the State Space [0.0]
We revisit the regret of undiscounted reinforcement learning in MDPs with a birth and death structure.
In our main result, we show that the regret of a slightly-tweaked version of the classical learning algorithm $\textsc{Ucrl2}$ is in fact upper bounded by $\tilde{\mathcal{O}}(\sqrt{EAT})$, where $E$ is related to the weighted second moment of the stationary measure of a reference policy.
arXiv Detail & Related papers (2023-02-21T13:28:37Z)
- Multi-Task Imitation Learning for Linear Dynamical Systems [50.124394757116605]
We study representation learning for efficient imitation learning over linear systems.
We find that the imitation gap over trajectories generated by the learned target policy is bounded by $\tilde{O}\left( \frac{k n_x H}{N_{\mathrm{shared}}} + \frac{k n_u}{N_{\mathrm{target}}} \right)$.
arXiv Detail & Related papers (2022-12-01T00:14:35Z)
- Tractable Optimality in Episodic Latent MABs [75.17357040707347]
We consider a multi-armed bandit problem with $M$ latent contexts, where an agent interacts with the environment for an episode of $H$ time steps.
Depending on the length of the episode, the learner may not be able to estimate accurately the latent context.
We design a procedure that provably learns a near-optimal policy with $O(\texttt{poly}(A) + \texttt{poly}(M,H)^{\min(M,H)})$ interactions.
arXiv Detail & Related papers (2022-10-05T22:53:46Z)
- Reward-Free Model-Based Reinforcement Learning with Linear Function Approximation [92.99933928528797]
We study model-based reward-free reinforcement learning with linear function approximation for episodic Markov decision processes (MDPs).
In the planning phase, the agent is given a specific reward function and uses samples collected from the exploration phase to learn a good policy.
We show that to obtain an $\epsilon$-optimal policy for an arbitrary reward function, UCRL-RFE needs to sample at most $\tilde{O}(H^4 d (H + d) \epsilon^{-2})$ episodes.
arXiv Detail & Related papers (2021-10-12T23:03:58Z)
- Fine-Grained Gap-Dependent Bounds for Tabular MDPs via Adaptive Multi-Step Bootstrap [84.66885506098724]
This paper presents a new model-free algorithm for episodic finite-horizon Markov Decision Processes (MDPs), Adaptive Multi-step Bootstrap (AMB).
We show AMB achieves a gap-dependent regret bound that only scales with the sum of the inverse of the sub-optimality gaps.
We also show AMB suffers an additional $\frac{|Z_{mul}|}{\Delta_{min}}$ regret, where $Z_{mul}$ is the set of state-action pairs $(s,a)$ satisfying that $a$ is a non-unique optimal action for $s$.
arXiv Detail & Related papers (2021-02-09T07:46:34Z)
- An Analysis of Frame-skipping in Reinforcement Learning [13.680685626360903]
On many Atari console games, reinforcement learning algorithms deliver substantially better policies when run with a frame-skip of $d > 1$.
We focus on "action-repetition", the common restriction of this choice to $d$-length sequences of the same action.
We show that this loss may be offset by the gain brought to learning by a smaller task horizon.
arXiv Detail & Related papers (2021-02-07T04:59:09Z)
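To make "action-repetition" concrete, here is a minimal environment wrapper that applies each chosen action for $d$ consecutive steps and accumulates the reward. The Gym-style `step()`/`reset()` interface is an illustrative assumption, not something the snippet above specifies.

```python
# Sketch of action-repetition: each selected action is applied for d
# consecutive environment steps. A Gym-style step()/reset() interface
# is assumed here for illustration; it is not taken from the paper.
class ActionRepeat:
    def __init__(self, env, d: int):
        assert d >= 1
        self.env, self.d = env, d

    def reset(self):
        return self.env.reset()

    def step(self, action):
        total_reward, obs, done, info = 0.0, None, False, {}
        for _ in range(self.d):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:  # stop repeating once the episode ends
                break
        return obs, total_reward, done, info
```

Repeating actions shortens the effective decision horizon by roughly a factor of $d$, which is the gain from a smaller task horizon mentioned above.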
- Improved Sample Complexity for Incremental Autonomous Exploration in MDPs [132.88757893161699]
We learn the set of $\epsilon$-optimal goal-conditioned policies attaining all states that are incrementally reachable within $L$ steps.
DisCo is the first algorithm that can return an $\epsilon/c_{\min}$-optimal policy for any cost-sensitive shortest-path problem.
arXiv Detail & Related papers (2020-12-29T14:06:09Z)
- Improving Robustness and Generality of NLP Models Using Disentangled Representations [62.08794500431367]
Supervised neural networks first map an input $x$ to a single representation $z$, and then map $z$ to the output label $y$.
We present methods to improve robustness and generality of NLP models from the standpoint of disentangled representation learning.
We show that models trained with the proposed criteria provide better robustness and domain adaptation ability in a wide range of supervised learning tasks.
arXiv Detail & Related papers (2020-09-21T02:48:46Z)
- A No-Free-Lunch Theorem for MultiTask Learning [19.645741778058227]
We consider a seemingly favorable classification scenario where all tasks $P_t$ share a common optimal classifier $h^*$.
We show that, even though such regimes admit minimax rates accounting for both $n$ and $N$, no adaptive algorithm exists.
arXiv Detail & Related papers (2020-06-29T03:03:29Z)