PAIL: Performance based Adversarial Imitation Learning Engine for Carbon Neutral Optimization
- URL: http://arxiv.org/abs/2407.08910v1
- Date: Fri, 12 Jul 2024 01:06:01 GMT
- Title: PAIL: Performance based Adversarial Imitation Learning Engine for Carbon Neutral Optimization
- Authors: Yuyang Ye, Lu-An Tang, Haoyu Wang, Runlong Yu, Wenchao Yu, Erhu He, Haifeng Chen, Hui Xiong
- Abstract summary: Existing Deep Reinforcement Learning (DRL) methods need a pre-defined reward function to assess the impact of each action on the final sustainable development goals.
This study proposes a Performance based Adversarial Imitation Learning (PAIL) engine.
It is a novel method to acquire optimal operational policies for carbon neutrality without any pre-defined action rewards.
- Score: 42.9492993819955
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Achieving carbon neutrality within industrial operations has become increasingly imperative for sustainable development. It is both a significant challenge and a key opportunity for operational optimization in Industry 4.0. In recent years, Deep Reinforcement Learning (DRL) based methods have offered promising enhancements for sequential optimization processes and can be used to reduce carbon emissions. However, existing DRL methods need a pre-defined reward function to assess the impact of each action on the final sustainable development goals (SDG). In many real applications, such a reward function cannot be given in advance. To address the problem, this study proposes a Performance based Adversarial Imitation Learning (PAIL) engine. It is a novel method to acquire optimal operational policies for carbon neutrality without any pre-defined action rewards. Specifically, PAIL employs a Transformer-based policy generator to encode historical information and predict subsequent actions within a multi-dimensional space. The entire action sequence is iteratively updated through an environmental simulator. PAIL then uses a discriminator to minimize the discrepancy between generated sequences and real-world samples with high SDG performance. In parallel, a performance estimator based on a Q-learning framework is designed to estimate the impact of each action on the SDG. Based on these estimations, PAIL refines the generated policies with rewards from both the discriminator and the performance estimator. PAIL is evaluated on multiple real-world application cases and datasets. The experimental results demonstrate the effectiveness of PAIL compared with other state-of-the-art baselines. In addition, PAIL offers meaningful interpretability for carbon neutrality optimization.
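To make the reward design concrete, below is a minimal, illustrative sketch of how the two reward signals described in the abstract (the adversarial reward from the discriminator and the Q-learning based SDG performance estimate) could be blended when refining the policy. The weighting scheme and all function names are assumptions for illustration, not the authors' implementation.

```python
# Illustrative only: blend a GAIL-style discriminator reward with a
# Q-style estimate of an action's impact on the SDG. Names, shapes, and
# the weighting are assumptions, not taken from the PAIL paper.
import math

def discriminator_reward(d_score: float, eps: float = 1e-8) -> float:
    """Reward from a discriminator output D(s, a) in (0, 1): higher when the
    generated action looks like a high-SDG expert sample."""
    return -math.log(max(1.0 - d_score, eps))

def combined_reward(d_score: float, q_sdg: float, alpha: float = 0.5) -> float:
    """Blend the adversarial reward with the estimated SDG impact of the action."""
    return alpha * discriminator_reward(d_score) + (1.0 - alpha) * q_sdg

# Example: a generated action the discriminator finds fairly realistic
# (D = 0.7) and that the performance estimator scores at 0.4.
print(combined_reward(d_score=0.7, q_sdg=0.4))
```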
Related papers
- Learning Explainable Dense Reward Shapes via Bayesian Optimization [45.34810347865996]
We frame reward shaping as an optimization problem focused on token-level credit assignment.
We use explainability methods such as SHAP and LIME to estimate per-token rewards from the reward model.
Our experiments show that achieving a better balance of token-level reward attribution leads to performance improvements over baselines.
arXiv Detail & Related papers (2025-04-22T21:09:33Z)
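As a toy illustration of the token-level credit assignment idea in the entry above, the sketch below rescales raw per-token attribution scores (for instance, produced by SHAP or LIME over a learned reward model) so that they sum to the sequence-level reward. This is a simplified stand-in, not the paper's actual shaping procedure.

```python
# Hypothetical per-token credit assignment: distribute a scalar sequence
# reward over tokens in proportion to their attribution scores.
def shape_token_rewards(attributions, sequence_reward):
    total = sum(abs(a) for a in attributions)
    if total == 0:
        # Fall back to uniform credit when attributions carry no signal.
        return [sequence_reward / len(attributions)] * len(attributions)
    return [sequence_reward * abs(a) / total for a in attributions]

# Example: three tokens, the second one drives most of the reward.
print(shape_token_rewards([0.1, 0.7, 0.2], sequence_reward=1.0))
```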
- Review, Refine, Repeat: Understanding Iterative Decoding of AI Agents with Dynamic Evaluation and Selection [71.92083784393418]
Inference-time methods such as Best-of-N (BON) sampling offer a simple yet effective alternative to improve performance.
We propose Iterative Agent Decoding (IAD) which combines iterative refinement with dynamic candidate evaluation and selection guided by a verifier.
arXiv Detail & Related papers (2025-04-02T17:40:47Z)
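For reference, the Best-of-N sampling mentioned above can be sketched in a few lines: draw N candidates and keep the one a verifier scores highest. The `generate` and `verify` functions here are placeholders, not the paper's API.

```python
# Minimal Best-of-N (BoN) loop with placeholder generator and verifier.
import random

def generate(prompt: str) -> str:
    # Stand-in for sampling one candidate from an LLM or agent policy.
    return f"{prompt} -> candidate_{random.randint(0, 9999)}"

def verify(candidate: str) -> float:
    # Stand-in for a verifier / reward-model score.
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=verify)

print(best_of_n("solve: 2 + 2", n=8))
```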
- Mitigating Reward Over-Optimization in RLHF via Behavior-Supported Regularization [23.817251267022847]
We propose the Behavior-Supported Policy Optimization (BSPO) method to mitigate the reward over-optimization issue.
BSPO reduces the generation of OOD responses during the reinforcement learning process.
Empirical results show that BSPO outperforms baselines in preventing reward over-optimization.
arXiv Detail & Related papers (2025-03-23T16:20:59Z)
- SortingEnv: An Extendable RL-Environment for an Industrial Sorting Process [0.0]
We present a novel reinforcement learning (RL) environment designed to both optimize industrial sorting systems and study agent behavior in evolving spaces.
In simulating material flow within a sorting process, our environment follows the idea of a digital twin, with operational parameters such as belt speed and occupancy level.
It includes two variants: a basic version focusing on discrete belt speed adjustments and an advanced version introducing multiple sorting modes and enhanced material composition observations.
arXiv Detail & Related papers (2025-03-13T15:38:25Z)
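The sorting environment described above can be pictured as a small gym-style loop with a discrete belt-speed action and an occupancy observation. The dynamics and reward below are invented for illustration only and do not reproduce the published environment.

```python
# Toy digital-twin-style sorting line; all dynamics are illustrative.
import random

class ToySortingEnv:
    SPEEDS = [0.5, 1.0, 1.5]          # discrete belt-speed settings

    def reset(self):
        self.occupancy = 0.5           # fraction of the belt that is occupied
        self.steps = 0
        return self._obs()

    def step(self, action: int):
        speed = self.SPEEDS[action]
        inflow = random.uniform(0.0, 0.3)
        throughput = min(self.occupancy, 0.2 * speed)
        self.occupancy = min(1.0, self.occupancy + inflow - throughput)
        # Reward throughput, penalise congestion on the belt.
        reward = throughput - 0.5 * max(0.0, self.occupancy - 0.8)
        self.steps += 1
        done = self.steps >= 100
        return self._obs(), reward, done, {}

    def _obs(self):
        return {"occupancy": self.occupancy}

env = ToySortingEnv()
obs = env.reset()
obs, r, done, _ = env.step(1)
```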
- Offline Behavior Distillation [57.6900189406964]
Massive reinforcement learning (RL) data are typically collected to train policies offline without the need for interactions.
We formulate offline behavior distillation (OBD), which synthesizes limited expert behavioral data from sub-optimal RL data.
We propose two naive OBD objectives, DBC and PBC, which measure distillation performance via the decision difference between policies trained on distilled data and either offline data or a near-expert policy.
arXiv Detail & Related papers (2024-10-30T06:28:09Z)
- Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning [90.23629291067763]
A promising approach for improving reasoning in large language models is to use process reward models (PRMs).
PRMs provide feedback at each step of a multi-step reasoning trace, potentially improving credit assignment over outcome reward models (ORMs).
To improve a base policy by running search against a PRM or using it as dense rewards for reinforcement learning (RL), we ask: "How should we design process rewards?"
We theoretically characterize the set of good provers and our results show that optimizing process rewards from such provers improves exploration during test-time search and online RL.
arXiv Detail & Related papers (2024-10-10T17:31:23Z)
- Memory-Enhanced Neural Solvers for Efficient Adaptation in Combinatorial Optimization [6.713974813995327]
We present MEMENTO, an approach that leverages memory to improve the adaptation of neural solvers at inference time.
We successfully train all RL auto-regressive solvers on large instances, and show that MEMENTO can scale and is data-efficient.
Overall, MEMENTO pushes the state of the art on 11 out of 12 evaluated tasks.
arXiv Detail & Related papers (2024-06-24T08:18:19Z)
- Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
arXiv Detail & Related papers (2024-05-01T11:10:24Z)
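For the MCTS-based preference learning entry above, the step-level update relies on the standard Direct Preference Optimization (DPO) loss, which can be sketched as follows; variable names and the example numbers are illustrative, and log-probabilities from the current policy and a frozen reference policy are assumed to be available.

```python
# DPO loss on one (chosen, rejected) preference pair, e.g. a step-level pair.
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # Equivalent to -log(sigmoid(margin)).
    return math.log1p(math.exp(-margin))

# Example: the chosen step is slightly more likely under the current policy
# than under the reference, the rejected step slightly less likely.
print(dpo_loss(-2.0, -2.5, -2.2, -2.3))
```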
- Assessment of Reinforcement Learning Algorithms for Nuclear Power Plant Fuel Optimization [0.0]
This work presents a first-of-its-kind approach that uses deep RL to solve the loading pattern problem and could be leveraged for any engineering design optimization.
arXiv Detail & Related papers (2023-05-09T23:51:24Z)
- When Demonstrations Meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent.
Accurate models of expertise in executing a task have applications in safety-sensitive settings such as clinical decision making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z)
- Toward Efficient Automated Feature Engineering [27.47868891738917]
Automated Feature Engineering (AFE) refers to automatically generating and selecting optimal feature sets for downstream tasks.
Current AFE methods mainly focus on improving the effectiveness of the produced features, while ignoring the low-efficiency issue for large-scale deployment.
We construct the AFE pipeline in a reinforcement learning setting, where each feature is assigned an agent to perform feature transformation.
We conduct comprehensive experiments on 36 datasets in terms of both classification and regression tasks.
arXiv Detail & Related papers (2022-12-26T13:18:51Z)
- Reinforcement Learning in the Wild: Scalable RL Dispatching Algorithm Deployed in Ridehailing Marketplace [12.298997392937876]
This study proposes a real-time dispatching algorithm based on reinforcement learning.
It is deployed online in multiple cities under DiDi's operation for A/B testing and is launched in one of the major international markets.
The deployed algorithm shows over 1.3% improvement in total driver income from A/B testing.
arXiv Detail & Related papers (2022-02-10T16:07:17Z)
- On Computation and Generalization of Generative Adversarial Imitation Learning [134.17122587138897]
Generative Adversarial Imitation Learning (GAIL) is a powerful and practical approach for learning sequential decision-making policies.
This paper investigates the theoretical properties of GAIL.
arXiv Detail & Related papers (2020-01-09T00:40:19Z)
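Since PAIL follows this adversarial imitation setup, it may help to recall the standard GAIL minimax objective in its textbook form (not a detail taken from the paper above), where D is the discriminator, \pi_E the expert policy, and H a causal-entropy regularizer:

```latex
\min_{\pi}\ \max_{D}\quad
\mathbb{E}_{\pi}\big[\log D(s,a)\big]
+ \mathbb{E}_{\pi_E}\big[\log\big(1 - D(s,a)\big)\big]
- \lambda H(\pi)
```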
This list is automatically generated from the titles and abstracts of the papers on this site.