Hierarchical Reward Design from Language: Enhancing Alignment of Agent Behavior with Human Specifications
- URL: http://arxiv.org/abs/2602.18582v1
- Date: Fri, 20 Feb 2026 19:41:17 GMT
- Title: Hierarchical Reward Design from Language: Enhancing Alignment of Agent Behavior with Human Specifications
- Authors: Zhiqin Qian, Ryan Diaz, Sangwon Seo, Vaibhav Unhelkar
- Abstract summary: We introduce Hierarchical Reward Design from Language (HRDL) to encode richer behavioral specifications for hierarchical reinforcement learning agents. Experiments show that AI agents trained with rewards designed via L2HR not only complete tasks effectively but also better adhere to human specifications.
- Score: 4.724825031148412
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: When training artificial intelligence (AI) to perform tasks, humans often care not only about whether a task is completed but also how it is performed. As AI agents tackle increasingly complex tasks, aligning their behavior with human-provided specifications becomes critical for responsible AI deployment. Reward design provides a direct channel for such alignment by translating human expectations into reward functions that guide reinforcement learning (RL). However, existing methods are often too limited to capture nuanced human preferences that arise in long-horizon tasks. Hence, we introduce Hierarchical Reward Design from Language (HRDL): a problem formulation that extends classical reward design to encode richer behavioral specifications for hierarchical RL agents. We further propose Language to Hierarchical Rewards (L2HR) as a solution to HRDL. Experiments show that AI agents trained with rewards designed via L2HR not only complete tasks effectively but also better adhere to human specifications. Together, HRDL and L2HR advance the research on human-aligned AI agents.
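The abstract does not include code, but its core idea, rewarding not just whether a task is completed but also how each subtask is performed, can be illustrated with a minimal sketch. Everything below (the class names, the subtask decomposition, the weighting scheme, and the example task) is an illustrative assumption, not the L2HR implementation described by the authors.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical state interface: a flat dict of named features.
# Real hierarchical RL code would use the environment's own observations and options.
State = Dict[str, float]

@dataclass
class SubtaskReward:
    """One reward term tied to a single subtask (option) in the hierarchy."""
    name: str
    completion: Callable[[State], float]      # rewards finishing the subtask
    specification: Callable[[State], float]   # rewards *how* it is performed
    spec_weight: float = 0.5

    def __call__(self, state: State) -> float:
        return self.completion(state) + self.spec_weight * self.specification(state)

def hierarchical_reward(subtasks: Dict[str, SubtaskReward],
                        active_subtask: str,
                        state: State,
                        task_done: bool,
                        task_bonus: float = 10.0) -> float:
    """Combine the active subtask's shaped reward with a top-level completion bonus.

    Low-level terms encode per-subtask behavioral specifications (the "how");
    the high-level term still rewards overall task completion (the "whether").
    """
    return subtasks[active_subtask](state) + (task_bonus if task_done else 0.0)

# Example: a fetch task where the human also cares that the agent moves
# slowly near fragile objects: a behavioral specification, not a goal.
subtasks = {
    "reach_cup": SubtaskReward(
        name="reach_cup",
        completion=lambda s: 1.0 if s.get("cup_grasped", 0.0) > 0.5 else 0.0,
        specification=lambda s: -s.get("speed_near_fragile", 0.0),
    ),
}
print(hierarchical_reward(subtasks, "reach_cup",
                          {"cup_grasped": 1.0, "speed_near_fragile": 0.2},
                          task_done=True))  # 10.9
```

In a language-driven pipeline, the `completion` and `specification` terms and their weights would be derived from natural-language specifications rather than hand-written as above.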
Related papers
- Deep Reinforcement Learning Agents are not even close to Human Intelligence [25.836584192349907]
Deep reinforcement learning (RL) agents achieve impressive results in a wide variety of tasks, but they lack zero-shot adaptation capabilities. We introduce HackAtari, a set of task variations of the Arcade Learning Environments. We use it to demonstrate that, contrary to humans, RL agents systematically exhibit huge performance drops on simpler versions of their training tasks.
arXiv Detail & Related papers (2025-05-27T20:21:46Z) - LAMeTA: Intent-Aware Agentic Network Optimization via a Large AI Model-Empowered Two-Stage Approach [68.198383438396]
We present LAMeTA, a Large AI Model (LAM)-empowered Two-stage Approach for intent-aware agentic network optimization. First, we propose Intent-oriented Knowledge Distillation (IoKD), which efficiently distills intent-understanding capabilities. Second, we develop Symbiotic Reinforcement Learning (SRL), integrating E-LAMs with a policy-based DRL framework.
arXiv Detail & Related papers (2025-05-18T05:59:16Z) - Modeling AI-Human Collaboration as a Multi-Agent Adaptation [0.0]
We develop an agent-based simulation to formalize AI-human collaboration as a function of the task. We show that in modular tasks, AI often substitutes for humans, delivering higher payoffs unless human expertise is very high. We also show that even "hallucinatory" AI, lacking memory or structure, can improve outcomes when augmenting low-capability humans by helping them escape local optima.
arXiv Detail & Related papers (2025-04-29T16:19:53Z) - Direct Advantage Regression: Aligning LLMs with Online AI Reward [59.78549819431632]
Online AI Feedback (OAIF) presents a promising alternative to Reinforcement Learning from Human Feedback (RLHF). We propose Direct Advantage Regression (DAR) to optimize policy improvement through weighted supervised fine-tuning. Our empirical results underscore that AI reward is a better form of AI supervision than AI preference, consistently achieving higher human-AI agreement.
arXiv Detail & Related papers (2025-04-19T04:44:32Z) - REvolve: Reward Evolution with Large Language Models using Human Feedback [6.4550546442058225]
Large language models (LLMs) have been used for reward generation from natural language task descriptions. LLMs, guided by human feedback, can be used to formulate reward functions that reflect human implicit knowledge. We introduce REvolve, a truly evolutionary framework that uses LLMs for reward design in reinforcement learning (a generic sketch of this propose-and-refine loop appears after this list).
arXiv Detail & Related papers (2024-06-03T13:23:27Z) - REBEL: Reward Regularization-Based Approach for Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and human preferences can lead to catastrophic outcomes in the real world. Recent methods aim to mitigate misalignment by learning reward functions from human preferences. We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z) - Exploration with Principles for Diverse AI Supervision [88.61687950039662]
Training large transformers using next-token prediction has given rise to groundbreaking advancements in AI.
While this generative AI approach has produced impressive results, it heavily leans on human supervision.
This strong reliance on human oversight poses a significant hurdle to the advancement of AI innovation.
We propose a novel paradigm termed Exploratory AI (EAI) aimed at autonomously generating high-quality training data.
arXiv Detail & Related papers (2023-10-13T07:03:39Z) - SALMON: Self-Alignment with Instructable Reward Models [80.83323636730341]
This paper presents a novel approach, namely SALMON, to align base language models with minimal human supervision.
We develop an AI assistant named Dromedary-2 with only 6 exemplars for in-context learning and 31 human-defined principles.
arXiv Detail & Related papers (2023-10-09T17:56:53Z) - Int-HRL: Towards Intention-based Hierarchical Reinforcement Learning [23.062590084580542]
Int-HRL: Hierarchical RL with intention-based sub-goals that are inferred from human eye gaze.
Our evaluations show that replacing hand-crafted sub-goals with automatically extracted intentions leads to an HRL agent that is significantly more sample efficient than previous methods.
arXiv Detail & Related papers (2023-06-20T12:12:16Z) - Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision [84.31474052176343]
Recent AI-assistant agents, such as ChatGPT, rely on supervised fine-tuning (SFT) with human annotations and reinforcement learning from human feedback to align the output with human intentions.
This dependence can significantly constrain the true potential of AI-assistant agents due to the high cost of obtaining human supervision.
We propose a novel approach called SELF-ALIGN, which combines principle-driven reasoning and the generative power of LLMs for the self-alignment of AI agents with minimal human supervision.
arXiv Detail & Related papers (2023-05-04T17:59:28Z) - Language Instructed Reinforcement Learning for Human-AI Coordination [23.694362407434753]
We propose a novel framework, instructRL, that enables humans to specify what kind of strategies they expect from their AI partners through natural language instructions.
We show that instructRL converges to human-like policies that satisfy the given instructions in a proof-of-concept environment and the challenging Hanabi benchmark.
arXiv Detail & Related papers (2023-04-13T04:47:31Z)
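Several of the related papers above (REvolve, instructRL) share a common pattern: candidate reward functions are generated from natural-language instructions and then refined using human feedback. The loop below is a heavily simplified, generic sketch of that pattern; `propose_reward_code` and `human_preference` are stubs standing in for an LLM call and a human comparison, and do not correspond to any specific system's API.

```python
import random
from typing import List, Optional

def propose_reward_code(instruction: str, parent: Optional[str] = None) -> str:
    """Stub for an LLM call that writes (or mutates) a reward function.

    A real system would prompt a language model with the task description,
    the human instruction, and optionally a parent reward to mutate.
    """
    scale = random.uniform(0.1, 2.0)
    return f"lambda progress, violations: {scale:.2f} * progress - violations"

def human_preference(score_a: float, score_b: float) -> int:
    """Stub for human feedback: 0 if the first behavior is preferred, else 1.

    In practice a person compares trajectories or videos, not scalar scores.
    """
    return 0 if score_a >= score_b else 1

def evolve_reward(instruction: str, generations: int = 3, population: int = 4) -> str:
    """Propose candidate rewards, evaluate them, and keep the human-preferred one."""
    best = propose_reward_code(instruction)
    for _ in range(generations):
        candidates: List[str] = [propose_reward_code(instruction, parent=best)
                                 for _ in range(population)]
        for cand in candidates:
            # Placeholder evaluation: a real pipeline trains an RL agent with each
            # candidate reward and collects rollouts for humans to compare.
            score_best, score_cand = random.random(), random.random()
            if human_preference(score_best, score_cand) == 1:
                best = cand
    return best

if __name__ == "__main__":
    print(evolve_reward("Drive quickly, but stay in the lane and avoid jerky steering."))
```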