Guide Your Agent with Adaptive Multimodal Rewards
- URL: http://arxiv.org/abs/2309.10790v2
- Date: Wed, 25 Oct 2023 08:39:38 GMT
- Title: Guide Your Agent with Adaptive Multimodal Rewards
- Authors: Changyeon Kim, Younggyo Seo, Hao Liu, Lisa Lee, Jinwoo Shin, Honglak
Lee, Kimin Lee
- Abstract summary: This work presents Adaptive Return-conditioned Policy (ARP), an efficient framework to enhance the agent's generalization ability.
Our key idea is to calculate a similarity between visual observations and natural language instructions in the pre-trained multimodal embedding space.
Because the multimodal rewards provide adaptive signals at each timestep, ARP effectively mitigates goal misgeneralization.
- Score: 107.08768813632032
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Developing an agent capable of adapting to unseen environments remains a
difficult challenge in imitation learning. This work presents Adaptive
Return-conditioned Policy (ARP), an efficient framework designed to enhance the
agent's generalization ability using natural language task descriptions and
pre-trained multimodal encoders. Our key idea is to calculate a similarity
between visual observations and natural language instructions in the
pre-trained multimodal embedding space (such as CLIP) and use it as a reward
signal. We then train a return-conditioned policy using expert demonstrations
labeled with multimodal rewards. Because the multimodal rewards provide
adaptive signals at each timestep, ARP effectively mitigates goal
misgeneralization. This results in superior generalization performance even
when faced with unseen text instructions, compared to existing text-conditioned
policies. To improve the quality of the rewards, we also introduce a fine-tuning
method for the pre-trained multimodal encoders, further enhancing performance.
Video demonstrations and source code are available on the project website:
https://sites.google.com/view/2023arp.
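A minimal sketch of the labeling step the abstract describes, assuming the Hugging Face transformers CLIP API; the checkpoint, the helper names, and the undiscounted return are illustrative choices, not the authors' published implementation:

```python
# Sketch: score each frame of an expert trajectory by its CLIP similarity to
# the instruction, then label the demonstration with returns computed from
# those per-timestep rewards (the conditioning signal for a
# return-conditioned policy).
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def multimodal_rewards(frames, instruction):
    """Cosine similarity between each observation and the instruction,
    computed in CLIP's joint embedding space, used as a reward signal."""
    inputs = processor(text=[instruction], images=frames,
                       return_tensors="pt", padding=True)
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).squeeze(-1)  # shape: (T,)

def returns_to_go(rewards, gamma=1.0):
    """Label each timestep with the (discounted) return from that step on."""
    out = torch.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out
```

Because the reward is just a similarity score in the joint embedding space, the same computation can be run at test time on an unseen instruction, which is how the return the policy conditions on stays adaptive.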
Related papers
- CoPS: Empowering LLM Agents with Provable Cross-Task Experience Sharing [70.25689961697523]
We propose a generalizable algorithm that enhances sequential reasoning by cross-task experience sharing and selection.
Our work bridges the gap between existing sequential reasoning paradigms and validates the effectiveness of leveraging cross-task experiences.
arXiv Detail & Related papers (2024-10-22T03:59:53Z)
- M2-CLIP: A Multimodal, Multi-task Adapting Framework for Video Action Recognition [39.92547393649842]
We introduce a novel multimodal, multi-task CLIP adapting framework, named M2-CLIP, to address these challenges.
We demonstrate exceptional performance in supervised learning while maintaining strong generalization in zero-shot scenarios.
arXiv Detail & Related papers (2024-01-22T02:03:31Z)
- Value Explicit Pretraining for Learning Transferable Representations [11.069853883599102]
We propose a method that learns generalizable representations for transfer reinforcement learning.
We learn new tasks that share similar objectives with previously learned tasks by learning an encoder for objective-conditioned representations.
Experiments using a realistic navigation simulator and Atari benchmark show that the pretrained encoder produced by our method outperforms current SoTA pretraining methods.
arXiv Detail & Related papers (2023-12-19T17:12:35Z)
- Few-shot Action Recognition with Captioning Foundation Models [61.40271046233581]
CapFSAR is a framework to exploit knowledge of multimodal models without manually annotating text.
A Transformer-based visual-text aggregation module is further designed to incorporate cross-modal and cross-temporal complementary information.
Experiments on multiple standard few-shot benchmarks demonstrate that the proposed CapFSAR performs favorably against existing methods.
arXiv Detail & Related papers (2023-10-16T07:08:39Z)
- Instruction-Following Agents with Multimodal Transformer [95.70039658112873]
We propose a simple yet effective model for robots to solve instruction-following tasks in vision-based environments.
Our method consists of a multimodal transformer that encodes visual observations and language instructions.
We show that this unified transformer model outperforms all state-of-the-art pre-trained or trained-from-scratch methods in both single-task and multi-task settings.
arXiv Detail & Related papers (2022-10-24T17:46:47Z)
- Text Generation with Efficient (Soft) Q-Learning [91.47743595382758]
Reinforcement learning (RL) offers a more flexible solution by allowing users to plug in arbitrary task metrics as rewards.
We introduce a new RL formulation for text generation from the soft Q-learning perspective (a generic sketch of the soft Bellman target appears after this list).
We apply the approach to a wide range of tasks, including learning from noisy/negative examples, adversarial attacks, and prompt generation.
arXiv Detail & Related papers (2021-06-14T18:48:40Z)
- PsiPhi-Learning: Reinforcement Learning with Demonstrations using Successor Features and Inverse Temporal Difference Learning [102.36450942613091]
We propose an inverse reinforcement learning algorithm called inverse temporal difference learning (ITD).
We show how to seamlessly integrate ITD with learning from online environment interactions, arriving at a novel algorithm for reinforcement learning with demonstrations, called $\Psi\Phi$-learning.
arXiv Detail & Related papers (2021-02-24T21:12:09Z)
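The soft Q-learning entry above builds on the standard soft Bellman backup. Below is a generic sketch of that target with tokens as actions over a vocabulary; it illustrates the named technique, not that paper's exact objective:

```python
# Generic soft Q-learning target for token-level text generation:
# r_t + gamma * alpha * logsumexp(Q(s_{t+1}, .) / alpha).
import torch

def soft_q_target(reward: float, next_q: torch.Tensor,
                  alpha: float = 1.0, gamma: float = 1.0,
                  done: bool = False) -> torch.Tensor:
    """Soft Bellman target for one generation step.
    `next_q` holds Q-values over the vocabulary at the next step."""
    if done:  # terminal token: the target is just the reward
        return torch.tensor(reward, dtype=next_q.dtype)
    soft_value = alpha * torch.logsumexp(next_q / alpha, dim=-1)
    return reward + gamma * soft_value
```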
This list is automatically generated from the titles and abstracts of the papers on this site.