Related papers: Language-Model-Assisted Bi-Level Programming for Reward Learning from Internet Videos

Language-Model-Assisted Bi-Level Programming for Reward Learning from Internet Videos

URL: http://arxiv.org/abs/2410.09286v1
Date: Fri, 11 Oct 2024 22:31:39 GMT
Title: Language-Model-Assisted Bi-Level Programming for Reward Learning from Internet Videos
Authors: Harsh Mahesheka, Zhixian Xie, Zhaoran Wang, Wanxin Jin,
Abstract summary: We introduce a language-model-assisted bi-level programming framework that enables a reinforcement learning agent to learn its reward from internet videos. The framework includes two levels: an upper level where a vision-language model (VLM) provides feedback by comparing the learner's behavior with expert videos, and a lower level where a large language model (LLM) translates this feedback into reward updates. We validate the method for reward learning from YouTube videos, and the results have shown that the proposed method enables efficient reward design from expert videos of biological agents.
Score: 48.2044649011213
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Learning from Demonstrations, particularly from biological experts like humans and animals, often encounters significant data acquisition challenges. While recent approaches leverage internet videos for learning, they require complex, task-specific pipelines to extract and retarget motion data for the agent. In this work, we introduce a language-model-assisted bi-level programming framework that enables a reinforcement learning agent to directly learn its reward from internet videos, bypassing dedicated data preparation. The framework includes two levels: an upper level where a vision-language model (VLM) provides feedback by comparing the learner's behavior with expert videos, and a lower level where a large language model (LLM) translates this feedback into reward updates. The VLM and LLM collaborate within this bi-level framework, using a "chain rule" approach to derive a valid search direction for reward learning. We validate the method for reward learning from YouTube videos, and the results have shown that the proposed method enables efficient reward design from expert videos of biological agents for complex behavior synthesis.

Related papers

Dual Learning with Dynamic Knowledge Distillation and Soft Alignment for Partially Relevant Video Retrieval [53.54695034420311]
In practice, videos are typically untrimmed in long durations with much more complicated background content.<n>We propose a novel framework that distills generalization knowledge from a powerful large-scale vision-language pre-trained model.<n>Experiment results demonstrate that our proposed model achieves state-of-the-art performance on TVR, ActivityNet, and Charades-STA datasets.
arXiv Detail & Related papers (2025-10-14T08:38:20Z)
Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency [56.475612147721264]
We propose a dual-reward formulation that supervises both semantic and temporal reasoning through discrete and continuous reward signals.<n>We evaluate our approach across eight representative video understanding tasks, including VideoQA, Temporal Video Grounding, and Grounded VideoQA.<n>Results underscore the importance of reward design and data selection in advancing reasoning-centric video understanding with MLLMs.
arXiv Detail & Related papers (2025-06-02T17:28:26Z)
Video2Reward: Generating Reward Function from Videos for Legged Robot Behavior Learning [27.233232260388682]
We introduce a new video2reward method, which directly generates reward functions from videos depicting the behaviors to be mimicked and learned. Our method surpasses the performance of state-of-the-art LLM-based reward generation methods by over 37.6% in terms of human normalized score.
arXiv Detail & Related papers (2024-12-07T03:10:27Z)
ST-LLM: Large Language Models Are Effective Temporal Learners [58.79456373423189]
Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation. How to effectively encode and understand videos in video-based dialogue systems remains to be solved. We propose ST-LLM, an effective video-LLM baseline with spatial-temporal sequence modeling inside LLM.
arXiv Detail & Related papers (2024-03-30T10:11:26Z)
VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools [44.78291853329394]
textbfVidCoM is a fast adaptive framework that leverages Large Language Models (LLMs) to reason about videos using lightweight visual tools. An InsOVER algorithm locates the corresponding video events based on an efficient Hungarian matching between decompositions of linguistic instructions and video events.
arXiv Detail & Related papers (2023-10-16T17:05:56Z)
CodeGen2: Lessons for Training LLMs on Programming and Natural Languages [116.74407069443895]
We unify encoder and decoder-based models into a single prefix-LM. For learning methods, we explore the claim of a "free lunch" hypothesis. For data distributions, the effect of a mixture distribution and multi-epoch training of programming and natural languages on model performance is explored.
arXiv Detail & Related papers (2023-05-03T17:55:25Z)
Reinforcement Learning Friendly Vision-Language Model for Minecraft [31.863271032186038]
We propose a novel cross-modal contrastive learning framework architecture, CLIP4MC. We aim to learn a reinforcement learning (RL) friendly vision-language model (VLM) that serves as an intrinsic reward function for open-ended tasks. We demonstrate that the proposed method achieves better performance on RL tasks compared with baselines.
arXiv Detail & Related papers (2023-03-19T05:20:52Z)
VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation [124.02278735049235]
VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels. We evaluate various baseline methods with and without large-scale VidL pre-training. The significant gap between our best model and human performance calls for future study for advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z)
Learning from Weakly-labeled Web Videos via Exploring Sub-Concepts [89.06560404218028]
We introduce a new method for pre-training video action recognition models using queried web videos. Instead of trying to filter out, we propose to convert the potential noises in these queried videos to useful supervision signals. We show that SPL outperforms several existing pre-training strategies using pseudo-labels.
arXiv Detail & Related papers (2021-01-11T05:50:16Z)
Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations. For training and evaluation, we contribute a new dataset ApartmenTour' that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.