Vision-Language Models as Success Detectors
- URL: http://arxiv.org/abs/2303.07280v1
- Date: Mon, 13 Mar 2023 16:54:11 GMT
- Title: Vision-Language Models as Success Detectors
- Authors: Yuqing Du, Ksenia Konyushkova, Misha Denil, Akhil Raju, Jessica
Landon, Felix Hill, Nando de Freitas, Serkan Cabi
- Abstract summary: We study success detection across three vastly different domains: (i) interactive language-conditioned agents in a simulated household, (ii) real world robotic manipulation, and (iii) "in-the-wild" human egocentric videos.
We investigate the generalisation properties of a Flamingo-based success detection model across unseen language and visual changes in the first two domains, and find that it outperforms bespoke reward models under either variation.
In the last domain of "in-the-wild" human videos, we show that success detection on unseen real videos presents an even more challenging generalisation task warranting future work.
- Score: 22.04312297048653
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Detecting successful behaviour is crucial for training intelligent agents. As
such, generalisable reward models are a prerequisite for agents that can learn
to generalise their behaviour. In this work we focus on developing robust
success detectors that leverage large, pretrained vision-language models
(Flamingo, Alayrac et al. (2022)) and human reward annotations. Concretely, we
treat success detection as a visual question answering (VQA) problem, denoted
SuccessVQA. We study success detection across three vastly different domains:
(i) interactive language-conditioned agents in a simulated household, (ii) real
world robotic manipulation, and (iii) "in-the-wild" human egocentric videos. We
investigate the generalisation properties of a Flamingo-based success detection
model across unseen language and visual changes in the first two domains, and
find that the proposed method is able to outperform bespoke reward models in
out-of-distribution test scenarios with either variation. In the last domain of
"in-the-wild" human videos, we show that success detection on unseen real
videos presents an even more challenging generalisation task warranting future
work. We hope our initial results encourage further work in real world success
detection and reward modelling.
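Since the paper casts success detection as binary visual question answering, the sketch below illustrates how a SuccessVQA-style classifier could be wired up: take frames from a clip, ask the model whether the task succeeded, and compare the scores of "yes" and "no". This is a minimal illustration, not the authors' Flamingo pipeline; the `VLMScorer` interface, the prompt template, and the frame representation are all assumptions.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


class VLMScorer(Protocol):
    """Hypothetical interface to a vision-language model backend.

    Any model (e.g. a Flamingo-style VLM) that can score a candidate
    answer given video frames and a text prompt fits this protocol.
    """

    def answer_logprob(
        self, frames: Sequence[bytes], prompt: str, answer: str
    ) -> float:
        ...


@dataclass
class SuccessVQA:
    """Casts success detection as binary VQA: ask whether the agent
    completed the task and compare the model's scores for yes vs. no."""

    model: VLMScorer

    def build_prompt(self, task_description: str) -> str:
        # The exact wording is an assumption; the abstract only specifies
        # the question-answering framing, not a prompt template.
        return f"Question: Did the agent successfully {task_description}? Answer:"

    def is_success(self, frames: Sequence[bytes], task_description: str) -> bool:
        prompt = self.build_prompt(task_description)
        lp_yes = self.model.answer_logprob(frames, prompt, "yes")
        lp_no = self.model.answer_logprob(frames, prompt, "no")
        return lp_yes > lp_no
```

Framed this way, a single detector covers all three domains studied in the paper: only the frames and the task description change, which is what makes the VQA formulation attractive for reward modelling.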
Related papers
- Affordance-Guided Reinforcement Learning via Visual Prompting [51.361977466993345]
We study rewards shaped by vision-language models (VLMs) to define dense rewards for robotic learning.
On a real-world manipulation task specified by natural language description, we find that these rewards improve the sample efficiency of autonomous RL.
arXiv Detail & Related papers (2024-07-14T21:41:29Z)
- Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations [77.31328397965653]
We introduce Ag2Manip (Agent-Agnostic representations for Manipulation), a framework aimed at surmounting these challenges through two key innovations.
The first is a novel agent-agnostic visual representation derived from human manipulation videos, with the specifics of embodiments obscured to enhance generalizability.
The second is an agent-agnostic action representation that abstracts a robot's kinematics to a universal agent proxy, emphasizing the crucial interactions between end-effector and object.
arXiv Detail & Related papers (2024-04-26T16:40:17Z)
- RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback [24.759613248409167]
Reward engineering has long been a challenge in Reinforcement Learning research.
We propose RL-VLM-F, a method that automatically generates reward functions for agents to learn new tasks.
We demonstrate that RL-VLM-F successfully produces effective rewards and policies across various domains.
arXiv Detail & Related papers (2024-02-06T04:06:06Z)
- What Makes Pre-Trained Visual Representations Successful for Robust Manipulation? [57.92924256181857]
We find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture.
We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models.
arXiv Detail & Related papers (2023-11-03T18:09:08Z)
- A Two-stage Fine-tuning Strategy for Generalizable Manipulation Skill of Embodied AI [15.480968464853769]
We propose a novel two-stage fine-tuning strategy to enhance the generalization capability of our model based on the Maniskill2 benchmark.
Our findings highlight the potential of our method to improve the generalization abilities of Embodied AI models and pave the way for their practical application in real-world scenarios.
arXiv Detail & Related papers (2023-07-21T04:15:36Z)
- Human-Timescale Adaptation in an Open-Ended Task Space [56.55530165036327]
We show that training an RL agent at scale leads to a general in-context learning algorithm that can adapt to open-ended novel embodied 3D problems as quickly as humans.
Our results lay the foundation for increasingly general and adaptive RL agents that perform well across ever-larger open-ended domains.
arXiv Detail & Related papers (2023-01-18T15:39:21Z)
- H-SAUR: Hypothesize, Simulate, Act, Update, and Repeat for Understanding Object Articulations from Interactions [62.510951695174604]
"Hypothesize, Simulate, Act, Update, and Repeat" (H-SAUR) is a probabilistic generative framework that generates hypotheses about how objects articulate given input observations.
We show that the proposed model significantly outperforms the current state-of-the-art articulated object manipulation framework.
We further improve the test-time efficiency of H-SAUR by integrating a learned prior from learning-based vision models.
arXiv Detail & Related papers (2022-10-22T18:39:33Z)
- SSMTL++: Revisiting Self-Supervised Multi-Task Learning for Video Anomaly Detection [108.57862846523858]
We revisit the self-supervised multi-task learning framework, proposing several updates to the original method.
We modernize the 3D convolutional backbone by introducing multi-head self-attention modules.
In our attempt to further improve the model, we study additional self-supervised learning tasks, such as predicting segmentation maps.
arXiv Detail & Related papers (2022-07-16T19:25:41Z)
- Win-Fail Action Recognition [4.56877715768796]
We introduce the task of win-fail action recognition: differentiating between successful and failed attempts at various activities.
Unlike existing action recognition datasets, intra-class variation is high, making the task challenging yet feasible.
We systematically analyze the characteristics of the win-fail task/dataset with prototypical action recognition networks and a novel video retrieval task.
arXiv Detail & Related papers (2021-02-15T06:03:10Z)
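Several entries above (Affordance-Guided Reinforcement Learning via Visual Prompting, RL-VLM-F) turn VLM judgments into learned rewards. A common recipe for this, shown below as a minimal sketch rather than either paper's exact method, is preference-based reward learning: query a VLM for pairwise preferences over observations and fit a reward network with a Bradley-Terry loss. The network shape, the use of embedding inputs, and the hyperparameters here are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardNet(nn.Module):
    """Small MLP mapping an observation embedding to a scalar reward."""

    def __init__(self, obs_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).squeeze(-1)


def preference_loss(
    reward_net: RewardNet,
    obs_a: torch.Tensor,
    obs_b: torch.Tensor,
    vlm_prefers_a: torch.Tensor,
) -> torch.Tensor:
    """Bradley-Terry loss: the VLM's pairwise answer to 'which observation
    looks closer to completing the task?' supervises the reward difference."""
    logits = reward_net(obs_a) - reward_net(obs_b)  # higher => a preferred
    return F.binary_cross_entropy_with_logits(logits, vlm_prefers_a.float())


# Toy update with random stand-in data; in practice obs_a/obs_b would be
# image embeddings and vlm_prefers_a would come from querying a real VLM.
net = RewardNet(obs_dim=32)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
obs_a, obs_b = torch.randn(64, 32), torch.randn(64, 32)
labels = torch.randint(0, 2, (64,))
opt.zero_grad()
preference_loss(net, obs_a, obs_b, labels).backward()
opt.step()
```

The learned reward can then supply the dense signal that pure success detection lacks, at the cost of depending on the reliability of the VLM's judgments.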