Vision-Language Models as Success Detectors
- URL: http://arxiv.org/abs/2303.07280v1
- Date: Mon, 13 Mar 2023 16:54:11 GMT
- Title: Vision-Language Models as Success Detectors
- Authors: Yuqing Du, Ksenia Konyushkova, Misha Denil, Akhil Raju, Jessica
Landon, Felix Hill, Nando de Freitas, Serkan Cabi
- Abstract summary: We study success detection across three vastly different domains: (i) interactive language-conditioned agents in a simulated household, (ii) real world robotic manipulation, and (iii) "in-the-wild" human egocentric videos.
We investigate the generalisation properties of a Flamingo-based success detection model across unseen language and visual changes in the first two domains, and find that it outperforms bespoke reward models under either variation.
In the last domain of "in-the-wild" human videos, we show that success detection on unseen real videos presents an even more challenging generalisation task warranting future work.
- Score: 22.04312297048653
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Detecting successful behaviour is crucial for training intelligent agents. As
such, generalisable reward models are a prerequisite for agents that can learn
to generalise their behaviour. In this work we focus on developing robust
success detectors that leverage large, pretrained vision-language models
(Flamingo, Alayrac et al. (2022)) and human reward annotations. Concretely, we
treat success detection as a visual question answering (VQA) problem, denoted
SuccessVQA. We study success detection across three vastly different domains:
(i) interactive language-conditioned agents in a simulated household, (ii) real
world robotic manipulation, and (iii) "in-the-wild" human egocentric videos. We
investigate the generalisation properties of a Flamingo-based success detection
model across unseen language and visual changes in the first two domains, and
find that the proposed method is able to outperform bespoke reward models in
out-of-distribution test scenarios with either variation. In the last domain of
"in-the-wild" human videos, we show that success detection on unseen real
videos presents an even more challenging generalisation task warranting future
work. We hope our initial results encourage further work in real world success
detection and reward modelling.
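Since the paper casts success detection as binary visual question answering, the sketch below illustrates how a SuccessVQA-style classifier could be wired up: take frames from a clip, ask the model whether the task succeeded, and compare the scores of "yes" and "no". This is a minimal illustration, not the authors' Flamingo pipeline; the `VLMScorer` interface, the prompt template, and the frame representation are all assumptions.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


class VLMScorer(Protocol):
    """Hypothetical interface to a vision-language model backend.

    Any model (e.g. a Flamingo-style VLM) that can score a candidate
    answer given video frames and a text prompt fits this protocol.
    """

    def answer_logprob(
        self, frames: Sequence[bytes], prompt: str, answer: str
    ) -> float:
        ...


@dataclass
class SuccessVQA:
    """Casts success detection as binary VQA: ask whether the agent
    completed the task and compare the model's scores for yes vs. no."""

    model: VLMScorer

    def build_prompt(self, task_description: str) -> str:
        # The exact wording is an assumption; the abstract only specifies
        # the question-answering framing, not a prompt template.
        return f"Question: Did the agent successfully {task_description}? Answer:"

    def is_success(self, frames: Sequence[bytes], task_description: str) -> bool:
        prompt = self.build_prompt(task_description)
        lp_yes = self.model.answer_logprob(frames, prompt, "yes")
        lp_no = self.model.answer_logprob(frames, prompt, "no")
        return lp_yes > lp_no
```

Framed this way, a single detector covers all three domains studied in the paper: only the frames and the task description change, which is what makes the VQA formulation attractive for reward modelling.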
Related papers
- Affordance-Guided Reinforcement Learning via Visual Prompting [51.361977466993345]
We study rewards shaped by vision-language models (VLMs) to define dense rewards for robotic learning.
On a real-world manipulation task specified by natural language description, we find that these rewards improve the sample efficiency of autonomous RL.
arXiv Detail & Related papers (2024-07-14T21:41:29Z)
- Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations [77.31328397965653]
We introduce Ag2Manip (Agent-Agnostic representations for Manipulation), a framework aimed at surmounting these challenges through two key innovations.
The first is a novel agent-agnostic visual representation derived from human manipulation videos, with the specifics of embodiments obscured to enhance generalizability.
The second is an agent-agnostic action representation that abstracts a robot's kinematics to a universal agent proxy, emphasizing the crucial interactions between end-effector and object.
arXiv Detail & Related papers (2024-04-26T16:40:17Z)
- RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback [24.759613248409167]
Reward engineering has long been a challenge in Reinforcement Learning research.
We propose RL-VLM-F, a method that automatically generates reward functions for agents to learn new tasks.
We demonstrate that RL-VLM-F successfully produces effective rewards and policies across various domains.
arXiv Detail & Related papers (2024-02-06T04:06:06Z)
- What Makes Pre-Trained Visual Representations Successful for Robust Manipulation? [57.92924256181857]
We find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture.
We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models.
arXiv Detail & Related papers (2023-11-03T18:09:08Z)
- A Two-stage Fine-tuning Strategy for Generalizable Manipulation Skill of Embodied AI [15.480968464853769]
We propose a novel two-stage fine-tuning strategy to enhance the generalization capability of our model based on the Maniskill2 benchmark.
Our findings highlight the potential of our method to improve the generalization abilities of Embodied AI models and pave the way for their practical application in real-world scenarios.
arXiv Detail & Related papers (2023-07-21T04:15:36Z)
- Human-Timescale Adaptation in an Open-Ended Task Space [56.55530165036327]
We show that training an RL agent at scale leads to a general in-context learning algorithm that can adapt to open-ended novel embodied 3D problems as quickly as humans.
Our results lay the foundation for increasingly general and adaptive RL agents that perform well across ever-larger open-ended domains.
arXiv Detail & Related papers (2023-01-18T15:39:21Z)
- H-SAUR: Hypothesize, Simulate, Act, Update, and Repeat for Understanding Object Articulations from Interactions [62.510951695174604]
"Hypothesize, Simulate, Act, Update, and Repeat" (H-SAUR) is a probabilistic generative framework that generates hypotheses about how objects articulate given input observations.
We show that the proposed model significantly outperforms the current state-of-the-art articulated object manipulation framework.
We further improve the test-time efficiency of H-SAUR by integrating a learned prior from learning-based vision models.
arXiv Detail & Related papers (2022-10-22T18:39:33Z)
- SSMTL++: Revisiting Self-Supervised Multi-Task Learning for Video Anomaly Detection [108.57862846523858]
We revisit the self-supervised multi-task learning framework, proposing several updates to the original method.
We modernize the 3D convolutional backbone by introducing multi-head self-attention modules.
In our attempt to further improve the model, we study additional self-supervised learning tasks, such as predicting segmentation maps.
arXiv Detail & Related papers (2022-07-16T19:25:41Z)
- Win-Fail Action Recognition [4.56877715768796]
We introduce the task of win-fail action recognition: differentiating between successful and failed attempts at various activities.
Unlike existing action recognition datasets, intra-class variation is high, making the task challenging yet feasible.
We systematically analyze the characteristics of the win-fail task/dataset with prototypical action recognition networks and a novel video retrieval task.
arXiv Detail & Related papers (2021-02-15T06:03:10Z)
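Several entries above (Affordance-Guided Reinforcement Learning via Visual Prompting, RL-VLM-F) turn VLM judgments into learned rewards. A common recipe for this, shown below as a minimal sketch rather than either paper's exact method, is preference-based reward learning: query a VLM for pairwise preferences over observations and fit a reward network with a Bradley-Terry loss. The network shape, the use of embedding inputs, and the hyperparameters here are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardNet(nn.Module):
    """Small MLP mapping an observation embedding to a scalar reward."""

    def __init__(self, obs_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).squeeze(-1)


def preference_loss(
    reward_net: RewardNet,
    obs_a: torch.Tensor,
    obs_b: torch.Tensor,
    vlm_prefers_a: torch.Tensor,
) -> torch.Tensor:
    """Bradley-Terry loss: the VLM's pairwise answer to 'which observation
    looks closer to completing the task?' supervises the reward difference."""
    logits = reward_net(obs_a) - reward_net(obs_b)  # higher => a preferred
    return F.binary_cross_entropy_with_logits(logits, vlm_prefers_a.float())


# Toy update with random stand-in data; in practice obs_a/obs_b would be
# image embeddings and vlm_prefers_a would come from querying a real VLM.
net = RewardNet(obs_dim=32)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
obs_a, obs_b = torch.randn(64, 32), torch.randn(64, 32)
labels = torch.randint(0, 2, (64,))
opt.zero_grad()
preference_loss(net, obs_a, obs_b, labels).backward()
opt.step()
```

The learned reward can then supply the dense signal that pure success detection lacks, at the cost of depending on the reliability of the VLM's judgments.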