Weakly-supervised Latent Models for Task-specific Visual-Language Control
- URL: http://arxiv.org/abs/2511.18319v1
- Date: Sun, 23 Nov 2025 07:18:28 GMT
- Title: Weakly-supervised Latent Models for Task-specific Visual-Language Control
- Authors: Xian Yeow Lee, Lasitha Vidyaratne, Gregory Sin, Ahmed Farahat, Chetan Gupta,
- Abstract summary: We propose a task-specific latent dynamics model that learns state-specific action-induced shifts in a shared latent space using only goal-state supervision.<n>In experiments, our approach achieves 71% success and generalizes to unseen images and instructions.
- Score: 2.10305808315957
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Autonomous inspection in hazardous environments requires AI agents that can interpret high-level goals and execute precise control. A key capability for such agents is spatial grounding, for example when a drone must center a detected object in its camera view to enable reliable inspection. While large language models provide a natural interface for specifying goals, using them directly for visual control achieves only 58\% success in this task. We envision that equipping agents with a world model as a tool would allow them to roll out candidate actions and perform better in spatially grounded settings, but conventional world models are data and compute intensive. To address this, we propose a task-specific latent dynamics model that learns state-specific action-induced shifts in a shared latent space using only goal-state supervision. The model leverages global action embeddings and complementary training losses to stabilize learning. In experiments, our approach achieves 71\% success and generalizes to unseen images and instructions, highlighting the potential of compact, domain-specific latent dynamics models for spatial alignment in autonomous inspection.
Related papers
- Universal Pose Pretraining for Generalizable Vision-Language-Action Policies [83.39008378156647]
Existing Vision-Language-Action (VLA) models often suffer from feature collapse and low training efficiency.<n>We propose Pose-VLA, a decoupled paradigm that separates VLA training into a pre-training phase for extracting universal 3D spatial priors.<n>Our framework follows a two-stage pre-training pipeline, establishing fundamental spatial grounding via poses followed by motion alignment.
arXiv Detail & Related papers (2026-02-23T11:00:08Z) - PosA-VLA: Enhancing Action Generation via Pose-Conditioned Anchor Attention [92.85371254435074]
PosA-VLA framework anchors visual attention via pose-conditioned supervision, consistently guiding the model's perception toward task-relevant regions.<n>We show that our method executes embodied tasks with precise and time-efficient behavior across diverse robotic manipulation benchmarks.
arXiv Detail & Related papers (2025-12-03T12:14:29Z) - Exploring Conditions for Diffusion models in Robotic Control [70.27711404291573]
We explore leveraging pre-trained text-to-image diffusion models to obtain task-adaptive visual representations for robotic control.<n>We find that naively applying textual conditions yields minimal or even negative gains in control tasks.<n>We propose ORCA, which introduces learnable task prompts that adapt to the control environment and visual prompts that capture fine-grained, frame-specific details.
arXiv Detail & Related papers (2025-10-17T10:24:14Z) - Executable Analytic Concepts as the Missing Link Between VLM Insight and Precise Manipulation [70.8381970762877]
Vision-Language Models (VLMs) have demonstrated remarkable capabilities in semantic reasoning and task planning.<n>We introduce GRACE, a novel framework that grounds VLM-based reasoning through executable analytic concepts.<n>G GRACE provides a unified and interpretable interface between high-level instruction understanding and low-level robot control.
arXiv Detail & Related papers (2025-10-09T09:08:33Z) - SemNav: A Model-Based Planner for Zero-Shot Object Goal Navigation Using Vision-Foundation Models [10.671262416557704]
Vision Foundation Models (VFMs) offer powerful capabilities for visual understanding and reasoning.<n>We present a zero-shot object goal navigation framework that integrates the perceptual strength of VFMs with a model-based planner.<n>We evaluate our approach on the HM3D dataset using the Habitat simulator and demonstrate that our method achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-06-04T03:04:54Z) - Representing Positional Information in Generative World Models for Object Manipulation [12.263162194821787]
We introduce a general approach that empowers world model-based agents to solve object-positioning tasks.
In particular, LCP employs object-centric latent representations that explicitly capture object positional information for goal specification.
Our methods are rigorously evaluated across several manipulation environments, showing favorable performance compared to current model-based control approaches.
arXiv Detail & Related papers (2024-09-18T14:19:50Z) - TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild [77.59069361196404]
TRiPOD is a novel method for predicting body dynamics based on graph attentional networks.
To incorporate a real-world challenge, we learn an indicator representing whether an estimated body joint is visible/invisible at each frame.
Our evaluation shows that TRiPOD outperforms all prior work and state-of-the-art specifically designed for each of the trajectory and pose forecasting tasks.
arXiv Detail & Related papers (2021-04-08T20:01:00Z) - Model-Based Visual Planning with Self-Supervised Functional Distances [104.83979811803466]
We present a self-supervised method for model-based visual goal reaching.
Our approach learns entirely using offline, unlabeled data.
We find that this approach substantially outperforms both model-free and model-based prior methods.
arXiv Detail & Related papers (2020-12-30T23:59:09Z) - Goal-Aware Prediction: Learning to Model What Matters [105.43098326577434]
One of the fundamental challenges in using a learned forward dynamics model is the mismatch between the objective of the learned model and that of the downstream planner or policy.
We propose to direct prediction towards task relevant information, enabling the model to be aware of the current task and encouraging it to only model relevant quantities of the state space.
We find that our method more effectively models the relevant parts of the scene conditioned on the goal, and as a result outperforms standard task-agnostic dynamics models and model-free reinforcement learning.
arXiv Detail & Related papers (2020-07-14T16:42:59Z) - Goal-Conditioned End-to-End Visuomotor Control for Versatile Skill
Primitives [89.34229413345541]
We propose a conditioning scheme which avoids pitfalls by learning the controller and its conditioning in an end-to-end manner.
Our model predicts complex action sequences based directly on a dynamic image representation of the robot motion.
We report significant improvements in task success over representative MPC and IL baselines.
arXiv Detail & Related papers (2020-03-19T15:04:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.