Provable Ordering and Continuity in Vision-Language Pretraining for Generalizable Embodied Agents
- URL: http://arxiv.org/abs/2502.01218v2
- Date: Thu, 22 May 2025 08:03:17 GMT
- Title: Provable Ordering and Continuity in Vision-Language Pretraining for Generalizable Embodied Agents
- Authors: Zhizhen Zhang, Lei Zhu, Zhen Fang, Zi Huang, Yadan Luo
- Abstract summary: We propose Action Temporal Coherence Learning (AcTOL) to learn ordered and continuous vision-language representations without rigid goal-based constraints. AcTOL treats a video as a continuous trajectory where it (1) contrasts semantic differences between frames to reflect their natural ordering, and (2) imposes a local Brownian bridge constraint to ensure smooth transitions across intermediate frames.
- Score: 39.95793203302782
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-training vision-language representations on human action videos has emerged as a promising approach to reduce reliance on large-scale expert demonstrations for training embodied agents. However, prior methods often employ time contrastive learning based on goal-reaching heuristics, progressively aligning language instructions from the initial to the final frame. This overemphasis on future frames can result in erroneous vision-language associations, as actions may terminate early or include irrelevant moments at the end. To address this issue, we propose Action Temporal Coherence Learning (AcTOL) to learn ordered and continuous vision-language representations without rigid goal-based constraints. AcTOL treats a video as a continuous trajectory where it (1) contrasts semantic differences between frames to reflect their natural ordering, and (2) imposes a local Brownian bridge constraint to ensure smooth transitions across intermediate frames. Extensive imitation learning experiments on both simulated and real robots show that the pretrained features significantly enhance downstream manipulation tasks with high robustness to different linguistic styles of instructions, offering a viable pathway toward generalized embodied agents.
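The two training signals named in the abstract can be pictured with a short PyTorch-style sketch. This is a minimal illustration, not the authors' implementation: the function names, exact loss forms, window size, temperature, and loss weight are all assumptions, the vision-language alignment term is omitted, and only the frame-ordering contrast and the local Brownian-bridge smoothness idea are sketched over L2-normalized frame features of shape (T, D).

```python
import torch
import torch.nn.functional as F


def ordering_contrast_loss(frame_feats, tau=0.07):
    """One possible reading of the frame-ordering contrast: a frame should be
    more similar to its temporal neighbors than to frames farther away, so
    similarity to any anchor frame decays with temporal distance."""
    sims = frame_feats @ frame_feats.T  # (T, T) cosine similarities (features assumed normalized)
    T = sims.shape[0]
    losses = []
    for i in range(T):
        for j in range(T):
            for k in range(T):
                if abs(i - j) < abs(i - k):  # frame j is temporally closer to anchor i than frame k
                    losses.append(F.softplus((sims[i, k] - sims[i, j]) / tau))
    return torch.stack(losses).mean()


def local_brownian_bridge_loss(frame_feats, window=4, sigma=1.0):
    """One possible reading of the local Brownian bridge constraint: within each
    short window, an intermediate frame should stay near the interpolation of the
    window's endpoints, with tolerance sigma^2 * alpha * (1 - alpha) as in a
    Brownian bridge, so transitions across intermediate frames stay smooth."""
    T = frame_feats.shape[0]
    losses = []
    for s in range(T - window + 1):
        start, end = frame_feats[s], frame_feats[s + window - 1]
        for k in range(1, window - 1):
            alpha = k / (window - 1)
            mean = (1 - alpha) * start + alpha * end
            var = sigma ** 2 * alpha * (1 - alpha) + 1e-6
            losses.append(((frame_feats[s + k] - mean) ** 2 / var).mean())
    return torch.stack(losses).mean()


# Toy usage with assumed shapes (T=8 frames, D=512); the 0.1 weight is an assumption.
frames = F.normalize(torch.randn(8, 512), dim=-1)
loss = ordering_contrast_loss(frames) + 0.1 * local_brownian_bridge_loss(frames)
```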
Related papers
- SignVLA: A Gloss-Free Vision-Language-Action Framework for Real-Time Sign Language-Guided Robotic Manipulation [1.4175612723267692]
We present the first sign language-driven Vision-Language-Action (VLA) framework for intuitive human-robot interaction. Unlike conventional approaches that rely on gloss annotations as intermediate supervision, the proposed system adopts a gloss-free paradigm. We focus on a real-time alphabet-level finger-spelling interface that provides a robust and low-latency communication channel for robotic control.
arXiv Detail & Related papers (2026-02-26T01:16:27Z) - Towards Mitigating Modality Bias in Vision-Language Models for Temporal Action Localization [20.179748846776995]
We propose ActionVLM, a vision-language aggregation framework that mitigates modality bias in TAL. Our key insight is to preserve vision as the dominant signal while adaptively exploiting language only when beneficial. Experiments on THUMOS14 show that our model outperforms state-of-the-art by up to 3.2% mAP.
arXiv Detail & Related papers (2026-01-28T22:03:46Z) - CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos [73.51386721543135]
We propose Contrastive Latent Action Pretraining (CLAP), a framework that aligns the visual latent space from videos with a proprioceptive latent space from robot trajectories. CLAP maps video transitions onto a quantized, physically executable codebook. We introduce a dual-formulation VLA framework offering both CLAP-NTP, an autoregressive model excelling at instruction following and object generalization, and CLAP-RF, a Rectified Flow-based policy designed for high-frequency, precise manipulation.
arXiv Detail & Related papers (2026-01-07T16:26:33Z) - Stable Language Guidance for Vision-Language-Action Models [62.80963701282789]
Residual Semantic Steering (RSS) is a probabilistic framework that disentangles physical affordance from semantic execution. RSS achieves state-of-the-art robustness, maintaining performance even under adversarial linguistic perturbations.
arXiv Detail & Related papers (2026-01-07T16:16:10Z) - Zero-Shot Open-Vocabulary Human Motion Grounding with Test-Time Training [39.7658823121591]
ZOMG is a framework that segments motion sequences into semantically meaningful sub-actions without requiring any annotations or fine-tuning. ZOMG integrates (1) language semantic partition, which leverages large language models to decompose instructions into ordered sub-action units, and (2) soft masking optimization. Experiments on three motion-language datasets demonstrate state-of-the-art effectiveness and efficiency in motion grounding, outperforming prior methods by +8.7% mAP on the HumanML3D benchmark.
arXiv Detail & Related papers (2025-11-19T12:11:36Z) - Enhancing Generalization in Vision-Language-Action Models by Preserving Pretrained Representations [26.678553477485362]
We present a framework that better preserves pretrained features while adapting them for robot manipulation. Our approach introduces three components: (i) a dual-encoder design with one frozen vision encoder to retain pretrained features and another trainable encoder for task adaptation, (ii) a string-based action tokenizer that casts continuous actions into character sequences aligned with the model's pretraining domain, and (iii) a co-training strategy that combines robot demonstrations with vision-language datasets emphasizing spatial reasoning and affordances.
arXiv Detail & Related papers (2025-09-14T20:08:56Z) - Harnessing Textual Semantic Priors for Knowledge Transfer and Refinement in CLIP-Driven Continual Learning [19.210280671911278]
Continual learning aims to equip models with the ability to learn from a stream of tasks without forgetting previous knowledge. We propose a unified framework that harnesses the anti-forgetting and structured nature of textual priors to guide semantic-aware knowledge transfer.
arXiv Detail & Related papers (2025-08-03T04:09:00Z) - ProsodyLM: Uncovering the Emerging Prosody Processing Capabilities in Speech Language Models [70.56468982313834]
We propose ProsodyLM, which introduces a simple tokenization scheme amenable to learning prosody. We find that ProsodyLM can learn surprisingly diverse emerging prosody processing capabilities through pre-training alone.
arXiv Detail & Related papers (2025-07-27T00:59:01Z) - Collaborative Temporal Consistency Learning for Point-supervised Natural Language Video Localization [129.43937834515688]
We propose a new COllaborative Temporal consistEncy Learning (COTEL) framework to strengthen the video-language alignment.
Specifically, we first design a frame- and a segment-level Temporal Consistency Learning (TCL) module that models semantic alignment across frame saliencies and sentence-moment pairs.
arXiv Detail & Related papers (2025-03-22T05:04:12Z) - Efficient Transfer Learning for Video-language Foundation Models [13.166348605993292]
We propose a parameter-efficient Multi-modal Spatio-Temporal Adapter (MSTA) to enhance alignment between textual and visual representations.
We evaluate the effectiveness of our approach across four tasks: zero-shot transfer, few-shot learning, base-to-novel generalization, and fully-supervised learning.
arXiv Detail & Related papers (2024-11-18T01:25:58Z) - DEPTH: Discourse Education through Pre-Training Hierarchically [33.89893399779713]
DEPTH is an encoder-decoder model that learns to represent sentences using a discourse-oriented pre-training objective.
When trained from scratch or continuing from a pre-trained T5 checkpoint, DEPTH learns semantic and discourse-level representations faster than T5.
arXiv Detail & Related papers (2024-05-13T14:35:30Z) - BID: Boundary-Interior Decoding for Unsupervised Temporal Action
Localization Pre-Training [13.273908640951252]
We propose the first unsupervised pre-training framework that partitions a skeleton-based motion sequence into semantically meaningful pre-action segments.
By fine-tuning our pre-training network with a small amount of annotated data, we show results outperforming SOTA methods by a large margin.
arXiv Detail & Related papers (2024-03-12T06:23:45Z) - PALM: Predicting Actions through Language Models [74.10147822693791]
We introduce PALM, an approach that tackles the task of long-term action anticipation.
Our method incorporates an action recognition model to track previous action sequences and a vision-language model to articulate relevant environmental details.
Our experimental results demonstrate that PALM surpasses the state-of-the-art methods in the task of long-term action anticipation.
arXiv Detail & Related papers (2023-11-29T02:17:27Z) - Expedited Training of Visual Conditioned Language Generation via
Redundancy Reduction [61.16125290912494]
$\text{EVL}_\text{Gen}$ is a framework designed for the pre-training of visually conditioned language generation models.
We show that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance.
arXiv Detail & Related papers (2023-10-05T03:40:06Z) - Language-based Action Concept Spaces Improve Video Self-Supervised
Learning [8.746806973828738]
We introduce language tied self-supervised learning to adapt an image CLIP model to the video domain.
A backbone modified for temporal modeling is trained under self-distillation settings with train objectives operating in an action concept space.
Our approach improves zero-shot and linear probing performance on three action recognition benchmarks.
arXiv Detail & Related papers (2023-07-20T14:47:50Z) - EC^2: Emergent Communication for Embodied Control [72.99894347257268]
Embodied control requires agents to leverage multi-modal pre-training to quickly learn how to act in new environments.
We propose Emergent Communication for Embodied Control (EC2), a novel scheme to pre-train video-language representations for few-shot embodied control.
EC2 is shown to consistently outperform previous contrastive learning methods for both videos and texts as task inputs.
arXiv Detail & Related papers (2023-04-19T06:36:02Z) - Anticipating the Unseen Discrepancy for Vision and Language Navigation [63.399180481818405]
Vision-Language Navigation requires the agent to follow natural language instructions to reach a specific target.
The large discrepancy between seen and unseen environments makes it challenging for the agent to generalize well.
We propose Unseen Discrepancy Anticipating Vision and Language Navigation (DAVIS) that learns to generalize to unseen environments via encouraging test-time visual consistency.
arXiv Detail & Related papers (2022-09-10T19:04:40Z) - Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
Self-Regulated Learning (SRL) aims to regulate the intermediate representation consecutively to produce a representation that emphasizes the novel information in the frame at the current time-stamp.
SRL sharply outperforms existing state-of-the-art in most cases on two egocentric video datasets and two third-person video datasets.
arXiv Detail & Related papers (2021-11-23T03:29:18Z) - Towards Tokenized Human Dynamics Representation [41.75534387530019]
We study how to segment and cluster videos into recurring temporal patterns in a self-supervised way.
We evaluate the frame-wise representation learning step by Kendall's Tau and the lexicon building step by normalized mutual information and language entropy.
On the AIST++ and PKU-MMD datasets, actons bring significant performance improvements compared to several baselines.
arXiv Detail & Related papers (2021-11-22T18:59:58Z) - Weakly Supervised Temporal Adjacent Network for Language Grounding [96.09453060585497]
We introduce a novel weakly supervised temporal adjacent network (WSTAN) for temporal language grounding.
WSTAN learns cross-modal semantic alignment by exploiting temporal adjacent network in a multiple instance learning (MIL) paradigm.
An additional self-discriminating loss is devised on both the MIL branch and the complementary branch, aiming to enhance semantic discrimination by self-supervising.
arXiv Detail & Related papers (2021-06-30T15:42:08Z)