ReMoT: Reinforcement Learning with Motion Contrast Triplets
- URL: http://arxiv.org/abs/2603.00461v1
- Date: Sat, 28 Feb 2026 04:42:34 GMT
- Title: ReMoT: Reinforcement Learning with Motion Contrast Triplets
- Authors: Cong Wan, Zeyu Guo, Jiangyang Li, SongLin Dong, Yifan Bai, Lin Peng, Zhiheng Ma, Yihong Gong
- Abstract summary: We present ReMoT, a unified training paradigm to address the fundamental shortcomings of VLMs in spatio-temporal consistency. A rule-based automatic framework generates ReMoT-16K, a large-scale (16.5K triplets) motion-contrast dataset, from video meta-annotations. We also construct the first benchmark for fine-grained motion contrast triplets to measure a VLM's discrimination of subtle motion attributes.
- Score: 37.29312323908102
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present ReMoT, a unified training paradigm to systematically address the fundamental shortcomings of VLMs in spatio-temporal consistency -- a critical failure point in navigation, robotics, and autonomous driving. ReMoT integrates two core components: (1) A rule-based automatic framework that generates ReMoT-16K, a large-scale (16.5K triplets) motion-contrast dataset derived from video meta-annotations, surpassing costly manual or model-based generation. (2) Group Relative Policy Optimization, which we empirically validate yields optimal performance and data efficiency for learning this contrastive reasoning, far exceeding standard Supervised Fine-Tuning. We also construct the first benchmark for fine-grained motion contrast triplets to measure a VLM's discrimination of subtle motion attributes (e.g., opposing directions). The resulting model achieves state-of-the-art performance on our new benchmark and multiple standard VLM benchmarks, culminating in a remarkable 25.1% performance leap on spatio-temporal reasoning tasks.
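The abstract names Group Relative Policy Optimization (GRPO) but gives no training details. As a rough illustration only, the sketch below shows the group-relative advantage step that GRPO is generally known for: each sampled response is scored against the mean and standard deviation of its own sampling group. The function name, the binary reward, the use of population standard deviation, and the `eps` value are all assumptions made for this example, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (GRPO-style).

    Assumption: population standard deviation and a small eps to avoid
    division by zero when every response in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four sampled answers to one motion-contrast question,
# rewarded 1.0 when the answer picks the correct motion and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Correct answers receive a positive advantage and incorrect ones a negative advantage, while a group whose answers all earn the same reward contributes no learning signal.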
Related papers
- IRG-MotionLLM: Interleaving Motion Generation, Assessment and Refinement for Text-to-Motion Generation [54.36300724708094]
Assessment and refinement tasks act as crucial bridges to enable bidirectional knowledge flow between understanding and generation.
We introduce IRG-MotionLLM, the first model that seamlessly interleaves motion generation, assessment, and refinement to improve generation performance.
arXiv Detail & Related papers (2025-12-11T15:16:06Z)
- Automating Benchmark Design [17.34266257717423]
We develop BeTaL, a framework that automates the process of dynamic benchmark design.
We create two new benchmarks and extend a popular agentic benchmark.
BeTaL produces benchmarks much closer to the desired difficulty, with average deviations ranging from 5.3% to 13.2%.
arXiv Detail & Related papers (2025-10-28T23:53:36Z)
- UniVid: The Open-Source Unified Video Model [41.15980565061684]
We present UniVid, a unified architecture that couples an MLLM with a diffusion decoder through a lightweight adapter.
Experiments on standard benchmarks demonstrate state-of-the-art performance.
arXiv Detail & Related papers (2025-09-29T02:31:36Z)
- MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models.
MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z)
- CronusVLA: Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling [84.51372201195132]
CronusVLA is a unified framework that extends single-frame VLA models to the multi-frame paradigm.
CronusVLA achieves leading performance and superior robustness, with a 70.9% success rate.
These results highlight the potential of efficient multi-frame adaptation in VLA models for more powerful and robust real-world deployment.
arXiv Detail & Related papers (2025-06-24T17:30:27Z)
- ReCoM: Realistic Co-Speech Motion Generation with Recurrent Embedded Transformer [58.49950218437718]
We present ReCoM, an efficient framework for generating high-fidelity and generalizable human body motions synchronized with speech.
The core innovation lies in the Recurrent Embedded Transformer (RET), which integrates Dynamic Embedding Regularization (DER) into a Vision Transformer (ViT) core architecture.
To enhance model robustness, we incorporate the proposed DER strategy, which equips the model with dual capabilities of noise resistance and cross-domain generalization.
arXiv Detail & Related papers (2025-03-27T16:39:40Z)
- Reinforced Model Merging [53.84354455400038]
We present an innovative framework termed Reinforced Model Merging (RMM), which encompasses an environment and agent tailored for merging tasks.
By utilizing data subsets during the evaluation process, we addressed the bottleneck in the reward feedback phase, thereby accelerating RMM by up to 100 times.
arXiv Detail & Related papers (2025-03-27T08:52:41Z)
- TB-Bench: Training and Testing Multi-Modal AI for Understanding Spatio-Temporal Traffic Behaviors from Dashcam Images/Videos [17.41208629642756]
This study proposes TB-Bench, a benchmark to evaluate MLLMs on understanding traffic behaviors across eight perception tasks from ego-centric views.
We also introduce vision- instruction tuning, TB-100k and TB-250k, along with simple yet effective baselines for the tasks.
In contrast, when fine-tuned with TB-100k or TB-250k, our baseline models achieve average accuracy up to 85%, significantly enhancing performance on the tasks.
arXiv Detail & Related papers (2025-01-10T06:02:06Z)
- Data-Driven Approaches for Modelling Target Behaviour [1.5495593104596401]
The performance of tracking algorithms depends on the chosen model assumptions regarding the target dynamics.
This paper provides a comparative study between three different methods that use machine learning to describe the underlying object motion.
arXiv Detail & Related papers (2024-10-14T14:18:27Z)
- ProMotion: Prototypes As Motion Learners [46.08051377180652]
We introduce ProMotion, a unified prototypical framework engineered to model fundamental motion tasks.
ProMotion offers a range of compelling attributes that set it apart from current task-specific paradigms.
We capitalize on a dual mechanism involving the feature denoiser and the prototypical learner to decipher the intricacies of motion.
arXiv Detail & Related papers (2024-06-07T15:10:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.