Related papers: ReMoT: Reinforcement Learning with Motion Contrast Triplets

URL: http://arxiv.org/abs/2603.00461v1
Date: Sat, 28 Feb 2026 04:42:34 GMT
Title: ReMoT: Reinforcement Learning with Motion Contrast Triplets
Authors: Cong Wan, Zeyu Guo, Jiangyang Li, SongLin Dong, Yifan Bai, Lin Peng, Zhiheng Ma, Yihong Gong
Abstract summary: We present ReMoT, a unified training paradigm to address the fundamental shortcomings of VLMs in spatio-temporal consistency. A rule-based automatic framework generates ReMoT-16K, a large-scale (16.5K triplets) motion-contrast dataset derived from video meta-annotations. We also construct the first benchmark for fine-grained motion contrast triplets to measure a VLM's discrimination of subtle motion attributes.
Score: 37.29312323908102
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: We present ReMoT, a unified training paradigm to systematically address the fundamental shortcomings of VLMs in spatio-temporal consistency -- a critical failure point in navigation, robotics, and autonomous driving. ReMoT integrates two core components: (1) A rule-based automatic framework that generates ReMoT-16K, a large-scale (16.5K triplets) motion-contrast dataset derived from video meta-annotations, surpassing costly manual or model-based generation. (2) Group Relative Policy Optimization, which we empirically validate yields optimal performance and data efficiency for learning this contrastive reasoning, far exceeding standard Supervised Fine-Tuning. We also construct the first benchmark for fine-grained motion contrast triplets to measure a VLM's discrimination of subtle motion attributes (e.g., opposing directions). The resulting model achieves state-of-the-art performance on our new benchmark and multiple standard VLM benchmarks, culminating in a remarkable 25.1% performance leap on spatio-temporal reasoning tasks.
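The abstract names Group Relative Policy Optimization (GRPO) but does not spell out its mechanics. As a rough illustration only, the sketch below shows the group-relative advantage at the core of GRPO: several answers are sampled for one prompt, and each reward is normalized against the mean and standard deviation of its own group. The function name, the binary reward values, and the use of the population standard deviation are assumptions made for this example; the source does not describe ReMoT's reward design or implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples in its group.

    Hypothetical helper: not taken from the ReMoT paper or any library.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    # eps keeps the division finite when every sample gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Assumed binary rewards for four sampled answers to one triplet question:
# 1.0 if the answer picks the correct motion, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Answers that beat their group's average receive a positive advantage and the rest a negative one, so no separate value network is needed to supply a baseline.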

Related papers

IRG-MotionLLM: Interleaving Motion Generation, Assessment and Refinement for Text-to-Motion Generation [54.36300724708094]
Assessment and refinement tasks act as crucial bridges to enable bidirectional knowledge flow between understanding and generation. We introduce IRG-MotionLLM, the first model that seamlessly interleaves motion generation, assessment, and refinement to improve generation performance.
arXiv Detail & Related papers (2025-12-11T15:16:06Z)
Automating Benchmark Design [17.34266257717423]
We develop BeTaL, a framework that automates the process of dynamic benchmark design. We create two new benchmarks and extend a popular agentic benchmark. BeTaL produces benchmarks much closer to the desired difficulty, with average deviations ranging from 5.3% to 13.2%.
arXiv Detail & Related papers (2025-10-28T23:53:36Z)
UniVid: The Open-Source Unified Video Model [41.15980565061684]
We present UniVid, a unified architecture that couples an MLLM with a diffusion decoder through a lightweight adapter. Experiments on standard benchmarks demonstrate state-of-the-art performance.
arXiv Detail & Related papers (2025-09-29T02:31:36Z)
MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models. MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z)
CronusVLA: Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling [84.51372201195132]
CronusVLA is a unified framework that extends single-frame VLA models to the multi-frame paradigm. CronusVLA achieves leading performance and superior robustness, with a 70.9% success rate. These results highlight the potential of efficient multi-frame adaptation in VLA models for more powerful and robust real-world deployment.
arXiv Detail & Related papers (2025-06-24T17:30:27Z)
ReCoM: Realistic Co-Speech Motion Generation with Recurrent Embedded Transformer [58.49950218437718]
We present ReCoM, an efficient framework for generating high-fidelity and generalizable human body motions synchronized with speech. The core innovation lies in the Recurrent Embedded Transformer (RET), which integrates Dynamic Embedding Regularization (DER) into a Vision Transformer (ViT) core architecture. To enhance model robustness, we incorporate the proposed DER strategy, which equips the model with dual capabilities of noise resistance and cross-domain generalization.
arXiv Detail & Related papers (2025-03-27T16:39:40Z)
Reinforced Model Merging [53.84354455400038]
We present an innovative framework termed Reinforced Model Merging (RMM), which encompasses an environment and agent tailored for merging tasks. By utilizing data subsets during the evaluation process, we addressed the bottleneck in the reward feedback phase, thereby accelerating RMM by up to 100 times.
arXiv Detail & Related papers (2025-03-27T08:52:41Z)
TB-Bench: Training and Testing Multi-Modal AI for Understanding Spatio-Temporal Traffic Behaviors from Dashcam Images/Videos [17.41208629642756]
This study proposes TB-Bench, a benchmark to evaluate MLLMs on understanding traffic behaviors across eight perception tasks from ego-centric views. We also introduce vision-language instruction tuning datasets, TB-100k and TB-250k, along with simple yet effective baselines for the tasks. When fine-tuned with TB-100k or TB-250k, our baseline models achieve average accuracy up to 85%, significantly enhancing performance on the tasks.
arXiv Detail & Related papers (2025-01-10T06:02:06Z)
Data-Driven Approaches for Modelling Target Behaviour [1.5495593104596401]
The performance of tracking algorithms depends on the chosen model assumptions regarding the target dynamics. This paper provides a comparative study between three different methods that use machine learning to describe the underlying object motion.
arXiv Detail & Related papers (2024-10-14T14:18:27Z)
ProMotion: Prototypes As Motion Learners [46.08051377180652]
We introduce ProMotion, a unified prototypical framework engineered to model fundamental motion tasks. ProMotion offers a range of compelling attributes that set it apart from current task-specific paradigms. We capitalize on a dual mechanism involving the feature denoiser and the prototypical learner to decipher the intricacies of motion.
arXiv Detail & Related papers (2024-06-07T15:10:33Z)

This list is automatically generated from the titles and abstracts of the papers on this site.