Vision-Motion-Reference Alignment for Referring Multi-Object Tracking via Multi-Modal Large Language Models
- URL: http://arxiv.org/abs/2511.17681v1
- Date: Fri, 21 Nov 2025 08:53:31 GMT
- Title: Vision-Motion-Reference Alignment for Referring Multi-Object Tracking via Multi-Modal Large Language Models
- Authors: Weiyi Lv, Ning Zhang, Hanyang Sun, Haoran Jiang, Kai Zhao, Jing Xiao, Dan Zeng
- Abstract summary: We propose a novel Vision-Motion-Reference aligned RMOT framework, named VMRMOT. It integrates a motion modality extracted from object dynamics to enhance the alignment between the vision modality and language references. To the best of our knowledge, VMRMOT is the first approach to employ MLLMs in the RMOT task for vision-reference alignment.
- Score: 29.330083952817997
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring Multi-Object Tracking (RMOT) extends conventional multi-object tracking (MOT) by introducing natural language references for multi-modal fusion tracking. Existing RMOT benchmarks describe only an object's appearance, relative position, and initial motion state. Such static references fail to capture dynamic changes in object motion, including velocity changes and shifts in motion direction. This limitation not only causes a temporal discrepancy between the static references and the dynamic vision modality but also constrains multi-modal tracking performance. To address it, we propose a novel Vision-Motion-Reference aligned RMOT framework, named VMRMOT. It integrates a motion modality extracted from object dynamics to enhance the alignment between the vision modality and language references through multi-modal large language models (MLLMs). Specifically, we introduce motion-aware descriptions derived from object dynamic behaviors and, leveraging the powerful temporal-reasoning capabilities of MLLMs, extract motion features as the motion modality. We further design a Vision-Motion-Reference Alignment (VMRA) module to hierarchically align visual queries with motion and reference cues, enhancing their cross-modal consistency. In addition, a Motion-Guided Prediction Head (MGPH) is developed to exploit the motion modality in the prediction head. To the best of our knowledge, VMRMOT is the first approach to employ MLLMs in the RMOT task for vision-reference alignment. Extensive experiments on multiple RMOT benchmarks demonstrate that VMRMOT outperforms existing state-of-the-art methods.
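The abstract ships no code, but the hierarchical alignment it describes maps naturally onto stacked cross-attention. Below is a minimal PyTorch sketch of that reading of VMRA; the module name, the two-stage ordering (motion first, reference second), and all dimensions are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of the hierarchical alignment idea in VMRA: visual
# queries first attend to motion features, then to reference features.
# Module and parameter names are invented for illustration.
import torch
import torch.nn as nn

class VMRAlignSketch(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        # Stage 1: align visual queries with the motion modality.
        self.motion_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Stage 2: align the motion-enhanced queries with language references.
        self.ref_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, vis_q, motion_feats, ref_feats):
        # vis_q:        (B, N_queries, d_model) visual object queries
        # motion_feats: (B, N_motion,  d_model) MLLM-derived motion features
        # ref_feats:    (B, N_tokens,  d_model) encoded language reference
        m, _ = self.motion_attn(vis_q, motion_feats, motion_feats)
        q = self.norm1(vis_q + m)            # residual motion alignment
        r, _ = self.ref_attn(q, ref_feats, ref_feats)
        return self.norm2(q + r)             # motion- and reference-aligned queries

# Smoke test with random tensors.
align = VMRAlignSketch()
out = align(torch.randn(2, 30, 256), torch.randn(2, 8, 256), torch.randn(2, 16, 256))
print(out.shape)  # torch.Size([2, 30, 256])
```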
Related papers
- LLMTrack: Semantic Multi-Object Tracking with Multi-modal Large Language Models [7.6967194010564235]
We propose LLMTrack, a novel end-to-end framework for Semantic Multi-Object Tracking (SMOT). We adopt a bionic design philosophy that decouples strong localization from deep understanding, utilizing Grounding DINO as the eyes and the LLaVA-OneVision multimodal large model as the brain.
arXiv Detail & Related papers (2026-01-10T12:18:12Z)
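As a rough illustration of the decoupled "eyes/brain" design described above, here is a hedged Python sketch; `detect` and `describe` are hypothetical stand-ins for Grounding DINO and LLaVA-OneVision, and the `Track` record is invented for the example.

```python
# Illustrative decoupled pipeline in the spirit of LLMTrack: a detector
# localizes, an MLLM attaches semantics. Real association logic is omitted;
# a fresh id per detection is a stub, not actual tracking.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Track:
    track_id: int
    box: tuple          # (x1, y1, x2, y2)
    semantics: str      # natural-language description from the MLLM

def track_frame(frame, detect: Callable, describe: Callable,
                next_id: List[int]) -> List[Track]:
    """Localize with the 'eyes', then let the 'brain' attach semantics."""
    tracks = []
    for box in detect(frame):                  # strong localization
        crop = frame                           # in practice: crop the box region
        tracks.append(Track(next_id[0], box, describe(crop, box)))
        next_id[0] += 1
    return tracks

# Stub components so the sketch runs end to end.
fake_detect = lambda f: [(10, 10, 50, 80)]
fake_describe = lambda f, b: "person in red jacket walking left"
print(track_frame(object(), fake_detect, fake_describe, [0]))
```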
- MotionVerse: A Unified Multimodal Framework for Motion Comprehension, Generation and Editing [53.98607267063729]
MotionVerse is a framework to comprehend, generate, and edit human motion in both single-person and multi-person scenarios. We employ a motion tokenizer with residual quantization, which converts continuous motion sequences into multi-stream discrete tokens. We also introduce a Delay Parallel Modeling strategy, which temporally staggers the encoding of residual token streams.
arXiv Detail & Related papers (2025-09-28T04:20:56Z)
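The two mechanisms named above, residual quantization into multi-stream tokens and delay-parallel staggering, can be sketched in a few lines. Everything below, including codebook sizes and the padding convention, is an illustrative assumption rather than MotionVerse's actual implementation.

```python
# Sketch of residual quantization into multi-stream tokens plus a
# delay-parallel staggering of the resulting streams.
import torch

def residual_quantize(x, codebooks):
    """x: (T, d). Returns one token stream per codebook (list of (T,) indices)."""
    streams, residual = [], x
    for cb in codebooks:                          # cb: (K, d)
        dists = torch.cdist(residual, cb)         # (T, K) pairwise distances
        idx = dists.argmin(dim=1)                 # nearest code per step
        streams.append(idx)
        residual = residual - cb[idx]             # quantize the leftover
    return streams

def delay_parallel(streams, pad_id=-1):
    """Stagger stream s by s steps so coarse tokens are decoded first."""
    T = streams[0].shape[0]
    out = torch.full((len(streams), T + len(streams) - 1), pad_id)
    for s, stream in enumerate(streams):
        out[s, s:s + T] = stream
    return out

x = torch.randn(6, 16)                             # 6-frame motion, 16-d latents
codebooks = [torch.randn(32, 16) for _ in range(3)]  # 3 residual codebooks
print(delay_parallel(residual_quantize(x, codebooks)))
```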
- V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models [84.27290155010533]
We introduce Vision-centric Multiple Abilities Game Evaluation (V-MAGE), a novel game-based evaluation framework. V-MAGE features five distinct video games comprising over 30 carefully constructed evaluation scenarios. We show V-MAGE provides actionable insights for improving the visual and reasoning capabilities of MLLMs in dynamic, interactive settings.
arXiv Detail & Related papers (2025-04-08T15:43:01Z)
- C-Drag: Chain-of-Thought Driven Motion Controller for Video Generation [81.4106601222722]
Trajectory-based motion control has emerged as an intuitive and efficient approach for controllable video generation. We propose a Chain-of-Thought-based motion controller for controllable video generation, named C-Drag. Our method includes an object perception module and a Chain-of-Thought-based motion reasoning module.
arXiv Detail & Related papers (2025-02-27T08:21:03Z)
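One plausible shape for the Chain-of-Thought motion-reasoning step is to serialize object trajectories into a staged prompt for the reasoning model. The sketch below is purely illustrative; the prompt wording and the `cot_motion_prompt` helper are invented, not taken from C-Drag.

```python
# Hedged sketch of chain-of-thought motion reasoning over drag trajectories:
# trajectories are summarized step by step before interactions are inferred.
def cot_motion_prompt(objects, trajectories):
    steps = ["Step 1: list the perceived objects and their roles.",
             f"  Objects: {', '.join(objects)}",
             "Step 2: describe each input trajectory."]
    for name, traj in zip(objects, trajectories):
        dx = traj[-1][0] - traj[0][0]       # net horizontal displacement
        dy = traj[-1][1] - traj[0][1]       # net vertical displacement
        steps.append(f"  {name}: moves ({dx:+d}, {dy:+d}) px over {len(traj)} keypoints.")
    steps.append("Step 3: infer interactions and produce per-object motion plans.")
    return "\n".join(steps)

print(cot_motion_prompt(["ball", "cup"], [[(0, 0), (40, -5)], [(10, 10), (10, 10)]]))
```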
- VersatileMotion: A Unified Framework for Motion Synthesis and Comprehension [26.172040706657235]
We introduce VersatileMotion, a unified motion LLM that combines a novel motion tokenizer, integrating VQ-VAE with flow matching, and an autoregressive transformer backbone. VersatileMotion is the first method to handle single-agent and multi-agent motions in a single framework, achieving state-of-the-art performance on seven of these tasks.
arXiv Detail & Related papers (2024-11-26T11:28:01Z)
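The VQ-VAE side of a motion tokenizer like the one described can be sketched with a standard nearest-code lookup and straight-through estimator; the flow-matching component is omitted here, and all sizes below are assumptions.

```python
# Minimal sketch of the VQ step a motion tokenizer might use, with the
# standard straight-through gradient estimator; dims are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionVQ(nn.Module):
    def __init__(self, num_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):
        # z: (B, T, dim) continuous motion latents from an encoder.
        codes = self.codebook.weight                       # (num_codes, dim)
        d = torch.cdist(z, codes.unsqueeze(0).expand(z.size(0), -1, -1))
        idx = d.argmin(dim=-1)                             # discrete motion tokens
        z_q = self.codebook(idx)
        # Straight-through: copy gradients from z_q to z for the encoder.
        z_st = z + (z_q - z).detach()
        commit = F.mse_loss(z, z_q.detach())               # commitment loss term
        return z_st, idx, commit

vq = MotionVQ()
z_q, tokens, loss = vq(torch.randn(2, 8, 64))
print(tokens.shape, float(loss))                           # torch.Size([2, 8]) ...
```

An autoregressive transformer would then be trained to predict these token indices next-token style, which is what lets one backbone cover both synthesis and comprehension.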
- MotionGPT-2: A General-Purpose Motion-Language Model for Motion Generation and Understanding [76.30210465222218]
MotionGPT-2 is a unified Large Motion-Language Model (LMLM). It supports multimodal control conditions through pre-trained Large Language Models (LLMs) and is highly adaptable to the challenging 3D holistic motion generation task.
arXiv Detail & Related papers (2024-10-29T05:25:34Z)
- MoConVQ: Unified Physics-Based Motion Control via Scalable Discrete Representations [25.630268570049708]
MoConVQ is a novel unified framework for physics-based motion control leveraging scalable discrete representations.
Our approach effectively learns motion embeddings from a large, unstructured dataset spanning tens of hours of motion examples.
arXiv Detail & Related papers (2023-10-16T09:09:02Z)
- DiverseMotion: Towards Diverse Human Motion Generation via Discrete Diffusion [70.33381660741861]
We present DiverseMotion, a new approach for synthesizing high-quality human motions conditioned on textual descriptions.
We show that DiverseMotion achieves state-of-the-art motion quality and competitive motion diversity.
arXiv Detail & Related papers (2023-09-04T05:43:48Z)
- MotionTrack: Learning Motion Predictor for Multiple Object Tracking [68.68339102749358]
We introduce a novel motion-based tracker, MotionTrack, centered around a learnable motion predictor.
Our experimental results demonstrate that MotionTrack yields state-of-the-art performance on datasets such as DanceTrack and SportsMOT.
arXiv Detail & Related papers (2023-06-05T04:24:11Z)
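Since MotionTrack centers on a learnable motion predictor, a minimal sketch is a small recurrent net that forecasts the next box from a short track history, used where a Kalman prior would normally sit. The GRU architecture and box parameterization below are assumptions, not the paper's design.

```python
# Sketch of a learnable motion predictor for MOT: a small GRU forecasts
# the next box from a short history of past boxes.
import torch
import torch.nn as nn

class BoxMotionPredictor(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.gru = nn.GRU(input_size=4, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 4)    # predicts (dx, dy, dw, dh)

    def forward(self, box_history):
        # box_history: (B, T, 4) past boxes as (cx, cy, w, h)
        out, _ = self.gru(box_history)
        delta = self.head(out[:, -1])       # offset from the last observed box
        return box_history[:, -1] + delta   # predicted next box

pred = BoxMotionPredictor()
hist = torch.randn(3, 5, 4)                 # 3 tracks, 5-frame history
print(pred(hist).shape)                     # torch.Size([3, 4])
```

In an association step, these predicted boxes would be matched to new detections (e.g., by IoU) in place of Kalman-filter predictions.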
- Interactive Character Control with Auto-Regressive Motion Diffusion Models [18.727066177880708]
We propose A-MDM (Auto-regressive Motion Diffusion Model) for real-time motion synthesis.
Our conditional diffusion model takes an initial pose as input and auto-regressively generates successive motion frames conditioned on the previous frame.
We introduce a suite of techniques for incorporating interactive controls into A-MDM, such as task-oriented sampling, in-painting, and hierarchical reinforcement learning.
arXiv Detail & Related papers (2023-06-01T07:48:34Z)
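To make the auto-regressive, frame-conditioned sampling concrete, here is a deliberately simplified rollout loop; the denoising update is a toy stand-in (not a faithful DDPM/DDIM step) and the stub denoiser is invented for the example.

```python
# Hedged sketch of an A-MDM-style rollout: each new frame is sampled by a
# denoising loop conditioned on the previous frame. The update rule below
# is a crude illustration; the real model uses a learned network and a
# proper diffusion sampler.
import torch

def sample_next_frame(denoiser, prev_frame, n_steps=10):
    x = torch.randn_like(prev_frame)                 # start from noise
    for t in reversed(range(n_steps)):               # simplified reverse loop
        t_frac = torch.tensor(t / n_steps)
        eps = denoiser(x, prev_frame, t_frac)        # predict the noise
        x = x - eps / n_steps                        # toy denoising update
    return x

def rollout(denoiser, init_pose, n_frames=4):
    frames, prev = [init_pose], init_pose
    for _ in range(n_frames):                        # frame-by-frame generation
        prev = sample_next_frame(denoiser, prev)
        frames.append(prev)
    return torch.stack(frames)

stub_denoiser = lambda x, cond, t: 0.1 * (x - cond)  # toy stand-in network
print(rollout(stub_denoiser, torch.zeros(24)).shape) # torch.Size([5, 24])
```

Interactive controls such as task-oriented sampling or in-painting would intervene inside this loop, steering individual frames as they are denoised.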
This list is automatically generated from the titles and abstracts of the papers on this site.