Related papers: R1-Track: Direct Application of MLLMs to Visual Object Tracking via Reinforcement Learning

R1-Track: Direct Application of MLLMs to Visual Object Tracking via Reinforcement Learning

URL: http://arxiv.org/abs/2506.21980v3
Date: Tue, 22 Jul 2025 15:39:40 GMT
Title: R1-Track: Direct Application of MLLMs to Visual Object Tracking via Reinforcement Learning
Authors: Biao Wang, Wenwen Li, Jiawei Ge,
Abstract summary: Single object tracking aims to continuously localize and estimate the scale of a target in subsequent video frames.<n>Qwen2.5-VL struggles with template matching between image pairs.<n>Inspired by deep-seek-R1, we fine-tuned Qwen2.5-VL using the group relative policy optimization (GRPO) reinforcement learning method.<n>The resulting model, R1-Track, achieved notable performance on the GOT-10k benchmark.
Score: 5.59181512260003
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Visual single object tracking aims to continuously localize and estimate the scale of a target in subsequent video frames, given only its initial state in the first frame. This task has traditionally been framed as a template matching problem, evolving through major phases including correlation filters, two-stream networks, and one-stream networks with significant progress achieved. However, these methods typically require explicit classification and regression modeling, depend on supervised training with large-scale datasets, and are limited to the single task of tracking, lacking flexibility. In recent years, multi-modal large language models (MLLMs) have advanced rapidly. Open-source models like Qwen2.5-VL, a flagship MLLMs with strong foundational capabilities, demonstrate excellent performance in grounding tasks. This has spurred interest in applying such models directly to visual tracking. However, experiments reveal that Qwen2.5-VL struggles with template matching between image pairs (i.e., tracking tasks). Inspired by deepseek-R1, we fine-tuned Qwen2.5-VL using the group relative policy optimization (GRPO) reinforcement learning method on a small-scale dataset with a rule-based reward function. The resulting model, R1-Track, achieved notable performance on the GOT-10k benchmark. R1-Track supports flexible initialization via bounding boxes or text descriptions while retaining most of the original model's general capabilities. And we further discuss potential improvements for R1-Track. This rough technical report summarizes our findings as of May 2025.

Related papers

TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation [70.23578202012048]
Vision-Language Navigation (VLN) presents a unique challenge for Large Vision-Language Models (VLMs) due to their inherent architectural mismatch.<n>We propose TagaVLM (Topology-Aware Global Action reasoning), an end-to-end framework that explicitly injects topological structures into the VLM backbone.<n>To enhance topological node information, an Interleaved Navigation Prompt strengthens node-level visual-text alignment.<n>With the embedded topological graph, the model is capable of global action reasoning, allowing for robust path correction.
arXiv Detail & Related papers (2026-03-03T13:28:07Z)
OmniPT: Unleashing the Potential of Large Vision Language Models for Pedestrian Tracking and Understanding [7.191306516610944]
We propose a new unified Pedestrian Tracking framework, namely OmniPT, which can track, track based on reference and generate semantic understanding of tracked objects interactively.<n>We conduct experiments on tracking benchmarks and the experimental results demonstrate that the proposed method can perform better than the previous methods.
arXiv Detail & Related papers (2025-11-21T08:54:49Z)
Detect Anything via Next Point Prediction [51.55967987350882]
Rex- Omni is a 3B-scale MLLM that achieves state-of-the-art object perception performance.<n>On benchmarks like COCO and LVIS, Rex- Omni attains performance comparable to or exceeding regression-based models.
arXiv Detail & Related papers (2025-10-14T17:59:54Z)
CronusVLA: Transferring Latent Motion Across Time for Multi-Frame Prediction in Manipulation [67.1520483301709]
CronusVLA is a unified framework that extends single-frame VLA models to the multi-frame paradigm through an efficient post-training stage.<n>CronusVLA achieves state-of-the-art performance on SimplerEnv with 70.9% success rate, and 12.7% improvement over OpenVLA on LIBERO.
arXiv Detail & Related papers (2025-06-24T17:30:27Z)
Observe-R1: Unlocking Reasoning Abilities of MLLMs with Dynamic Progressive Reinforcement Learning [3.364797975300393]
We present Observe-R1, a novel framework aimed at enhancing the reasoning capabilities of multimodal large language models (MLLMs)<n>We construct the NeuraLadder dataset, which is organized and sampled according to the difficulty and complexity of data samples for RL training.<n>Experiments with the Qwen2.5-VL-3B and Qwen2.5-VL-7B models on 20k samples from the NeuraLadder dataset show that Observe-R1 outperforms a series of larger reasoning models on both reasoning and general benchmarks.
arXiv Detail & Related papers (2025-05-18T14:08:03Z)
Compile Scene Graphs with Reinforcement Learning [69.36723767339001]
Next-token prediction is the fundamental principle for training large language models (LLMs)<n>We introduce R1-SGG, a multimodal LLM (M-LLM) initially trained via supervised fine-tuning (SFT) on the scene graph dataset.<n>We design a set of graph-centric rewards, including three recall-based variants -- Hard Recall, Hard Recall+Relax, and Soft Recall.
arXiv Detail & Related papers (2025-04-18T10:46:22Z)
NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models [38.41524186248607]
We introduce NV-Embed, incorporating architectural designs, training procedures, and curated datasets.<n>For model architecture, we propose a latent attention layer to obtain pooled embeddings.<n>For training algorithm, we introduce a two-stage contrastive instruction-tuning method.
arXiv Detail & Related papers (2024-05-27T17:59:45Z)
Towards Unified Token Learning for Vision-Language Tracking [65.96561538356315]
We present a vision-language (VL) tracking pipeline, termed textbfMMTrack, which casts VL tracking as a token generation task. Our proposed framework serializes language description and bounding box into a sequence of discrete tokens. In this new design paradigm, all token queries are required to perceive the desired target and directly predict spatial coordinates of the target.
arXiv Detail & Related papers (2023-08-27T13:17:34Z)
Simple Cues Lead to a Strong Multi-Object Tracker [3.7189423451031356]
We propose a new type of tracking-by-detection (TbD) for Multi-Object Tracking. We show that a combination of our appearance features with a simple motion model leads to strong tracking results. Our tracker generalizes to four public datasets, namely MOT17, MOT20, BDD100k, and DanceTrack, achieving state-of-the-art performance.
arXiv Detail & Related papers (2022-06-09T17:55:51Z)
Joint Feature Learning and Relation Modeling for Tracking: A One-Stream Framework [76.70603443624012]
We propose a novel one-stream tracking (OSTrack) framework that unifies feature learning and relation modeling. In this way, discriminative target-oriented features can be dynamically extracted by mutual guidance. OSTrack achieves state-of-the-art performance on multiple benchmarks, in particular, it shows impressive results on the one-shot tracking benchmark GOT-10k.
arXiv Detail & Related papers (2022-03-22T18:37:11Z)
Model-Agnostic Multitask Fine-tuning for Few-shot Vision-Language Transfer Learning [59.38343286807997]
We propose Model-Agnostic Multitask Fine-tuning (MAMF) for vision-language models on unseen tasks. Compared with model-agnostic meta-learning (MAML), MAMF discards the bi-level optimization and uses only first-order gradients. We show that MAMF consistently outperforms the classical fine-tuning method for few-shot transfer learning on five benchmark datasets.
arXiv Detail & Related papers (2022-03-09T17:26:53Z)
Convolutional Tensor-Train LSTM for Spatio-temporal Learning [116.24172387469994]
We propose a higher-order LSTM model that can efficiently learn long-term correlations in the video sequence. This is accomplished through a novel tensor train module that performs prediction by combining convolutional features across time. Our results achieve state-of-the-art performance-art in a wide range of applications and datasets.
arXiv Detail & Related papers (2020-02-21T05:00:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.