Multi-Object Tracking Retrieval with LLaVA-Video: A Training-Free Solution to MOT25-StAG Challenge
- URL: http://arxiv.org/abs/2511.03332v1
- Date: Wed, 05 Nov 2025 10:01:31 GMT
- Title: Multi-Object Tracking Retrieval with LLaVA-Video: A Training-Free Solution to MOT25-StAG Challenge
- Authors: Yi Yang, Yiming Xu, Timo Kaiser, Hao Cheng, Bodo Rosenhahn, Michael Ying Yang
- Abstract summary: The aim of this challenge is to accurately localize and track multiple objects that match specific, free-form language queries. We model the underlying task as a video retrieval problem and present a two-stage, zero-shot approach. On the MOT25-StAG test set, our method achieves m-HIoU and HOTA scores of 20.68 and 10.73 respectively, earning second place in the challenge.
- Score: 42.013930541762484
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this report, we present our solution to the MOT25-Spatiotemporal Action Grounding (MOT25-StAG) Challenge. The aim of this challenge is to accurately localize and track multiple objects that match specific, free-form language queries, using video data of complex real-world scenes as input. We model the underlying task as a video retrieval problem and present a two-stage, zero-shot approach that combines the strengths of the state-of-the-art tracking model FastTracker and the multi-modal large language model LLaVA-Video. On the MOT25-StAG test set, our method achieves m-HIoU and HOTA scores of 20.68 and 10.73 respectively, earning second place in the challenge.
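The two-stage retrieval pipeline described in the abstract can be sketched as follows. This is an illustrative sketch, not the authors' implementation: stage 1 assumes a tracker (FastTracker in the paper) has already produced candidate tracklets, and stage 2 scores each tracklet against the free-form query, standing in for the LLaVA-Video relevance judgment. The `Tracklet` class, the `score_against_query` stub (simple token overlap rather than a real MLLM call), and all parameter names are hypothetical.

```python
# Hedged sketch of a two-stage, zero-shot tracklet-retrieval pipeline.
# Stage 1 (tracking) is assumed done; stage 2 ranks tracklets by a
# query-relevance score. The scorer below is a crude token-overlap
# stand-in for an MLLM such as LLaVA-Video.

from dataclasses import dataclass


@dataclass
class Tracklet:
    track_id: int
    frames: list   # frame indices covered by this track
    caption: str   # textual description of the tracklet's content


def score_against_query(tracklet: Tracklet, query: str) -> float:
    # Placeholder relevance score in [0, 1]: fraction of query tokens
    # that also appear in the tracklet caption. A real system would
    # query a video MLLM here instead.
    cap_tokens = set(tracklet.caption.lower().split())
    query_tokens = set(query.lower().split())
    return len(cap_tokens & query_tokens) / max(len(query_tokens), 1)


def retrieve(tracklets: list, query: str, top_k: int = 2) -> list:
    # Rank all tracklets by relevance and return the top-k track IDs.
    ranked = sorted(tracklets,
                    key=lambda t: score_against_query(t, query),
                    reverse=True)
    return [t.track_id for t in ranked[:top_k]]
```

Usage: given tracklets captioned "person in red jacket running" and "car parked near curb", the query "person running in a red jacket" ranks the first tracklet highest.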
Related papers
- Perception Test 2025: Challenge Summary and a Unified VQA Extension [56.23039846339896]
The third Perception Test challenge was organised as a full-day workshop alongside the IEEE/CVF International Conference on Computer Vision (ICCV) 2025. Its primary goal is to benchmark state-of-the-art video models and measure progress in multimodal perception. We summarise the results from the main Perception Test challenge, detailing both the existing tasks and novel additions to the benchmark.
arXiv Detail & Related papers (2026-01-09T20:02:21Z) - Perception Test 2024: Challenge Summary and a Novel Hour-Long VideoQA Benchmark [64.16672247204997]
We organised the second Perception Test challenge as a half-day workshop alongside the IEEE/CVF European Conference on Computer Vision (ECCV) 2024. The goal was to benchmark state-of-the-art video models and measure progress since last year using the Perception Test benchmark. This year, the challenge had seven tracks and covered low-level and high-level tasks, with language and non-language interfaces, across video, audio, and text modalities. An additional track covered hour-long video understanding and introduced a novel video QA benchmark, 1h-walk VQA.
arXiv Detail & Related papers (2024-11-29T18:57:25Z) - PVUW 2024 Challenge on Complex Video Understanding: Methods and Results [199.5593316907284]
We add two new tracks: a Complex Video Object track based on the MOSE dataset and a Motion Expression guided Video track based on the MeViS dataset.
In the two new tracks, we provide additional videos and annotations that feature challenging elements.
These new videos, sentences, and annotations enable us to foster the development of a more comprehensive and robust pixel-level understanding of video scenes.
arXiv Detail & Related papers (2024-06-24T17:38:58Z) - 1st Place Solution for MeViS Track in CVPR 2024 PVUW Workshop: Motion Expression guided Video Segmentation [81.50620771207329]
We investigate the effectiveness of static-dominant data and frame sampling on referring video object segmentation (RVOS).
Our solution achieves a J&F score of 0.5447 in the competition phase and ranks 1st in the MeViS track of the PVUW Challenge.
arXiv Detail & Related papers (2024-06-11T08:05:26Z) - SoccerNet 2023 Tracking Challenge -- 3rd place MOT4MOT Team Technical Report [0.552480439325792]
The SoccerNet 2023 tracking challenge requires the detection and tracking of soccer players and the ball.
We employ a state-of-the-art online multi-object tracker and a contemporary object detector for player tracking.
Our method achieves 3rd place on the SoccerNet 2023 tracking challenge with a HOTA score of 66.27.
arXiv Detail & Related papers (2023-08-31T11:51:16Z) - GroundNLQ @ Ego4D Natural Language Queries Challenge 2023 [73.12670280220992]
To accurately ground in a video, an effective egocentric feature extractor and a powerful grounding model are required.
We leverage a two-stage pre-training strategy to train egocentric feature extractors and the grounding model on video narrations.
In addition, we introduce a novel grounding model GroundNLQ, which employs a multi-modal multi-scale grounding module.
arXiv Detail & Related papers (2023-06-27T07:27:52Z) - Multiple Object Tracking Challenge Technical Report for Team MT_IoT [41.88133094982688]
We treat the MOT task as a two-stage task including human detection and trajectory matching.
Specifically, we designed an improved human detector and associated most of the detections to guarantee the integrity of the motion trajectories.
Without any model merging, our method achieves 66.672 HOTA and 93.971 MOTA on the DanceTrack challenge dataset.
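The two-stage detection-plus-trajectory-matching scheme described above can be sketched with a simple greedy IoU association step. The team's actual matcher is not specified in the abstract, so this is an illustrative baseline, not their method; the function names `iou` and `greedy_match` and the threshold value are assumptions.

```python
# Hedged sketch: greedy IoU association between existing track boxes and
# new detections, one common instantiation of the trajectory-matching
# stage in two-stage MOT pipelines.

def iou(a, b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    if inter == 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def greedy_match(tracks, detections, thresh=0.3):
    # Pair each track with at most one detection, highest IoU first;
    # pairs below the threshold are left unmatched.
    pairs = sorted(
        ((iou(t, d), ti, di)
         for ti, t in enumerate(tracks)
         for di, d in enumerate(detections)),
        reverse=True)
    used_t, used_d, matches = set(), set(), []
    for score, ti, di in pairs:
        if score < thresh or ti in used_t or di in used_d:
            continue
        used_t.add(ti)
        used_d.add(di)
        matches.append((ti, di))
    return matches
```

In practice, greedy association is often replaced by optimal assignment (e.g. the Hungarian algorithm) and augmented with motion or appearance cues; the sketch only illustrates the matching stage's role.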
arXiv Detail & Related papers (2022-12-07T12:00:51Z) - AIM 2020 Challenge on Video Temporal Super-Resolution [118.46127362093135]
This paper reports on the second AIM challenge on Video Temporal Super-Resolution (VTSR).
arXiv Detail & Related papers (2020-09-28T00:10:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.