ReferGPT: Towards Zero-Shot Referring Multi-Object Tracking
- URL: http://arxiv.org/abs/2504.09195v1
- Date: Sat, 12 Apr 2025 12:33:15 GMT
- Title: ReferGPT: Towards Zero-Shot Referring Multi-Object Tracking
- Authors: Tzoulio Chamiti, Leandro Di Bella, Adrian Munteanu, Nikos Deligiannis
- Abstract summary: ReferGPT is a novel zero-shot referring multi-object tracking framework. We provide a multi-modal large language model (MLLM) with spatial knowledge, enabling it to generate 3D-aware captions. We also propose a robust query-matching strategy, leveraging CLIP-based semantic encoding and fuzzy matching to associate MLLM-generated captions with user queries.
- Score: 17.736434513456576
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Tracking multiple objects based on textual queries is a challenging task that requires linking language understanding with object association across frames. Previous works typically train the whole process end-to-end or integrate an additional referring-text module into a multi-object tracker, but both require supervised training and potentially struggle to generalize to open-set queries. In this work, we introduce ReferGPT, a novel zero-shot referring multi-object tracking framework. We provide a multi-modal large language model (MLLM) with spatial knowledge, enabling it to generate 3D-aware captions. This enhances its descriptive capabilities and supports a more flexible referring vocabulary without training. We also propose a robust query-matching strategy, leveraging CLIP-based semantic encoding and fuzzy matching to associate MLLM-generated captions with user queries. Extensive experiments on Refer-KITTI, Refer-KITTIv2 and Refer-KITTI+ demonstrate that ReferGPT achieves competitive performance against trained methods, showcasing its robustness and zero-shot capabilities in autonomous driving. The code is available at https://github.com/Tzoulio/ReferGPT
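As a rough illustration of the query-matching strategy described in the abstract, the sketch below scores MLLM-generated captions against a user query by fusing CLIP text-embedding similarity with character-level fuzzy matching (here via Python's stdlib difflib and the Hugging Face transformers CLIP model). The specific checkpoint, fusion weight, and decision threshold are assumptions for illustration, not details from the paper.

```python
# Minimal sketch of ReferGPT-style query matching (illustrative, not the
# authors' implementation): fuse CLIP text-embedding similarity with
# character-level fuzzy matching to decide which MLLM-generated captions
# match a user query. Model choice, fusion weight w_sem, and threshold
# are assumptions.
from difflib import SequenceMatcher

import torch
from transformers import CLIPModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()


@torch.no_grad()
def clip_text_embed(texts: list[str]) -> torch.Tensor:
    """Embed strings with CLIP's text encoder, L2-normalized per row."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)


def match_query(query: str, track_captions: dict[int, str],
                w_sem: float = 0.7, thresh: float = 0.6) -> list[int]:
    """Return the track ids whose caption is judged to match the query."""
    ids, captions = zip(*track_captions.items())
    # Cosine similarity between each caption embedding and the query embedding.
    semantic = (clip_text_embed(list(captions)) @ clip_text_embed([query]).T).squeeze(-1)
    # Character-level fuzzy matching score in [0, 1].
    fuzzy = torch.tensor(
        [SequenceMatcher(None, query.lower(), c.lower()).ratio() for c in captions]
    )
    score = w_sem * semantic + (1.0 - w_sem) * fuzzy  # simple linear fusion
    return [i for i, s in zip(ids, score.tolist()) if s >= thresh]


# Example: captions an MLLM might produce for three tracked objects.
tracks = {
    1: "a red car moving away from the camera in the right lane",
    2: "a pedestrian standing on the left sidewalk",
    3: "a silver car parked on the left side of the road",
}
print(match_query("the cars which are parked", tracks))  # should favor track 3
```

Combining the two signals hedges against the failure modes of either one alone: embedding similarity tolerates paraphrases, while fuzzy matching catches exact attribute words that the embedding may underweight.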
Related papers
- Cognitive Disentanglement for Referring Multi-Object Tracking [28.325814292139686]
We propose a Cognitive Disentanglement for Referring Multi-Object Tracking (CDRMT) framework.
CDRMT adapts the "what" and "where" pathways from the human visual processing system to RMOT tasks.
Experiments on different benchmark datasets demonstrate that CDRMT achieves substantial improvements over state-of-the-art methods.
arXiv Detail & Related papers (2025-03-14T15:21:54Z)
- DSV-LFS: Unifying LLM-Driven Semantic Cues with Visual Features for Robust Few-Shot Segmentation [2.7624021966289605]
Few-shot semantic segmentation (FSS) aims to enable models to segment novel/unseen object classes using only a limited number of labeled examples. We propose a novel framework that utilizes large language models (LLMs) to adapt general class semantic information to the query image. Our framework achieves state-of-the-art performance by a significant margin, demonstrating superior generalization to novel classes and robustness across diverse scenarios.
arXiv Detail & Related papers (2025-03-06T01:42:28Z)
- ClawMachine: Learning to Fetch Visual Tokens for Referential Comprehension [71.03445074045092]
We propose ClawMachine, a new methodology that explicitly notates each entity using token collectives, i.e., groups of visual tokens.
Our method unifies the prompt and answer of visual referential tasks without using additional syntax.
ClawMachine achieves superior performance on scene-level and referential understanding tasks with higher efficiency.
arXiv Detail & Related papers (2024-06-17T08:39:16Z)
- Bootstrapping Referring Multi-Object Tracking [14.46285727127232]
Referring multi-object tracking (RMOT) aims at detecting and tracking multiple objects according to a human instruction represented by a natural language expression.
Our key idea is to bootstrap the task of referring multi-object tracking by introducing discriminative language words.
arXiv Detail & Related papers (2024-06-07T16:02:10Z)
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL).
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves performance comparable to the state of the art while being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- Pre-training Cross-lingual Open Domain Question Answering with Large-scale Synthetic Supervision [44.04243892727856]
Cross-lingual open domain question answering (CLQA) is a complex problem.
We show that CLQA can be addressed using a single encoder-decoder model.
We propose a self-supervised method based on exploiting the cross-lingual link structure within Wikipedia.
arXiv Detail & Related papers (2024-02-26T11:42:29Z)
- Learning to Prompt with Text Only Supervision for Vision-Language Models [107.282881515667]
One branch of methods adapts CLIP by learning prompts using visual information.
An alternative approach resorts to training-free methods by generating class descriptions from large language models.
We propose to combine the strengths of both streams by learning prompts using only text data.
arXiv Detail & Related papers (2024-01-04T18:59:49Z)
- Referring Multi-Object Tracking [78.63827591797124]
We propose a new and general referring understanding task, termed referring multi-object tracking (RMOT).
Its core idea is to employ a language expression as a semantic cue to guide the prediction of multi-object tracking.
To the best of our knowledge, it is the first work to achieve an arbitrary number of referent object predictions in videos.
arXiv Detail & Related papers (2023-03-06T18:50:06Z)
- End-to-end Tracking with a Multi-query Transformer [96.13468602635082]
Multiple-object tracking (MOT) is a challenging task that requires simultaneous reasoning about location, appearance, and identity of the objects in the scene over time.
Our aim in this paper is to move beyond tracking-by-detection approaches toward class-agnostic tracking that also performs well for unknown object classes.
arXiv Detail & Related papers (2022-10-26T10:19:37Z)
- Multi-Modal Few-Shot Object Detection with Meta-Learning-Based Cross-Modal Prompting [77.69172089359606]
We study multi-modal few-shot object detection (FSOD) in this paper, using both few-shot visual examples and class semantic information for detection.
Our approach is motivated by the high-level conceptual similarity of (metric-based) meta-learning and prompt-based learning.
We comprehensively evaluate the proposed multi-modal FSOD models on multiple few-shot object detection benchmarks, achieving promising results.
arXiv Detail & Related papers (2022-04-16T16:45:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.