Is Multiple Object Tracking a Matter of Specialization?
- URL: http://arxiv.org/abs/2411.00553v1
- Date: Fri, 01 Nov 2024 13:03:58 GMT
- Title: Is Multiple Object Tracking a Matter of Specialization?
- Authors: Gianluca Mancusi, Mattia Bernardi, Aniello Panariello, Angelo Porrello, Rita Cucchiara, Simone Calderara,
- Abstract summary: Training end-to-end transformer-based trackers in heterogeneous scenarios poses significant challenges.
We introduce.
The Scenario-specific Tracking Architecture (PASTA), a novel framework that combines.
Efficient Fine-Tuning (PEFT) and Modular Deep Learning.
- Score: 33.59920084936913
- License:
- Abstract: End-to-end transformer-based trackers have achieved remarkable performance on most human-related datasets. However, training these trackers in heterogeneous scenarios poses significant challenges, including negative interference - where the model learns conflicting scene-specific parameters - and limited domain generalization, which often necessitates expensive fine-tuning to adapt the models to new domains. In response to these challenges, we introduce Parameter-efficient Scenario-specific Tracking Architecture (PASTA), a novel framework that combines Parameter-Efficient Fine-Tuning (PEFT) and Modular Deep Learning (MDL). Specifically, we define key scenario attributes (e.g, camera-viewpoint, lighting condition) and train specialized PEFT modules for each attribute. These expert modules are combined in parameter space, enabling systematic generalization to new domains without increasing inference time. Extensive experiments on MOTSynth, along with zero-shot evaluations on MOT17 and PersonPath22 demonstrate that a neural tracker built from carefully selected modules surpasses its monolithic counterpart. We release models and code.
Related papers
- Towards Generalizable Trajectory Prediction Using Dual-Level Representation Learning And Adaptive Prompting [107.4034346788744]
Existing vehicle trajectory prediction models struggle with generalizability, prediction uncertainties, and handling complex interactions.
We propose Perceiver with Register queries (PerReg+), a novel trajectory prediction framework that introduces: (1) Dual-Level Representation Learning via Self-Distillation (SD) and Masked Reconstruction (MR), capturing global context and fine-grained details; (2) Enhanced Multimodality using register-based queries and pretraining, eliminating the need for clustering and suppression; and (3) Adaptive Prompt Tuning during fine-tuning, freezing the main architecture and optimizing a small number of prompts for efficient adaptation.
arXiv Detail & Related papers (2025-01-08T20:11:09Z) - SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection [73.49799596304418]
This paper introduces a new task called Multi-Modal datasets and Multi-Task Object Detection (M2Det) for remote sensing.
It is designed to accurately detect horizontal or oriented objects from any sensor modality.
This task poses challenges due to 1) the trade-offs involved in managing multi-modal modelling and 2) the complexities of multi-task optimization.
arXiv Detail & Related papers (2024-12-30T02:47:51Z) - STCMOT: Spatio-Temporal Cohesion Learning for UAV-Based Multiple Object Tracking [13.269416985959404]
Multiple object tracking (MOT) in Unmanned Aerial Vehicle (UAV) videos is important for diverse applications in computer vision.
We propose a novel Spatio-Temporal Cohesion Multiple Object Tracking framework (STCMOT)
We use historical embedding features to model the representation of ReID and detection features in a sequential order.
Our framework sets a new state-of-the-art performance in MOTA and IDF1 metrics.
arXiv Detail & Related papers (2024-09-17T14:34:18Z) - SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction From Pre-Trained Foundation Models [85.67096251281191]
We present an innovative approach to model fusion called zero-shot Sparse MIxture of Low-rank Experts (SMILE) construction.
SMILE allows for the upscaling of source models into an MoE model without extra data or further training.
We conduct extensive experiments across diverse scenarios, such as image classification and text generation tasks, using full fine-tuning and LoRA fine-tuning.
arXiv Detail & Related papers (2024-08-19T17:32:15Z) - eMoE-Tracker: Environmental MoE-based Transformer for Robust Event-guided Object Tracking [9.282504639411163]
This paper proposes a novel and effective Transformer-based event-guided tracking framework, called eMoE-Tracker.
Our key idea is to disentangle the environment into several learnable attributes to dynamically learn the attribute-specific features.
Experiments on diverse event-based benchmark datasets showcase the superior performance of our eMoE-Tracker compared to the prior arts.
arXiv Detail & Related papers (2024-06-28T16:13:55Z) - TIDE: Test Time Few Shot Object Detection [11.036762620105383]
Few-shot object detection (FSOD) aims to extract semantic knowledge from limited object instances of novel categories within a target domain.
Recent advances in FSOD focus on fine-tuning the base model based on a few objects via meta-learning or data augmentation.
We formalize a novel FSOD task, referred to as Test TIme Few Shot DEtection (TIDE), where the model is un-tuned in the configuration procedure.
arXiv Detail & Related papers (2023-11-30T09:00:44Z) - Single-Model and Any-Modality for Video Object Tracking [85.83753760853142]
We introduce Un-Track, a Unified Tracker of a single set of parameters for any modality.
To handle any modality, our method learns their common latent space through low-rank factorization and reconstruction techniques.
Our Un-Track achieves +8.1 absolute F-score gain, on the DepthTrack dataset, by introducing only +2.14 (over 21.50) GFLOPs with +6.6M (over 93M) parameters.
arXiv Detail & Related papers (2023-11-27T14:17:41Z) - Parameter-efficient Tuning of Large-scale Multimodal Foundation Model [68.24510810095802]
We propose A graceful prompt framework for cross-modal transfer (Aurora) to overcome these challenges.
Considering the redundancy in existing architectures, we first utilize the mode approximation to generate 0.1M trainable parameters to implement the multimodal prompt tuning.
A thorough evaluation on six cross-modal benchmarks shows that it not only outperforms the state-of-the-art but even outperforms the full fine-tuning approach.
arXiv Detail & Related papers (2023-05-15T06:40:56Z) - Modular Multi Target Tracking Using LSTM Networks [0.0]
This paper proposes a model free end-to-end approach for airborne target tracking system using sensor measurements.
The proposed modular blocks can be independently trained and used in multitude of tracking applications.
arXiv Detail & Related papers (2020-11-16T15:58:49Z) - ePointDA: An End-to-End Simulation-to-Real Domain Adaptation Framework
for LiDAR Point Cloud Segmentation [111.56730703473411]
Training deep neural networks (DNNs) on LiDAR data requires large-scale point-wise annotations.
Simulation-to-real domain adaptation (SRDA) trains a DNN using unlimited synthetic data with automatically generated labels.
ePointDA consists of three modules: self-supervised dropout noise rendering, statistics-invariant and spatially-adaptive feature alignment, and transferable segmentation learning.
arXiv Detail & Related papers (2020-09-07T23:46:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.