Related papers: OneTrack-M: A multitask approach to transformer-based MOT models

OneTrack-M: A multitask approach to transformer-based MOT models

URL: http://arxiv.org/abs/2502.04478v1
Date: Thu, 06 Feb 2025 20:02:06 GMT
Title: OneTrack-M: A multitask approach to transformer-based MOT models
Authors: Luiz C. S. de Araujo, Carlos M. S. Figueiredo,
Abstract summary: Multi-Object Tracking (MOT) is a critical problem in computer vision.<n>OneTrack-M is a transformer-based MOT model designed to enhance tracking computational efficiency and accuracy.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multi-Object Tracking (MOT) is a critical problem in computer vision, essential for understanding how objects move and interact in videos. This field faces significant challenges such as occlusions and complex environmental dynamics, impacting model accuracy and efficiency. While traditional approaches have relied on Convolutional Neural Networks (CNNs), introducing transformers has brought substantial advancements. This work introduces OneTrack-M, a transformer-based MOT model designed to enhance tracking computational efficiency and accuracy. Our approach simplifies the typical transformer-based architecture by eliminating the need for a decoder model for object detection and tracking. Instead, the encoder alone serves as the backbone for temporal data interpretation, significantly reducing processing time and increasing inference speed. Additionally, we employ innovative data pre-processing and multitask training techniques to address occlusion and diverse objective challenges within a single set of weights. Experimental results demonstrate that OneTrack-M achieves at least 25% faster inference times compared to state-of-the-art models in the literature while maintaining or improving tracking accuracy metrics. These improvements highlight the potential of the proposed solution for real-time applications such as autonomous vehicles, surveillance systems, and robotics, where rapid responses are crucial for system effectiveness.

Related papers

SETransformer: A Hybrid Attention-Based Architecture for Robust Human Activity Recognition [7.291558599547268]
Human Activity Recognition (HAR) using wearable sensor data has become a central task in mobile computing, healthcare, and human-computer interaction.<n>We propose SETransformer, a hybrid deep neural architecture that combines Transformer-based temporal modeling with channel-wise squeeze-and-excitation (SE) attention and a learnable temporal attention pooling mechanism.<n>We evaluate SETransformer on the WISDM dataset and demonstrate that it significantly outperforms conventional models including LSTM, GRU, BiLSTM, and CNN baselines.
arXiv Detail & Related papers (2025-05-25T23:39:34Z)
Estimating Vehicle Speed on Roadways Using RNNs and Transformers: A Video-based Approach [0.0]
This project explores the application of advanced machine learning models, specifically Long Short-Term Memory (LSTM), Gated Recurrent Units (GRU), and Transformers, to the task of vehicle speed estimation using video data.
arXiv Detail & Related papers (2025-02-21T15:51:49Z)
3D Multi-Object Tracking with Semi-Supervised GRU-Kalman Filter [6.13623925528906]
3D Multi-Object Tracking (MOT) is essential for intelligent systems like autonomous driving and robotic sensing. We propose a GRU-based MOT method, which introduces a learnable Kalman filter into the motion module. This approach is able to learn object motion characteristics through data-driven learning, thereby avoiding the need for manual model design and model error.
arXiv Detail & Related papers (2024-11-13T08:34:07Z)
Optimizing Vision Transformers with Data-Free Knowledge Transfer [8.323741354066474]
Vision transformers (ViTs) have excelled in various computer vision tasks due to their superior ability to capture long-distance dependencies. We propose compressing large ViT models using Knowledge Distillation (KD), which is implemented data-free to circumvent limitations related to data availability.
arXiv Detail & Related papers (2024-08-12T07:03:35Z)
PointMT: Efficient Point Cloud Analysis with Hybrid MLP-Transformer Architecture [46.266960248570086]
This study tackles the quadratic complexity of the self-attention mechanism by introducing a complexity local attention mechanism for effective feature aggregation. We also introduce a parameter-free channel temperature adaptation mechanism that adaptively adjusts the attention weight distribution in each channel. We show that PointMT achieves performance comparable to state-of-the-art methods while maintaining an optimal balance between performance and accuracy.
arXiv Detail & Related papers (2024-08-10T10:16:03Z)
Exploring Dynamic Transformer for Efficient Object Tracking [58.120191254379854]
We propose DyTrack, a dynamic transformer framework for efficient tracking. DyTrack automatically learns to configure proper reasoning routes for various inputs, gaining better utilization of the available computational budget. Experiments on multiple benchmarks demonstrate that DyTrack achieves promising speed-precision trade-offs with only a single model.
arXiv Detail & Related papers (2024-03-26T12:31:58Z)
PNAS-MOT: Multi-Modal Object Tracking with Pareto Neural Architecture Search [64.28335667655129]
Multiple object tracking is a critical task in autonomous driving. As tracking accuracy improves, neural networks become increasingly complex, posing challenges for their practical application in real driving scenarios due to the high level of latency. In this paper, we explore the use of the neural architecture search (NAS) methods to search for efficient architectures for tracking, aiming for low real-time latency while maintaining relatively high accuracy.
arXiv Detail & Related papers (2024-03-23T04:18:49Z)
Leveraging the Power of Data Augmentation for Transformer-based Tracking [64.46371987827312]
We propose two data augmentation methods customized for tracking. First, we optimize existing random cropping via a dynamic search radius mechanism and simulation for boundary samples. Second, we propose a token-level feature mixing augmentation strategy, which enables the model against challenges like background interference.
arXiv Detail & Related papers (2023-09-15T09:18:54Z)
Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects. The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z)
TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking [74.82415271960315]
We propose a solution named TransMOT to efficiently model the spatial and temporal interactions among objects in a video. TransMOT is not only more computationally efficient than the traditional Transformer, but it also achieves better tracking accuracy. The proposed method is evaluated on multiple benchmark datasets including MOT15, MOT16, MOT17, and MOT20.
arXiv Detail & Related papers (2021-04-01T01:49:05Z)
A Unified Object Motion and Affinity Model for Online Multi-Object Tracking [127.5229859255719]
We propose a novel MOT framework that unifies object motion and affinity model into a single network, named UMA. UMA integrates single object tracking and metric learning into a unified triplet network by means of multi-task learning. We equip our model with a task-specific attention module, which is used to boost task-aware feature learning.
arXiv Detail & Related papers (2020-03-25T09:36:43Z)
AP-MTL: Attention Pruned Multi-task Learning Model for Real-time Instrument Detection and Segmentation in Robot-assisted Surgery [23.33984309289549]
Training a real-time robotic system for the detection and segmentation of high-resolution images provides a challenging problem with the limited computational resource. We develop a novel end-to-end trainable real-time Multi-Task Learning model with weight-shared encoder and task-aware detection and segmentation decoders. Our model significantly outperforms state-of-the-art segmentation and detection models, including best-performed models in the challenge.
arXiv Detail & Related papers (2020-03-10T14:24:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.