DeTrack: In-model Latent Denoising Learning for Visual Object Tracking
- URL: http://arxiv.org/abs/2501.02467v1
- Date: Sun, 05 Jan 2025 07:28:50 GMT
- Title: DeTrack: In-model Latent Denoising Learning for Visual Object Tracking
- Authors: Xinyu Zhou, Jinglun Li, Lingyi Hong, Kaixun Jiang, Pinxue Guo, Weifeng Ge, Wenqiang Zhang
- Abstract summary: We propose a new paradigm that formulates the visual object tracking problem as a denoising learning process.
Inspired by the diffusion model, denoising learning enhances the model's robustness to unseen data.
We introduce noise to bounding boxes, generating noisy boxes for training and thus enhancing model robustness on test data.
- Score: 24.993508502786998
- Abstract: Previous visual object tracking methods employ image-feature regression models or coordinate autoregression models for bounding box prediction. Image-feature regression methods depend heavily on matching results and do not exploit positional priors, while autoregressive approaches can only be trained on the bounding boxes available in the training set, potentially leading to suboptimal performance on unseen test data. Inspired by the diffusion model, denoising learning enhances a model's robustness to unseen data. We therefore introduce noise to bounding boxes, generating noisy boxes for training and thus improving robustness on test data, and we propose a new paradigm that formulates visual object tracking as a denoising learning process. However, tracking algorithms are usually required to run in real time, and directly applying the diffusion model to object tracking would severely impair tracking speed. We therefore decompose the denoising process across the denoising blocks within a single model, rather than running the model multiple times, and summarize the proposed paradigm as an in-model latent denoising learning process. Specifically, we propose a denoising Vision Transformer (ViT) composed of multiple denoising blocks, into each of which template and search embeddings are projected as conditions. Each denoising block is responsible for removing part of the noise in a predicted bounding box, and the stacked blocks together accomplish the whole denoising process. Subsequently, we utilize image features and trajectory information to refine the denoised bounding box, and we further employ trajectory memory and visual memory to improve tracking stability. Experimental results validate the effectiveness of our approach, achieving competitive performance on several challenging datasets.
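The sketch below illustrates the in-model latent denoising idea described in the abstract: ground-truth boxes are perturbed with noise for training, a stack of denoising blocks conditioned on template and search embeddings refines a latent box token in a single forward pass, and a head regresses the denoised box. This is not the authors' implementation; all module names, dimensions, and the Gaussian noise schedule are illustrative assumptions.

```python
# Minimal sketch, not the official DeTrack code. Shapes, names, and the
# noise model are assumptions made for illustration only.
import torch
import torch.nn as nn


def make_noisy_boxes(gt_boxes: torch.Tensor, noise_scale: float = 0.1) -> torch.Tensor:
    """Perturb ground-truth boxes (cx, cy, w, h, normalized to [0, 1])
    with Gaussian noise to create noisy boxes for training."""
    noisy = gt_boxes + noise_scale * torch.randn_like(gt_boxes)
    return noisy.clamp(0.0, 1.0)


class DenoisingBlock(nn.Module):
    """One block: refines a latent box token, conditioned on template and
    search-region embeddings via cross-attention."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, box_tokens, cond_tokens):
        # cond_tokens: concatenated template + search embeddings (the "conditions")
        x = box_tokens
        q = self.norm1(x)
        x = x + self.self_attn(q, q, q)[0]
        x = x + self.cross_attn(self.norm2(x), cond_tokens, cond_tokens)[0]
        x = x + self.mlp(self.norm3(x))
        return x


class InModelDenoiser(nn.Module):
    """Stack of denoising blocks: the whole denoising process runs inside
    a single forward pass instead of iterating the model multiple times."""

    def __init__(self, dim: int = 256, depth: int = 6):
        super().__init__()
        self.box_embed = nn.Linear(4, dim)          # noisy (cx, cy, w, h) -> latent token
        self.blocks = nn.ModuleList(DenoisingBlock(dim) for _ in range(depth))
        self.box_head = nn.Linear(dim, 4)           # latent token -> denoised box

    def forward(self, noisy_boxes, template_feat, search_feat):
        cond = torch.cat([template_feat, search_feat], dim=1)
        x = self.box_embed(noisy_boxes).unsqueeze(1)  # (B, 1, dim) box token
        for blk in self.blocks:
            x = blk(x, cond)                          # each block removes part of the noise
        return self.box_head(x.squeeze(1)).sigmoid()


# Toy usage: shapes for one training step with random features.
if __name__ == "__main__":
    B, N, dim = 2, 64, 256
    gt = torch.rand(B, 4)
    model = InModelDenoiser(dim=dim)
    pred = model(make_noisy_boxes(gt), torch.randn(B, N, dim), torch.randn(B, N, dim))
    loss = nn.functional.l1_loss(pred, gt)  # regress the clean box from the noisy one
```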
Related papers
- ConsistencyDet: A Robust Object Detector with a Denoising Paradigm of Consistency Model [28.193325656555803]
We introduce a novel framework designed to articulate object detection as a denoising diffusion process.
This framework, termed ConsistencyDet, leverages an innovative denoising concept known as the Consistency Model.
We show that ConsistencyDet surpasses other leading-edge detectors in performance metrics.
arXiv Detail & Related papers (2024-04-11T14:08:45Z) - DeNoising-MOT: Towards Multiple Object Tracking with Severe Occlusions [52.63323657077447]
We propose DNMOT, an end-to-end trainable DeNoising Transformer for multiple object tracking.
Specifically, we augment the trajectory with noises during training and make our model learn the denoising process in an encoder-decoder architecture.
We conduct extensive experiments on the MOT17, MOT20, and DanceTrack datasets, and the experimental results show that our method outperforms previous state-of-the-art methods by a clear margin.
arXiv Detail & Related papers (2023-09-09T04:40:01Z) - DiffSED: Sound Event Detection with Denoising Diffusion [70.18051526555512]
We reformulate the SED problem by taking a generative learning perspective.
Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process.
During training, our model learns to reverse the noising process by converting noisy latent queries to their ground-truth versions.
arXiv Detail & Related papers (2023-08-14T17:29:41Z) - Self2Self+: Single-Image Denoising with Self-Supervised Learning and Image Quality Assessment Loss [4.035753155957699]
The proposed method achieves state-of-the-art denoising performance on both synthetic and real-world datasets.
This highlights the effectiveness and practicality of our method as a potential solution for various noise removal tasks.
arXiv Detail & Related papers (2023-07-20T08:38:01Z) - Masked Image Training for Generalizable Deep Image Denoising [53.03126421917465]
We present a novel approach to enhance the generalization performance of denoising networks.
Our method involves masking random pixels of the input image and reconstructing the missing information during training.
Our approach exhibits better generalization ability than other deep learning models and is directly applicable to real-world scenarios.
arXiv Detail & Related papers (2023-03-23T09:33:44Z) - Enhancing convolutional neural network generalizability via low-rank weight approximation [6.763245393373041]
Sufficient denoising is often an important first step for image processing.
Deep neural networks (DNNs) have been widely used for image denoising.
We introduce a new self-supervised framework for image denoising based on the Tucker low-rank tensor approximation.
arXiv Detail & Related papers (2022-09-26T14:11:05Z) - IDR: Self-Supervised Image Denoising via Iterative Data Refinement [66.5510583957863]
We present a practical unsupervised image denoising method to achieve state-of-the-art denoising performance.
Our method only requires single noisy images and a noise model, which is easily accessible in practical raw image denoising.
To evaluate raw image denoising performance in real-world applications, we build a high-quality raw image dataset SenseNoise-500 that contains 500 real-life scenes.
arXiv Detail & Related papers (2021-11-29T07:22:53Z) - Learning Spatial and Spatio-Temporal Pixel Aggregations for Image and Video Denoising [104.59305271099967]
We present a pixel aggregation network and learn the pixel sampling and averaging strategies for image denoising.
We develop a pixel aggregation network for video denoising to sample pixels across the spatial-temporal space.
Our method is able to solve the misalignment issues caused by large motion in dynamic scenes.
arXiv Detail & Related papers (2021-01-26T13:00:46Z) - Self-Supervised Fast Adaptation for Denoising via Meta-Learning [28.057705167363327]
We propose a new denoising approach that can greatly outperform the state-of-the-art supervised denoising methods.
We show that the proposed method can be easily employed with state-of-the-art denoising networks without additional parameters.
arXiv Detail & Related papers (2020-01-09T09:40:53Z) - Variational Denoising Network: Toward Blind Noise Modeling and Removal [59.36166491196973]
Blind image denoising is an important yet very challenging problem in computer vision.
We propose a new variational inference method, which integrates both noise estimation and image denoising.
arXiv Detail & Related papers (2019-08-29T15:54:06Z)