Beat Tracking as Object Detection
- URL: http://arxiv.org/abs/2510.14391v2
- Date: Fri, 17 Oct 2025 00:38:35 GMT
- Title: Beat Tracking as Object Detection
- Authors: Jaehoon Ahn, Moon-Ryul Jung,
- Abstract summary: Recent beat and downbeat tracking models (e.g., RNNs, TCNs, Transformers) output frame-level activations.<n>We propose reframing this task as object detection, where beats and downbeats are modeled as temporal "objects"<n>Adapting the FCOS detector from computer vision to 1D audio, we replace its original backbone with WaveBeat's temporal feature extractor and add a Feature Pyramid Network to capture multi-scale temporal patterns.<n>The model predicts overlapping beat/downbeat intervals with confidence scores, followed by non-maximum suppression (NMS) to select final predictions.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent beat and downbeat tracking models (e.g., RNNs, TCNs, Transformers) output frame-level activations. We propose reframing this task as object detection, where beats and downbeats are modeled as temporal "objects." Adapting the FCOS detector from computer vision to 1D audio, we replace its original backbone with WaveBeat's temporal feature extractor and add a Feature Pyramid Network to capture multi-scale temporal patterns. The model predicts overlapping beat/downbeat intervals with confidence scores, followed by non-maximum suppression (NMS) to select final predictions. This NMS step serves a similar role to DBNs in traditional trackers, but is simpler and less heuristic. Evaluated on standard music datasets, our approach achieves competitive results, showing that object detection techniques can effectively model musical beats with minimal adaptation.
Related papers
- StableTrack: Stabilizing Multi-Object Tracking on Low-Frequency Detections [0.18054741274903915]
Multi-object tracking (MOT) is one of the most challenging tasks in computer vision.<n>Current approaches mainly focus on tracking objects in each frame of a video stream.<n>We propose StableTrack, a novel approach that stabilizes the quality of tracking on low-frequency detections.
arXiv Detail & Related papers (2025-11-25T15:42:33Z) - Beat and Downbeat Tracking in Performance MIDI Using an End-to-End Transformer Architecture [2.8544822698499255]
This paper proposes an end-to-end transformer-based model for beat and downbeat tracking in performance MIDI.<n>Our approach introduces novel data preprocessing techniques, including dynamic augmentation and optimized tokenization strategies.<n>We conduct extensive experiments using the A-MAPS, ASAP, GuitarSet, and Leduc datasets, comparing our model against state-of-the-art hidden Markov models (HMMs) and deep learning-based beat tracking methods.
arXiv Detail & Related papers (2025-07-01T06:27:42Z) - Beat this! Accurate beat tracking without DBN postprocessing [4.440100868992127]
We propose a system for tracking beats and downbeats with two objectives: generality across a diverse music range, and high accuracy.
We achieve generality by training on multiple datasets, including solo instrument recordings, pieces with time signature changes, and classical music with high tempo variations.
For high accuracy, we develop a loss function tolerant to small time shifts of annotations, and an architecture alternating convolutions with transformers either over frequency or time.
arXiv Detail & Related papers (2024-07-31T14:59:17Z) - Exploring Dynamic Transformer for Efficient Object Tracking [58.120191254379854]
We propose DyTrack, a dynamic transformer framework for efficient tracking.<n>DyTrack automatically learns to configure proper reasoning routes for various inputs, gaining better utilization of the available computational budget.<n>Experiments on multiple benchmarks demonstrate that DyTrack achieves promising speed-precision trade-offs with only a single model.
arXiv Detail & Related papers (2024-03-26T12:31:58Z) - DropMAE: Learning Representations via Masked Autoencoders with Spatial-Attention Dropout for Temporal Matching Tasks [77.84636815364905]
This paper studies masked autoencoder (MAE) video pre-training for various temporal matching-based downstream tasks.<n>We propose DropMAE, which adaptively performs spatial-attention dropout in the frame reconstruction to facilitate temporal correspondence learning in videos.
arXiv Detail & Related papers (2023-04-02T16:40:42Z) - DORT: Modeling Dynamic Objects in Recurrent for Multi-Camera 3D Object
Detection and Tracking [67.34803048690428]
We propose to model Dynamic Objects in RecurrenT (DORT) to tackle this problem.
DORT extracts object-wise local volumes for motion estimation that also alleviates the heavy computational burden.
It is flexible and practical that can be plugged into most camera-based 3D object detectors.
arXiv Detail & Related papers (2023-03-29T12:33:55Z) - Spatio-Temporal Point Process for Multiple Object Tracking [30.041104276095624]
Multiple Object Tracking (MOT) focuses on modeling the relationship of detected objects among consecutive frames and merge them into different trajectories.
We propose a novel framework that can effectively predict and mask-out noisy and confusing detection results before associating objects into trajectories.
arXiv Detail & Related papers (2023-02-05T18:14:08Z) - Minkowski Tracker: A Sparse Spatio-Temporal R-CNN for Joint Object
Detection and Tracking [53.64390261936975]
We present Minkowski Tracker, a sparse-temporal R-CNN that jointly solves object detection and tracking problems.
Inspired by region-based CNN (R-CNN), we propose to track motion as a second stage of the object detector R-CNN.
We show in large-scale experiments that the overall performance gain of our method is due to four factors.
arXiv Detail & Related papers (2022-08-22T04:47:40Z) - Neural Waveshaping Synthesis [0.0]
We present a novel, lightweight, fully causal approach to neural audio synthesis.
The Neural Waveshaping Unit (NEWT) operates directly in the waveform domain.
It produces complex timbral evolutions by simple affine transformations of its input and output signals.
arXiv Detail & Related papers (2021-07-11T13:50:59Z) - Object Detection Made Simpler by Eliminating Heuristic NMS [70.93004137521946]
We show a simple NMS-free, end-to-end object detection framework.
We attain on par or even improved detection accuracy compared with the original one-stage detector.
arXiv Detail & Related papers (2021-01-28T02:38:29Z) - ArTIST: Autoregressive Trajectory Inpainting and Scoring for Tracking [80.02322563402758]
One of the core components in online multiple object tracking (MOT) frameworks is associating new detections with existing tracklets.
We introduce a probabilistic autoregressive generative model to score tracklet proposals by directly measuring the likelihood that a tracklet represents natural motion.
arXiv Detail & Related papers (2020-04-16T06:43:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.