Timbre-Adaptive Transcription: A Lightweight Architecture with Associative Memory for Dynamic Instrument Separation
- URL: http://arxiv.org/abs/2509.12712v1
- Date: Tue, 16 Sep 2025 06:05:36 GMT
- Title: Timbre-Adaptive Transcription: A Lightweight Architecture with Associative Memory for Dynamic Instrument Separation
- Authors: Ruigang Li, Yongxu Zhu
- Abstract summary: A timbre-agnostic backbone achieves state-of-the-art performance with only half the parameters of comparable models. A novel associative memory mechanism mimics human auditory cognition to dynamically encode unseen timbres.
- Score: 8.166820420083175
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing multi-timbre transcription models struggle with generalization beyond pre-trained instruments and rigid source-count constraints. We address these limitations with a lightweight deep clustering solution featuring: 1) a timbre-agnostic backbone achieving state-of-the-art performance with only half the parameters of comparable models, and 2) a novel associative memory mechanism that mimics human auditory cognition to dynamically encode unseen timbres via attention-based clustering. Our biologically-inspired framework enables adaptive polyphonic separation with minimal training data (12.5 minutes), supported by a new synthetic dataset method offering cost-effective, high-precision multi-timbre generation. Experiments show the timbre-agnostic transcription model outperforms existing models on public benchmarks, while the separation module demonstrates promising timbre discrimination. This work provides an efficient framework for timbre-related music transcription and explores new directions for timbre-aware separation through cognitive-inspired architectures.
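The abstract describes the associative memory as attention-based clustering over timbre embeddings. As a rough illustration only (not the authors' implementation; the slot count, farthest-point initialisation, temperature, and update rule below are all assumptions), a soft-assignment loop over per-frame embeddings might look like:

```python
import numpy as np

def attention_cluster(frames, n_slots, n_iters=10, temperature=0.1):
    """Softly assign frame embeddings (T, D) to timbre 'memory slots'.

    Returns (slots, attn): slots is (n_slots, D); attn is (T, n_slots)
    with rows summing to 1 (soft cluster assignments).
    """
    # Farthest-point initialisation: deterministic, spreads slots apart.
    slots = [frames[0].copy()]
    for _ in range(n_slots - 1):
        dists = np.min(
            [np.linalg.norm(frames - s, axis=1) for s in slots], axis=0)
        slots.append(frames[np.argmax(dists)].copy())
    slots = np.stack(slots)
    for _ in range(n_iters):
        # Scaled dot-product attention: frames are queries, slots are keys.
        logits = frames @ slots.T / (temperature * np.sqrt(frames.shape[1]))
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        attn = np.exp(logits)
        attn /= attn.sum(axis=1, keepdims=True)      # (T, n_slots)
        # Each slot becomes the attention-weighted mean of its frames.
        weights = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
        slots = weights.T @ frames
    return slots, attn
```

New slots could be appended when no existing slot attends strongly to a frame, which is one plausible reading of "dynamically encode unseen timbres".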
Related papers
- Domain-Incremental Continual Learning for Robust and Efficient Keyword Spotting in Resource Constrained Systems [0.0]
Keyword Spotting systems with small footprint models deployed on edge devices face significant accuracy and robustness challenges.
We propose a comprehensive framework for continual learning designed to adapt to new domains while maintaining computational efficiency.
The proposed pipeline integrates a dual-input Convolutional Neural Network, utilizing both Mel Frequency Cepstral Coefficients (MFCC) and Mel-spectrogram features.
arXiv Detail & Related papers (2026-01-22T17:59:31Z) - SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models [53.19726629537694]
Post-training alignment of video generation models with human preferences is a critical goal.
Current data collection paradigms, reliant on in-prompt pairwise annotations, suffer from labeling noise.
We propose SoliReward, a systematic framework for video RM training.
arXiv Detail & Related papers (2025-12-17T14:28:23Z) - StutterFuse: Mitigating Modality Collapse in Stuttering Detection with Jaccard-Weighted Metric Learning and Gated Fusion [0.40105987447353786]
Stuttering detection breaks down when disfluencies overlap.
Existing parametric models struggle to distinguish complex, simultaneous disfluencies.
We introduce StutterFuse, the first Retrieval-Augmented Classifier (RAC) for multi-label detection.
arXiv Detail & Related papers (2025-12-15T18:28:39Z) - Bridging Weakly-Supervised Learning and VLM Distillation: Noisy Partial Label Learning for Efficient Downstream Adaptation [51.67328507400985]
In noisy partial label learning (NPLL), each training sample is associated with a set of candidate labels annotated by multiple noisy annotators.
This paper focuses on learning from partial labels annotated by pre-trained vision-language models.
It proposes an innovative collaborative consistency regularization (Co-Reg) method.
arXiv Detail & Related papers (2025-06-03T12:48:54Z) - Modeling Time-Variant Responses of Optical Compressors with Selective State Space Models [0.0]
This paper presents a method for modeling optical dynamic range compressors using deep neural networks with Selective State Space models.
It features a refined technique integrating Feature-wise Linear Modulation and Gated Linear Units to adjust the network dynamically.
The proposed architecture is well-suited for low-latency and real-time applications, crucial in live audio processing.
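FiLM and GLU are standard building blocks; as a generic sketch of the two operations named above (not the paper's network, whose conditioning pathway is not specified here), they reduce to:

```python
import numpy as np

def film(x, gamma, beta):
    """Feature-wise Linear Modulation: per-channel scale and shift,
    where gamma/beta would be predicted from conditioning inputs."""
    return gamma * x + beta

def glu(x, axis=-1):
    """Gated Linear Unit: split channels in half and gate one half
    with the sigmoid of the other."""
    a, b = np.split(x, 2, axis=axis)
    return a * (1.0 / (1.0 + np.exp(-b)))
```

In a dynamic-compressor model, the conditioning inputs to `film` would plausibly encode the compressor's control parameters, letting one network adapt its response per setting.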
arXiv Detail & Related papers (2024-08-22T17:03:08Z) - Robust Wake-Up Word Detection by Two-stage Multi-resolution Ensembles [48.208214762257136]
It employs two models: a lightweight on-device model for real-time processing of the audio stream and a verification model on the server-side.
To protect privacy, audio features are sent to the cloud instead of raw audio.
arXiv Detail & Related papers (2023-10-17T16:22:18Z) - Timbre-Trap: A Low-Resource Framework for Instrument-Agnostic Music Transcription [19.228155694144995]
Timbre-Trap is a novel framework which unifies music transcription and audio reconstruction.
We train a single autoencoder to simultaneously estimate pitch salience and reconstruct complex spectral coefficients.
We demonstrate that the framework leads to performance comparable to state-of-the-art instrument-agnostic transcription methods.
arXiv Detail & Related papers (2023-09-27T15:19:05Z) - High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
A non-autoregressive framework enhances controllability, and a duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z) - Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z) - Time-Frequency Scattering Accurately Models Auditory Similarities Between Instrumental Playing Techniques [5.923588533979649]
We show that timbre perception operates within a more flexible taxonomy than those provided by instruments or playing techniques alone.
We propose a machine listening model to recover the cluster graph of auditory similarities across instruments, mutes, and techniques.
arXiv Detail & Related papers (2020-07-21T16:37:15Z) - Vector-Quantized Timbre Representation [53.828476137089325]
This paper targets a more flexible synthesis of an individual timbre by learning an approximate decomposition of its spectral properties with a set of generative features.
We introduce an auto-encoder with a discrete latent space that is disentangled from loudness in order to learn a quantized representation of a given timbre distribution.
We detail results for translating audio between orchestral instruments and singing voice, as well as transfers from vocal imitations to instruments.
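The discrete latent space corresponds to a vector-quantisation step: each encoder output is snapped to its nearest codebook entry. A minimal sketch of that lookup (codebook learning and the straight-through gradient estimator are omitted; names are illustrative):

```python
import numpy as np

def vector_quantize(z, codebook):
    """Replace each latent vector in z (N, D) with its nearest
    entry from codebook (K, D); returns (quantized, indices)."""
    # Pairwise squared distances between latents and codebook entries.
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
    idx = d.argmin(axis=1)
    return codebook[idx], idx
```

Timbre transfer then amounts to decoding one signal's structure through another timbre's codebook.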
arXiv Detail & Related papers (2020-07-13T12:35:45Z) - Automated and Formal Synthesis of Neural Barrier Certificates for Dynamical Models [70.70479436076238]
We introduce an automated, formal, counterexample-based approach to synthesise Barrier Certificates (BC).
The approach is underpinned by an inductive framework, which manipulates a candidate BC structured as a neural network, and a sound verifier, which either certifies the candidate's validity or generates counter-examples.
The outcomes show that we can synthesise sound BCs up to two orders of magnitude faster, with a particularly stark speedup on the verification engine.
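The inductive candidate/verifier loop described above is an instance of the counterexample-guided (CEGIS) pattern. A generic skeleton, where `propose` and `verify` stand in for the neural learner and the sound verifier (names are illustrative, not from the paper):

```python
def cegis(propose, verify, max_iters=100):
    """Counterexample-guided loop: propose a candidate, try to
    verify it, and feed any counterexample back to the proposer."""
    counterexamples = []
    for _ in range(max_iters):
        candidate = propose(counterexamples)
        cex = verify(candidate)
        if cex is None:                  # verifier certified the candidate
            return candidate
        counterexamples.append(cex)      # refine using the counterexample
    return None                          # iteration budget exhausted
```

A toy usage: to find a threshold certified to be at least 3, `propose = lambda cexs: max(cexs, default=0) + 1` with `verify = lambda c: None if c >= 3 else c` converges in three iterations.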
arXiv Detail & Related papers (2020-07-07T07:39:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.