Polynomial Mixing for Efficient Self-supervised Speech Encoders
- URL: http://arxiv.org/abs/2603.00683v1
- Date: Sat, 28 Feb 2026 14:45:55 GMT
- Title: Polynomial Mixing for Efficient Self-supervised Speech Encoders
- Authors: Eva Feillet, Ryan Whetten, David Picard, Alexandre Allauzen
- Abstract summary: Polynomial Mixer (PoM) is a drop-in replacement for multi-head self-attention. PoM achieves a competitive word error rate on downstream speech recognition tasks.
- Score: 50.58463928808225
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: State-of-the-art speech-to-text models typically employ Transformer-based encoders that model token dependencies via self-attention mechanisms. However, the quadratic complexity of self-attention in both memory and computation imposes significant constraints on scalability. In this work, we propose a novel token-mixing mechanism, the Polynomial Mixer (PoM), as a drop-in replacement for multi-head self-attention. PoM computes a polynomial representation of the input with linear complexity with respect to the input sequence length. We integrate PoM into a self-supervised speech representation learning framework based on BEST-RQ and evaluate its performance on downstream speech recognition tasks. Experimental results demonstrate that PoM achieves a competitive word error rate compared to full self-attention and other linear-complexity alternatives, offering an improved trade-off between performance and efficiency in time and memory.
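The abstract gives no implementation details, so the following is only a minimal PyTorch sketch of a generic degree-2 polynomial token mixer with linear-time pooling. The module name, the `d_state` width, and the gated read-out are illustrative assumptions, not the authors' PoM.

```python
import torch
import torch.nn as nn

class PolynomialMixerSketch(nn.Module):
    """Illustrative degree-2 polynomial token mixer (not the authors' code).

    Tokens are projected, their degree-1 and degree-2 monomials are pooled
    over the sequence in a single pass, and every token reads the pooled
    state back through a learned gate: O(L) time and memory in sequence
    length L, versus O(L^2) for self-attention.
    """

    def __init__(self, d_model: int, d_state: int = 64):
        super().__init__()
        self.proj = nn.Linear(d_model, d_state)      # token -> state features
        self.gate = nn.Linear(d_model, 2 * d_state)  # per-token read-out gate
        self.out = nn.Linear(2 * d_state, d_model)   # pooled state -> output

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, d_model)
        h = self.proj(x)                              # (B, L, d_state)
        m1 = h.mean(dim=1, keepdim=True)              # degree-1 moment, O(L)
        m2 = (h * h).mean(dim=1, keepdim=True)        # degree-2 moment, O(L)
        state = torch.cat([m1, m2], dim=-1)           # (B, 1, 2*d_state)
        read = torch.sigmoid(self.gate(x)) * state    # per-token gated read
        return self.out(read)                         # (B, L, d_model)

# Usage as a stand-in for one self-attention sublayer:
x = torch.randn(2, 1600, 256)  # e.g. 16 s of speech at 100 frames/s
assert PolynomialMixerSketch(256)(x).shape == x.shape
```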
Related papers
- A Transformer Inspired AI-based MIMO receiver [0.5039813366558306]
The AttDet design combines model-based interpretability with data-driven flexibility.
Through link-level simulations under 5G channel models with high-order, mixed QAM modulation and coding schemes, we demonstrate that AttDet can approach near-optimal BER/BLER performance while maintaining predictable, realistic complexity.
arXiv Detail & Related papers (2025-10-23T09:05:10Z) - From Attention to Atoms: Spectral Dictionary Learning for Fast, Interpretable Language Models [0.0]
We propose a novel spectral generative modeling framework for natural language processing that jointly learns a global time-varying Fourier dictionary and per-token mixing coefficients.
Our approach achieves competitive perplexity and generation quality on standard benchmarks such as WikiText-2 and Penn Treebank.
We demonstrate that spectral dictionary models can achieve competitive performance compared to transformer baselines while significantly reducing inference latency and memory footprint.
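As a rough illustration of the stated idea (learned spectral atoms plus per-token mixing coefficients), here is a hedged sketch; the cosine atom parameterization and all names are assumptions, not the paper's model.

```python
import math
import torch
import torch.nn as nn

class SpectralDictionaryMixerSketch(nn.Module):
    """Illustrative spectral-dictionary token mixer (assumed design).

    K learned sinusoidal atoms are evaluated over positions, the sequence is
    projected onto them in one O(L*K) pass, and each token reassembles the
    pooled spectral state with its own mixing coefficients.
    """

    def __init__(self, d_model: int, n_atoms: int = 32):
        super().__init__()
        self.freq = nn.Parameter(torch.rand(n_atoms))    # learned frequencies
        self.phase = nn.Parameter(torch.zeros(n_atoms))  # learned phases
        self.coeff = nn.Linear(d_model, n_atoms)         # per-token coefficients
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, L, D = x.shape
        t = torch.arange(L, dtype=x.dtype, device=x.device)
        atoms = torch.cos(2 * math.pi * self.freq[:, None] * t / L
                          + self.phase[:, None])              # (K, L)
        spectrum = torch.einsum("kl,bld->bkd", atoms, x) / L  # project, O(L*K)
        c = torch.softmax(self.coeff(x), dim=-1)              # (B, L, K)
        return self.out(torch.einsum("blk,bkd->bld", c, spectrum))
```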
arXiv Detail & Related papers (2025-04-29T13:24:42Z) - Learnable Multi-Scale Wavelet Transformer: A Novel Alternative to Self-Attention [0.0]
The Learnable Multi-Scale Wavelet Transformer (LMWT) is a novel architecture that replaces standard dot-product self-attention.
We present the detailed mathematical formulation of the learnable Haar wavelet module and its integration into the transformer framework.
Our results indicate that the LMWT achieves competitive performance while offering substantial computational advantages.
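A minimal sketch of what a learnable Haar mixer could look like, assuming learned per-scale gains around a standard Haar analysis/synthesis pair; this is not the LMWT formulation itself.

```python
import torch
import torch.nn as nn

class LearnableHaarMixerSketch(nn.Module):
    """Illustrative multi-scale Haar token mixer (assumed design).

    Tokens are recursively split into pairwise averages (approximation) and
    differences (detail); each detail scale gets a learned gain and the
    sequence is rebuilt by the inverse transform, at O(L) total cost.
    """

    def __init__(self, n_scales: int = 3):
        super().__init__()
        self.detail_gain = nn.Parameter(torch.ones(n_scales))
        self.n_scales = n_scales

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, d_model); length must divide by 2**n_scales
        approx, details = x, []
        for _ in range(self.n_scales):              # Haar analysis
            a, b = approx[:, 0::2], approx[:, 1::2]
            details.append((a - b) / 2)             # detail coefficients
            approx = (a + b) / 2                    # coarser approximation
        for s in reversed(range(self.n_scales)):    # Haar synthesis
            d = details[s] * self.detail_gain[s]    # learned per-scale gain
            up = torch.stack((approx + d, approx - d), dim=2)
            approx = up.reshape(x.size(0), -1, x.size(-1))
        return approx
```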
arXiv Detail & Related papers (2025-04-08T22:16:54Z) - A Hybrid Transformer Architecture with a Quantized Self-Attention Mechanism Applied to Molecular Generation [0.0]
We propose a hybrid quantum-classical self-attention mechanism as part of a transformer decoder.
We show that the time complexity of the query-key dot product is reduced from $\mathcal{O}(n^2 d)$ in a classical model to $\mathcal{O}(n^2 \log d)$ in our quantum model.
This work provides a promising avenue for quantum-enhanced natural language processing (NLP).
arXiv Detail & Related papers (2025-02-26T15:15:01Z) - Computation and Parameter Efficient Multi-Modal Fusion Transformer for Cued Speech Recognition [48.84506301960988]
Cued Speech (CS) is a purely visual coding method used by hearing-impaired people.
Automatic CS recognition (ACSR) seeks to transcribe visual cues of speech into text.
arXiv Detail & Related papers (2024-01-31T05:20:29Z) - Mapping of attention mechanisms to a generalized Potts model [50.91742043564049]
We show that training a neural network is exactly equivalent to solving the inverse Potts problem by the so-called pseudo-likelihood method.
We also compute the generalization error of self-attention in a model scenario analytically using the replica method.
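For reference, the pseudo-likelihood objective mentioned above is standard: for a Potts model with couplings $J$ and fields $h$ over sites $x_1, \dots, x_L$, it replaces the intractable likelihood with a sum of tractable per-site conditionals.

```latex
\mathcal{L}_{\mathrm{PL}}(J, h) = \sum_{i=1}^{L} \log P\left(x_i \mid x_{\setminus i}; J, h\right),
\quad
P\left(x_i \mid x_{\setminus i}\right) =
\frac{\exp\left(h_i(x_i) + \sum_{j \neq i} J_{ij}(x_i, x_j)\right)}
     {\sum_{q} \exp\left(h_i(q) + \sum_{j \neq i} J_{ij}(q, x_j)\right)}
```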
arXiv Detail & Related papers (2023-04-14T16:32:56Z) - Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers [55.90468016961356]
We propose an efficient token mixer that learns to mix in the Fourier domain.
AFNO is based on a principled foundation of operator learning.
It can handle a sequence size of 65k and outperforms other efficient self-attention mechanisms.
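Below is a simplified sketch of Fourier-domain token mixing in this spirit: FFT over the token axis, a pointwise MLP on the modes, inverse FFT. The released AFNO additionally uses block-diagonal weights and soft-thresholding, which are omitted here.

```python
import torch
import torch.nn as nn

class FourierTokenMixerSketch(nn.Module):
    """Simplified AFNO-style mixer (not the released AFNO code).

    FFT along tokens gives O(L log L) mixing versus O(L^2) attention; a
    shared MLP then mixes channels independently per frequency mode.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * d_model, 2 * d_model), nn.GELU(),
            nn.Linear(2 * d_model, 2 * d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        L = x.size(1)
        f = torch.fft.rfft(x, dim=1)             # (B, L//2+1, D), complex
        z = torch.cat((f.real, f.imag), dim=-1)  # handle as real pairs
        re, im = self.mlp(z).chunk(2, dim=-1)    # channel mixing per mode
        return torch.fft.irfft(torch.complex(re, im), n=L, dim=1)
```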
arXiv Detail & Related papers (2021-11-24T05:44:31Z) - MIMO Self-attentive RNN Beamformer for Multi-speaker Speech Separation [45.90599689005832]
Recently, our proposed recurrent neural network (RNN)-based, all-deep-learning minimum variance distortionless response (ADL-MVDR) beamformer yielded superior performance over the conventional MVDR.
We present a self-attentive RNN beamformer that further improves our previous RNN-based beamformer by leveraging the powerful modeling capability of self-attention.
arXiv Detail & Related papers (2021-04-17T05:02:04Z) - Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many, location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottleneck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z) - Improve Variational Autoencoder for Text Generation with Discrete Latent Bottleneck [52.08901549360262]
Variational autoencoders (VAEs) are essential tools in end-to-end representation learning.
When paired with a strong auto-regressive decoder, VAEs tend to ignore the latent variables.
We propose a principled approach to enforce implicit latent feature matching in a more compact latent space.
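For context, the collapse described above can be read directly off the standard VAE objective; the ELBO below is textbook notation, not this paper's.

```latex
\mathcal{L}(\theta, \phi; x) =
\mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right]
- D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, p(z)\right)
% A strong auto-regressive decoder can model x without z, driving the KL
% term to zero so that q(z|x) matches p(z) and z carries no information.
```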
arXiv Detail & Related papers (2020-04-22T14:41:37Z)