Abstract: Mel-filterbanks are fixed, hand-engineered audio features that emulate human
perception and have been used throughout the history of audio understanding,
up to the present day. However, their undeniable qualities are counterbalanced
by the fundamental limitations of hand-crafted representations. In this work we show that
we can train a single learnable frontend that outperforms mel-filterbanks on a
wide range of audio signals, including speech, music, audio events and animal
sounds, providing a general-purpose learned frontend for audio classification.
To do so, we introduce a new principled, lightweight, fully learnable
architecture that can be used as a drop-in replacement for mel-filterbanks. Our
system learns all operations of audio feature extraction, from filtering to
pooling, compression and normalization, and can be integrated into any neural
network at a negligible parameter cost. We perform multi-task training on eight
diverse audio classification tasks, and show consistent improvements of our
model over mel-filterbanks and previous learnable alternatives. Moreover, our
system outperforms the current state-of-the-art learnable frontend on AudioSet,
with orders of magnitude fewer parameters.
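To make the described pipeline concrete, below is a minimal PyTorch sketch of a frontend in which every stage named in the abstract (filtering, pooling, compression, normalization) carries trainable weights. The class name, the random filter initialization, and the per-stage parameterizations are illustrative assumptions for this sketch, not the paper's actual design.

```python
import torch
import torch.nn as nn

class LearnableFrontend(nn.Module):
    """Hypothetical drop-in replacement for mel-filterbanks where filtering,
    pooling, compression and normalization are all learned end to end."""

    def __init__(self, n_filters=40, kernel_size=401, stride=160):
        super().__init__()
        # Learnable bandpass filtering (randomly initialized here; the paper's
        # frontend uses a principled parameterization not reproduced in this sketch).
        self.filters = nn.Conv1d(1, n_filters, kernel_size,
                                 bias=False, padding=kernel_size // 2)
        # Learnable lowpass pooling, one filter per channel (depthwise conv).
        self.pooling = nn.Conv1d(n_filters, n_filters, kernel_size=25,
                                 stride=stride, groups=n_filters,
                                 bias=False, padding=12)
        # Learnable per-channel compression exponent (softplus keeps it positive).
        self.log_alpha = nn.Parameter(torch.zeros(n_filters))
        # Learnable normalization over time, per channel.
        self.norm = nn.InstanceNorm1d(n_filters, affine=True)

    def forward(self, waveform):  # waveform: (batch, samples)
        x = self.filters(waveform.unsqueeze(1))   # filtering
        x = self.pooling(x ** 2)                  # energy + pooling
        x = torch.clamp(x, min=1e-6)
        alpha = nn.functional.softplus(self.log_alpha).view(1, -1, 1)
        x = x ** alpha                            # learnable compression
        return self.norm(x)                       # learnable normalization

frontend = LearnableFrontend()
feats = frontend(torch.randn(2, 16000))  # 1 s of 16 kHz audio
print(feats.shape)  # (2, 40, 100): a time-frequency representation
```

With these settings the sketch has on the order of 17k parameters, in line with the "negligible parameter cost" the abstract claims for such a frontend.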