Spectrotemporal Modulation: Efficient and Interpretable Feature Representation for Classifying Speech, Music, and Environmental Sounds
- URL: http://arxiv.org/abs/2505.23509v1
- Date: Thu, 29 May 2025 14:52:47 GMT
- Title: Spectrotemporal Modulation: Efficient and Interpretable Feature Representation for Classifying Speech, Music, and Environmental Sounds
- Authors: Andrew Chang, Yike Li, Iran R. Roman, David Poeppel
- Abstract summary: We propose a novel approach centered on spectrotemporal modulation (STM) features, a signal processing method that mimics the neurophysiological representation in the human auditory cortex. The classification performance of our STM-based model, without any pretraining, is comparable to that of pretrained audio DNNs across diverse naturalistic speech, music, and environmental sounds. These results show that STM is an efficient and interpretable feature representation for audio classification, advancing the development of machine listening and unlocking exciting new possibilities for basic understanding of speech and auditory sciences, as well as developing audio BCI and cognitive computing.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Audio DNNs have demonstrated impressive performance on various machine listening tasks; however, most of their representations are computationally costly and uninterpretable, leaving room for optimization. Here, we propose a novel approach centered on spectrotemporal modulation (STM) features, a signal processing method that mimics the neurophysiological representation in the human auditory cortex. The classification performance of our STM-based model, without any pretraining, is comparable to that of pretrained audio DNNs across diverse naturalistic speech, music, and environmental sounds, which are essential categories for both human cognition and machine perception. These results show that STM is an efficient and interpretable feature representation for audio classification, advancing the development of machine listening and unlocking exciting new possibilities for basic understanding of speech and auditory sciences, as well as developing audio BCI and cognitive computing.
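The abstract does not spell out the exact STM pipeline. A common formulation is the modulation power spectrum, obtained by a 2D Fourier transform of a log-mel spectrogram; the sketch below follows that formulation, and the mel parameters, pooling grid, and the idea of feeding a lightweight classifier are illustrative assumptions rather than the paper's exact method.

```python
# Sketch: spectrotemporal modulation (STM) features as the modulation
# power spectrum of a log-mel spectrogram. Parameters are illustrative.
import numpy as np
import librosa

def stm_features(y, sr, n_mels=64, hop_length=256):
    # Log-mel spectrogram: frequency (rows) x time (columns).
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                       hop_length=hop_length)
    logS = np.log(S + 1e-10)
    # 2D FFT over (frequency, time) yields spectral modulation vs
    # temporal modulation axes.
    mps = np.abs(np.fft.fftshift(np.fft.fft2(logS - logS.mean())))
    # Pool the modulation power spectrum into a coarse grid so every
    # clip maps to a fixed-length, interpretable feature vector.
    fbins, tbins = 8, 16
    rows = np.array_split(mps, fbins, axis=0)
    pooled = [[blk.mean() for blk in np.array_split(r, tbins, axis=1)]
              for r in rows]
    return np.array(pooled).ravel()

# Usage: the fixed-length descriptor can feed a lightweight classifier
# with no pretraining, e.g.:
# y, sr = librosa.load("clip.wav", sr=16000)
# x = stm_features(y, sr)
```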
Related papers
- AudioJudge: Understanding What Works in Large Audio Model Based Speech Evaluation [55.607230723223346]
This work presents a systematic study of Large Audio Model (LAM) as a Judge, AudioJudge, investigating whether it can provide a unified evaluation framework that addresses both challenges. We explore AudioJudge across audio characteristic detection tasks, including pronunciation, speaking rate, speaker identification, and speech quality, as well as system-level human preference simulation for automated benchmarking. We introduce a multi-aspect ensemble AudioJudge to enable general-purpose multi-aspect audio evaluation. This method decomposes speech assessment into specialized judges for lexical content, speech quality, and paralinguistic features, achieving up to 0.91 Spearman correlation with human preferences.
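A minimal sketch of how per-aspect judge scores can be ensembled and compared against human preferences via Spearman correlation; the scores, aspect names, and simple averaging rule below are hypothetical placeholders, not AudioJudge's actual prompting or aggregation.

```python
# Sketch: multi-aspect ensemble scoring and Spearman evaluation.
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-aspect judge scores for 5 audio systems.
scores = {
    "lexical_content": np.array([0.8, 0.6, 0.9, 0.4, 0.7]),
    "speech_quality":  np.array([0.7, 0.5, 0.8, 0.3, 0.9]),
    "paralinguistics": np.array([0.6, 0.7, 0.9, 0.5, 0.6]),
}
ensemble = np.mean(list(scores.values()), axis=0)  # simple average

human_prefs = np.array([0.75, 0.55, 0.95, 0.35, 0.70])  # toy labels
rho, _ = spearmanr(ensemble, human_prefs)
print(f"system-level Spearman correlation: {rho:.2f}")
```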
arXiv Detail & Related papers (2025-07-17T00:39:18Z)
- MOSPA: Human Motion Generation Driven by Spatial Audio [56.735282455483954]
We introduce the first comprehensive Spatial Audio-Driven Human Motion dataset, which contains diverse and high-quality spatial audio and motion data. We develop a simple yet effective diffusion-based generative framework for human MOtion generation driven by SPatial Audio, termed MOSPA. Once trained, MOSPA can generate diverse, realistic human motions conditioned on varying spatial audio inputs.
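For readers unfamiliar with the general recipe, here is one DDPM-style training step for motion conditioned on an audio embedding; the network, dimensions, and noise schedule are generic assumptions, not MOSPA's actual architecture.

```python
# Sketch: one denoising-diffusion training step with audio conditioning.
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphabar = torch.cumprod(1.0 - betas, dim=0)

class Denoiser(nn.Module):
    def __init__(self, motion_dim=69, audio_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim + audio_dim + 1, 256), nn.SiLU(),
            nn.Linear(256, motion_dim))
    def forward(self, x_t, t, audio_emb):
        t_feat = (t.float() / T).unsqueeze(-1)  # scalar timestep feature
        return self.net(torch.cat([x_t, audio_emb, t_feat], dim=-1))

model = Denoiser()
x0 = torch.randn(8, 69)       # toy motion frames (e.g., pose parameters)
audio = torch.randn(8, 128)   # toy spatial-audio embedding
t = torch.randint(0, T, (8,))
eps = torch.randn_like(x0)
ab = alphabar[t].unsqueeze(-1)
x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps  # forward noising
loss = nn.functional.mse_loss(model(x_t, t, audio), eps)
loss.backward()
```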
arXiv Detail & Related papers (2025-07-16T06:33:11Z)
- Tracking Articulatory Dynamics in Speech with a Fixed-Weight BiLSTM-CNN Architecture [0.0]
This paper presents a novel approach for predicting the tongue and lip articulatory features underlying a given speech acoustic signal. The proposed network is trained on two datasets of simultaneously recorded speech and Electromagnetic Articulography (EMA) data.
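A minimal sketch of the general CNN-plus-BiLSTM pattern for acoustics-to-articulation regression, assuming MFCC inputs and 12 EMA channels; the paper's actual layer configuration and its fixed-weight scheme are not described in this summary, so all sizes here are assumptions.

```python
# Sketch: CNN front end + bidirectional LSTM mapping acoustic frames
# to EMA articulatory trajectories.
import torch
import torch.nn as nn

class BiLSTMCNN(nn.Module):
    def __init__(self, n_mfcc=13, n_ema=12, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(                 # local spectral context
            nn.Conv1d(n_mfcc, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU())
        self.lstm = nn.LSTM(64, hidden, batch_first=True,
                            bidirectional=True)    # long-range dynamics
        self.head = nn.Linear(2 * hidden, n_ema)   # x/y per sensor coil

    def forward(self, x):                          # x: (batch, time, n_mfcc)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)
        h, _ = self.lstm(h)
        return self.head(h)                        # (batch, time, n_ema)

model = BiLSTMCNN()
pred = model(torch.randn(4, 200, 13))              # toy batch of MFCC frames
```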
arXiv Detail & Related papers (2025-04-25T05:57:22Z)
- Fundamental Survey on Neuromorphic Based Audio Classification [0.5530212768657544]
This survey provides an exhaustive examination of the current state of the art in neuromorphic-based audio classification. It delves into the crucial components of neuromorphic systems, such as Spiking Neural Networks (SNNs), memristors, and neuromorphic hardware platforms. It examines how these approaches address the limitations of traditional audio classification methods, particularly in terms of energy efficiency, real-time processing, and robustness to environmental noise.
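As background for the SNN material the survey covers, here is a minimal leaky integrate-and-fire (LIF) layer, the basic spiking unit in most neuromorphic audio pipelines; the constants and the rate-coded front end are illustrative, not drawn from the survey.

```python
# Sketch: leaky integrate-and-fire dynamics over rate-coded audio features.
import numpy as np

def lif_layer(inputs, decay=0.9, threshold=1.0):
    """inputs: (time, n_neurons) input currents; returns binary spikes."""
    v = np.zeros(inputs.shape[1])
    spikes = np.zeros_like(inputs)
    for t, x in enumerate(inputs):
        v = decay * v + x                 # leaky integration
        fired = v >= threshold
        spikes[t] = fired
        v[fired] = 0.0                    # reset after a spike
    return spikes

# Usage: rate-code toy per-band energies into spike trains.
rates = np.random.rand(40)                # values in [0, 1]
inputs = (np.random.rand(100, 40) < rates).astype(float)
out = lif_layer(inputs)
print("mean firing rate:", out.mean())
```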
arXiv Detail & Related papers (2025-02-20T21:34:32Z)
- Reverse the auditory processing pathway: Coarse-to-fine audio reconstruction from fMRI [20.432212333539628]
We introduce a novel coarse-to-fine audio reconstruction method based on functional Magnetic Resonance Imaging (fMRI) data.
We validate our method on three public fMRI datasets: Brain2Sound, Brain2Music, and Brain2Speech.
By employing semantic prompts during decoding, we enhance the quality of reconstructed audio when semantic features are suboptimal.
arXiv Detail & Related papers (2024-05-29T03:16:14Z)
- BAT: Learning to Reason about Spatial Sounds with Large Language Models [45.757161909533714]
We present BAT, which combines the sound perception ability of a spatial scene analysis model with the natural language reasoning capabilities of a large language model (LLM). Our experiments demonstrate BAT's superior performance on both spatial sound perception and reasoning.
arXiv Detail & Related papers (2024-02-02T17:34:53Z)
- FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models [85.16273912625022]
We introduce FaceTalk, a novel generative approach designed for synthesizing high-fidelity 3D motion sequences of talking human heads from an audio signal.
To the best of our knowledge, this is the first work to propose a generative approach for realistic and high-quality motion synthesis of human heads.
arXiv Detail & Related papers (2023-12-13T19:01:07Z)
- MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training [74.32603591331718]
We propose an acoustic Music undERstanding model with large-scale self-supervised Training (MERT), which incorporates teacher models to provide pseudo labels for masked language modelling (MLM) style acoustic pre-training. Experimental results indicate that our model can generalise and perform well on 14 music understanding tasks, attaining state-of-the-art (SOTA) overall scores.
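A minimal sketch of the masked-prediction idea: hide a fraction of acoustic frames and train a student to predict teacher-provided pseudo labels at the masked positions. The dimensions, masking rate, and single-layer student below are placeholders, not MERT's actual components.

```python
# Sketch: MLM-style acoustic pretraining with teacher pseudo labels.
import torch
import torch.nn as nn

B, Tm, D, K = 4, 100, 256, 512             # batch, frames, dim, codebook
student = nn.Sequential(nn.Linear(D, D), nn.GELU(), nn.Linear(D, K))

feats = torch.randn(B, Tm, D)              # acoustic frame features
pseudo = torch.randint(0, K, (B, Tm))      # teacher-provided targets

mask = torch.rand(B, Tm) < 0.3             # mask ~30% of frames
x = feats.clone()
x[mask] = 0.0                              # simple mask embedding

logits = student(x)                        # (B, Tm, K)
# Loss is computed only at masked positions, as in MLM.
loss = nn.functional.cross_entropy(logits[mask], pseudo[mask])
loss.backward()
```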
arXiv Detail & Related papers (2023-05-31T18:27:43Z)
- Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in the human cortex.
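Analyses of this kind typically fit voxelwise encoding models, for example ridge regression from a model layer's activations to fMRI responses, scored on held-out stimuli. The sketch below uses random placeholder arrays; the paper's exact regression and cross-validation setup may differ.

```python
# Sketch: voxelwise encoding model from model activations to fMRI data.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X = np.random.randn(500, 768)    # toy model activations per stimulus
Y = np.random.randn(500, 1000)   # toy BOLD response per voxel

Xtr, Xte, Ytr, Yte = train_test_split(X, Y, test_size=0.2, random_state=0)
enc = Ridge(alpha=10.0).fit(Xtr, Ytr)
pred = enc.predict(Xte)

# Per-voxel prediction accuracy (Pearson r across held-out stimuli).
r = [np.corrcoef(pred[:, v], Yte[:, v])[0, 1] for v in range(Y.shape[1])]
print("median voxel r:", np.median(r))
```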
arXiv Detail & Related papers (2022-05-27T22:04:02Z)
- Learning Audio-Visual Dereverberation [87.52880019747435]
Reverberation from audio reflecting off surfaces and objects in the environment not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition.
Our idea is to learn to dereverberate speech from audio-visual observations.
We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed sounds and visual scene.
arXiv Detail & Related papers (2021-06-14T20:01:24Z)
- Joint Blind Room Acoustic Characterization From Speech And Music Signals Using Convolutional Recurrent Neural Networks [13.12834490248018]
Reverberation time, clarity, and direct-to-reverberant ratio are acoustic parameters that have been defined to describe reverberant environments.
Recent work combining audio analysis with machine learning suggests that these parameters can be estimated blindly from speech or music signals.
We propose a robust end-to-end method to achieve blind joint acoustic parameter estimation using speech and/or music signals.
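A minimal sketch of a convolutional recurrent network regressing reverberation time (T60), clarity (C50), and direct-to-reverberant ratio (DRR) from a spectrogram, in the spirit of the approach described; all layer sizes are illustrative assumptions.

```python
# Sketch: CRNN for blind acoustic parameter estimation from spectrograms.
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, n_params=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.gru = nn.GRU(input_size=32 * 16, hidden_size=64,
                          batch_first=True)
        self.head = nn.Linear(64, n_params)   # [T60, C50, DRR]

    def forward(self, spec):                  # spec: (batch, 1, 64, time)
        h = self.conv(spec)                   # (batch, 32, 16, time/4)
        b, c, f, t = h.shape
        h = h.permute(0, 3, 1, 2).reshape(b, t, c * f)
        _, hn = self.gru(h)                   # final hidden state
        return self.head(hn[-1])              # (batch, n_params)

model = CRNN()
est = model(torch.randn(2, 1, 64, 200))       # toy spectrogram batch
```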
arXiv Detail & Related papers (2020-10-21T17:41:21Z)