AMAuT: A Flexible and Efficient Multiview Audio Transformer Framework Trained from Scratch
- URL: http://arxiv.org/abs/2510.19368v1
- Date: Wed, 22 Oct 2025 08:41:59 GMT
- Title: AMAuT: A Flexible and Efficient Multiview Audio Transformer Framework Trained from Scratch
- Authors: Weichuang Shao, Iman Yi Liao, Tomas Henrique Bode Maul, Tissa Chandesa
- Abstract summary: This paper introduces the Augmentation-driven Multiview Audio Transformer (AMAuT). AMAuT eliminates the dependency on pre-trained weights while supporting arbitrary sample rates and audio lengths. Experiments on five public benchmarks, AudioMNIST, SpeechCommands V1 & V2, VocalSound, and CochlScene, show that AMAuT achieves accuracies up to 99.8%.
- Score: 0.3728263002609659
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent foundational models, SSAST, EAT, HuBERT, Qwen-Audio, and Audio Flamingo, achieve top-tier results across standard audio benchmarks but are limited by fixed input rates and durations, hindering their reusability. This paper introduces the Augmentation-driven Multiview Audio Transformer (AMAuT), a training-from-scratch framework that eliminates the dependency on pre-trained weights while supporting arbitrary sample rates and audio lengths. AMAuT integrates four key components: (1) augmentation-driven multiview learning for robustness, (2) a conv1 + conv7 + conv1 one-dimensional CNN bottleneck for stable temporal encoding, (3) dual CLS + TAL tokens for bidirectional context representation, and (4) test-time adaptation/augmentation (TTA^2) to improve inference reliability. Experiments on five public benchmarks, AudioMNIST, SpeechCommands V1 & V2, VocalSound, and CochlScene, show that AMAuT achieves accuracies up to 99.8% while consuming less than 3% of the GPU hours required by comparable pre-trained models. Thus, AMAuT presents a highly efficient and flexible alternative to large pre-trained models, making state-of-the-art audio classification accessible in computationally constrained settings.
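The conv1 + conv7 + conv1 one-dimensional bottleneck named in component (2) can be illustrated with a minimal sketch. The single channel, averaging kernels, and padding of 3 on the middle convolution are assumptions chosen for illustration, not parameters taken from the paper; the point being demonstrated is that such a stack preserves temporal length, which is what lets inputs of arbitrary duration or sample rate pass through to the transformer with their time axis intact.

```python
def conv1d(x, kernel_size, padding=0):
    # Toy single-channel 1D convolution with a uniform averaging kernel.
    # Output length = len(x) + 2*padding - kernel_size + 1.
    padded = [0.0] * padding + x + [0.0] * padding
    out_len = len(padded) - kernel_size + 1
    return [sum(padded[i:i + kernel_size]) / kernel_size
            for i in range(out_len)]

def bottleneck(x):
    # conv1 -> conv7 -> conv1: only the middle convolution has a
    # receptive field wider than one frame, so padding=3 keeps the
    # temporal length unchanged end to end.
    h = conv1d(x, kernel_size=1)
    h = conv1d(h, kernel_size=7, padding=3)
    return conv1d(h, kernel_size=1)

# Inputs of any length come out with the same length, so waveforms of
# arbitrary sample rate or duration survive this encoding stage.
for n in (100, 441, 1600):
    assert len(bottleneck([0.0] * n)) == n
```

In a real model each stage would be a learned multi-channel convolution (e.g. with nonlinearities and normalization between them), but the length-preservation argument is identical: only kernel sizes and padding determine the output's time dimension, not the input's duration.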
Related papers
- Representation-Regularized Convolutional Audio Transformer for Audio Understanding [53.092757178419355]
Bootstrapping representations from scratch is computationally expensive, often requiring extensive training to converge. We propose the Convolutional Audio Transformer (CAT), a unified framework designed to address these challenges.
arXiv Detail & Related papers (2026-01-29T12:16:19Z) - Omni2Sound: Towards Unified Video-Text-to-Audio Generation [56.11583645408007]
Training a unified model integrating video-to-audio (V2A), text-to-audio (T2A), and joint video-text-to-audio (VT2A) generation offers significant application flexibility. SoundAtlas is a large-scale dataset (470k pairs) that significantly outperforms existing benchmarks and even human experts in quality. We propose Omni2Sound, a unified VT2A diffusion model supporting flexible input modalities.
arXiv Detail & Related papers (2026-01-06T05:49:41Z) - PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation [57.864929968616586]
Video-to-Audio (V2A) generation requires balancing four critical perceptual dimensions. We introduce PrismAudio, the first framework to integrate Reinforcement Learning into V2A generation with specialized Chain-of-Thought (CoT) planning.
arXiv Detail & Related papers (2025-11-24T07:11:12Z) - Competitive Audio-Language Models with Data-Efficient Single-Stage Training on Public Data [4.736913024290765]
Falcon3-Audio-7B matches the best reported performance among open-weight models on the MMAU benchmark. Our smallest 1B model remains competitive with larger open models ranging from 2B to 13B parameters.
arXiv Detail & Related papers (2025-09-09T09:01:01Z) - VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model [84.25283710008785]
VITA-Audio is an end-to-end large speech model with fast audio-text token generation. Its MCTP module efficiently generates multiple audio tokens within a single model forward pass. A four-stage progressive training strategy is explored to achieve model acceleration with minimal loss of speech quality.
arXiv Detail & Related papers (2025-05-06T17:59:53Z) - ElasticAST: An Audio Spectrogram Transformer for All Length and Resolutions [15.472819870523093]
Transformer-based models, such as the Audio Spectrogram Transformers (AST), inherit the fixed-size input paradigm from CNNs.
This paper introduces an approach that enables the use of variable-length audio inputs with AST models during both training and inference.
arXiv Detail & Related papers (2024-07-11T17:29:56Z) - Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching [51.70360630470263]
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video. We propose Frieren, a V2A model based on rectified flow matching. Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment.
arXiv Detail & Related papers (2024-06-01T06:40:22Z) - Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z) - Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using both audio and visual modalities allows better recognition of speech in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z) - Joint Audio/Text Training for Transformer Rescorer of Streaming Speech Recognition [13.542483062256109]
We present our Joint Audio/Text training method for Transformer Rescorer.
Our training method can improve word error rate (WER) significantly compared to standard Transformer Rescorer.
arXiv Detail & Related papers (2022-10-31T22:38:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.