Thinking While Listening: Simple Test Time Scaling For Audio Classification
- URL: http://arxiv.org/abs/2509.19676v1
- Date: Wed, 24 Sep 2025 01:17:24 GMT
- Title: Thinking While Listening: Simple Test Time Scaling For Audio Classification
- Authors: Prateek Verma, Mert Pilanci,
- Abstract summary: We propose a framework that enables neural models to "think while listening" to everyday sounds, thereby enhancing audio classification performance. Motivated by recent advances in the reasoning capabilities of large language models, we address two central questions: (i) how can thinking be incorporated into existing audio classification pipelines to enable reasoning in the category space and improve performance, and (ii) can a new architecture be designed from the ground up to support both thinking and test-time scaling?
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a framework that enables neural models to "think while listening" to everyday sounds, thereby enhancing audio classification performance. Motivated by recent advances in the reasoning capabilities of large language models, we address two central questions: (i) how can thinking be incorporated into existing audio classification pipelines to enable reasoning in the category space and improve performance, and (ii) can a new architecture be designed from the ground up to support both thinking and test-time scaling? We demonstrate that in both settings, our models exhibit improved classification accuracy. Leveraging test-time scaling, we observe consistent gains as the number of sampled traces increases. Furthermore, we evaluate two open-source reasoning models, GPT-OSS-20B and Qwen3-14B, showing that while such models are capable of zero-shot reasoning, a lightweight approach--retraining only the embedding matrix of a frozen, smaller model like GPT-2--can surpass the performance of billion-parameter text-based reasoning models.
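The abstract's test-time scaling (consistent gains as the number of sampled traces increases) can be illustrated with a minimal majority-vote sketch. The paper's actual trace-sampling and aggregation procedure is not specified here, so the function names (`sample_trace`, `classify_with_scaling`) and the temperature and vote-count settings below are illustrative assumptions, not the authors' implementation:

```python
import math
import random
from collections import Counter

def sample_trace(logits, temperature=1.0, rng=random):
    """Sample one class index from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # numerically stable softmax
    total = sum(exps)
    r = rng.random()
    acc = 0.0
    for i, e in enumerate(exps):
        acc += e / total
        if r <= acc:
            return i
    return len(exps) - 1  # guard against floating-point rounding

def classify_with_scaling(logits, num_traces=16, temperature=0.7, seed=0):
    """Majority vote over num_traces independently sampled predictions."""
    rng = random.Random(seed)
    votes = Counter(sample_trace(logits, temperature, rng)
                    for _ in range(num_traces))
    return votes.most_common(1)[0][0]
```

Increasing `num_traces` makes the vote increasingly likely to recover the mode of the model's predictive distribution, which is one simple way the observed scaling behavior can arise.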
Related papers
- ARISE: An Adaptive Resolution-Aware Metric for Test-Time Scaling Evaluation in Large Reasoning Models [102.4511331368587]
ARISE (Adaptive Resolution-aware Scaling Evaluation) is a novel metric designed to assess the test-time scaling effectiveness of large reasoning models. We conduct comprehensive experiments evaluating state-of-the-art reasoning models across diverse domains.
arXiv Detail & Related papers (2025-10-07T15:10:51Z)
- TTS-VAR: A Test-Time Scaling Framework for Visual Auto-Regressive Generation [23.21984538472322]
We present the first general test-time scaling framework for visual auto-regressive (VAR) models. We propose clustering-based diversity search and resampling-based potential selection. Experiments on the powerful VAR model Infinity show a notable 8.7% GenEval score improvement.
arXiv Detail & Related papers (2025-07-24T16:04:55Z)
- Incentivizing LLMs to Self-Verify Their Answers [22.387551134333084]
We propose a framework that incentivizes Large Language Models to self-verify their own answers. We train our self-verification models based on Qwen2.5-Math-7B and DeepSeek-R1-Distill-Qwen-1.5B. Experiments on multiple mathematical reasoning benchmarks show that our models can not only improve post-training performance but also enable effective test-time scaling.
arXiv Detail & Related papers (2025-06-02T06:54:29Z)
- AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension [95.8442896569132]
We introduce AIR-Bench, the first benchmark to evaluate the ability of Large Audio-Language Models (LALMs) to understand various types of audio signals and interact with humans in the textual format.
Results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation.
arXiv Detail & Related papers (2024-02-12T15:41:22Z)
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
- Bayesian Prompt Learning for Image-Language Model Generalization [64.50204877434878]
We use the regularization ability of Bayesian methods to frame prompt learning as a variational inference problem.
Our approach regularizes the prompt space, reduces overfitting to the seen prompts and improves the prompt generalization on unseen prompts.
We demonstrate empirically on 15 benchmarks that Bayesian prompt learning provides an appropriate coverage of the prompt space.
arXiv Detail & Related papers (2022-10-05T17:05:56Z)
- Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
We show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z)
- Layer-wise Analysis of a Self-supervised Speech Representation Model [26.727775920272205]
Self-supervised learning approaches have been successful for pre-training speech representation models.
Little has been studied, however, about the type or extent of information encoded in the pre-trained representations themselves.
arXiv Detail & Related papers (2021-07-10T02:13:25Z)
- Neural PLDA Modeling for End-to-End Speaker Verification [40.842070706362534]
We propose a neural network approach for backend modeling in speaker verification called the neural PLDA (NPLDA).
In this paper, we extend this work to achieve joint optimization of the embedding neural network (x-vector network) with the NPLDA network in an end-to-end fashion.
We show that the proposed E2E model improves significantly over the x-vector PLDA baseline speaker verification system.
arXiv Detail & Related papers (2020-08-11T05:54:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.