DRASP: A Dual-Resolution Attentive Statistics Pooling Framework for Automatic MOS Prediction
- URL: http://arxiv.org/abs/2508.21407v1
- Date: Fri, 29 Aug 2025 08:27:17 GMT
- Title: DRASP: A Dual-Resolution Attentive Statistics Pooling Framework for Automatic MOS Prediction
- Authors: Cheng-Yeh Yang, Kuan-Tang Huang, Chien-Chun Wang, Hung-Shin Lee, Hsin-Min Wang, Berlin Chen
- Abstract summary: We introduce the Dual-Resolution Attentive Statistics Pooling (DRASP) framework. DRASP integrates both coarse-grained, global statistical summaries and fine-grained, attentive analyses of perceptually significant segments. It consistently outperforms various baseline methods across diverse datasets.
- Score: 21.20778568616635
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A pooling mechanism is essential for mean opinion score (MOS) prediction, facilitating the transformation of variable-length audio features into a concise fixed-size representation that effectively encodes speech quality. Existing pooling methods typically operate at a singular granularity, concentrating either on a comprehensive global perspective or a detailed frame-level analysis, which may overlook complementary perceptual insights. To address this limitation, we introduce the Dual-Resolution Attentive Statistics Pooling (DRASP) framework. DRASP integrates both coarse-grained, global statistical summaries and fine-grained, attentive analyses of perceptually significant segments. This dual-view architecture empowers our model to formulate a more thorough and robust representation, capturing both the overarching structural context and salient local details concurrently. Extensive experiments validate the effectiveness and strong generalization ability of the proposed framework. It consistently outperforms various baseline methods across diverse datasets (MusicEval and AES-Natural), MOS prediction backbones (including a CLAP-based model and AudioBox-Aesthetics), and different audio generation systems, achieving a relative improvement of 10.39% in system-level Spearman's rank correlation coefficient (SRCC) over the widely-used average pooling approach.
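The dual-resolution idea described in the abstract can be sketched in a few lines: a coarse view takes global mean and standard deviation over all frames, while a fine view computes attention-weighted statistics that emphasize salient segments, and the two views are fused into one fixed-size vector. The following is a minimal PyTorch illustration under assumed design choices; the attention parameterization, the particular statistics, and concatenation as the fusion step are assumptions for illustration, not the authors' exact implementation:

```python
import torch
import torch.nn as nn

class DualResolutionPooling(nn.Module):
    """Illustrative dual-view pooling: global statistics plus attentive statistics."""

    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        # Frame-level scorer for the fine-grained, attentive view (assumed form).
        self.attn = nn.Sequential(
            nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) variable-length frame features.
        # Coarse view: global mean and standard deviation over time.
        g_mean, g_std = x.mean(dim=1), x.std(dim=1)
        # Fine view: attention weights emphasize perceptually salient frames.
        w = torch.softmax(self.attn(x), dim=1)          # (batch, time, 1)
        a_mean = (w * x).sum(dim=1)
        a_var = (w * (x - a_mean.unsqueeze(1)) ** 2).sum(dim=1)
        a_std = a_var.clamp(min=1e-8).sqrt()
        # Fuse both resolutions into one fixed-size utterance vector.
        return torch.cat([g_mean, g_std, a_mean, a_std], dim=-1)

pool = DualResolutionPooling(dim=64)
utt = pool(torch.randn(2, 100, 64))  # fixed-size output of dimension 4 * dim
```

The concatenated output (4 × the frame-feature dimension) would then feed a small regression head that predicts the MOS value.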
Related papers
- StepVAR: Structure-Texture Guided Pruning for Visual Autoregressive Models [98.72926158261937]
We propose a training-free token pruning framework for Visual AutoRegressive models. We employ a lightweight high-pass filter to capture local texture details, while leveraging Principal Component Analysis (PCA) to preserve global structural information. To maintain valid next-scale prediction under sparse tokens, we introduce a nearest neighbor feature propagation strategy.
arXiv Detail & Related papers (2026-03-02T11:35:05Z) - Multi-source Heterogeneous Public Opinion Analysis via Collaborative Reasoning and Adaptive Fusion: A Systematically Integrated Approach [4.28787537081191]
This paper introduces a novel Collaborative Reasoning and Adaptive Fusion (CRAF) framework. CRAF integrates traditional feature-based methods with large language models (LLMs) through a structured multi-stage reasoning mechanism. The framework exhibits strong cross-platform adaptability, reducing the labeled data requirement for new platforms by 75%.
arXiv Detail & Related papers (2026-01-25T11:07:31Z) - Unlocking the Essence of Beauty: Advanced Aesthetic Reasoning with Relative-Absolute Policy Optimization [63.169050703903515]
We propose Aes-R1, a comprehensive aesthetic reasoning framework with reinforcement learning (RL). Aes-R1 integrates a pipeline, AesCoT, to construct and filter high-quality chain-of-thought aesthetic reasoning data. Experiments demonstrate that Aes-R1 improves the backbone's average PLCC/SRCC by 47.9%/34.8%.
arXiv Detail & Related papers (2025-09-26T04:55:00Z) - MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources [113.33902847941941]
Variance-Aware Sampling (VAS) is a data selection strategy guided by Variance Promotion Score (VPS). We release large-scale, carefully curated resources containing 1.6M long CoT cold-start data and 15k RL QA pairs. Experiments across mathematical reasoning benchmarks demonstrate the effectiveness of both the curated data and the proposed VAS.
arXiv Detail & Related papers (2025-09-25T14:58:29Z) - BTW: A Non-Parametric Variance Stabilization Framework for Multimodal Model Integration [20.600001069987318]
We propose Beyond Two-modality Weighting (BTW) to dynamically adjust modality importance during training. BTW computes per-example KL weights by measuring the divergence between each unimodal and the current multimodal prediction. Our method significantly improves regression performance and multiclass classification accuracy.
arXiv Detail & Related papers (2025-08-25T23:00:38Z) - Foundation Models for Demand Forecasting via Dual-Strategy Ensembling [11.926658499983446]
We propose a unified ensemble framework that enhances the performance of foundation models for sales forecasting in real-world supply chains. Our method combines two complementary strategies: (1) Hierarchical Ensemble (HE), which partitions training and inference by semantic levels to capture localized patterns; and (2) Architectural Ensemble (AE), which integrates predictions from diverse model backbones to mitigate bias and improve stability.
arXiv Detail & Related papers (2025-07-29T17:56:38Z) - Iterative Augmentation with Summarization Refinement (IASR) Evaluation for Unstructured Survey data Modeling and Analysis [0.43988112145759295]
This work introduces a principled evaluation framework for large language model (LLM) based text augmentation. Empirical evaluations show that GPT-3.5 Turbo achieved the best balance of semantic fidelity, diversity, and generation efficiency.
arXiv Detail & Related papers (2025-07-16T10:49:30Z) - CRIA: A Cross-View Interaction and Instance-Adapted Pre-training Framework for Generalizable EEG Representations [52.251569042852815]
CRIA is an adaptive framework that utilizes variable-length and variable-channel coding to achieve a unified representation of EEG data across different datasets. The model employs a cross-attention mechanism to fuse temporal, spectral, and spatial features effectively. Experimental results on the Temple University EEG corpus and the CHB-MIT dataset show that CRIA outperforms existing methods with the same pre-training conditions.
arXiv Detail & Related papers (2025-06-19T06:31:08Z) - Denoising and Alignment: Rethinking Domain Generalization for Multimodal Face Anti-Spoofing [47.24147617685829]
Face Anti-Spoofing (FAS) is essential for the security of facial recognition systems in diverse scenarios. We introduce the Multimodal Denoising and Alignment (MMDA) framework. By leveraging the zero-shot generalization capability of CLIP, the MMDA framework effectively suppresses noise in multimodal data.
arXiv Detail & Related papers (2025-05-14T15:36:44Z) - $C^2$AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction [80.57232374640911]
We propose a model-agnostic strategy called Mask-And-Recover (MAR). MAR integrates both inter- and intra-modality contextual correlations to enable global inference within extraction modules. To better target challenging parts within each sample, we introduce a Fine-grained Confidence Score (FCS) model.
arXiv Detail & Related papers (2025-04-01T13:01:30Z) - Unified Enhancement of the Generalization and Robustness of Language Models via Bi-Stage Optimization [2.502393972789905]
We propose a bi-stage optimization framework to uniformly enhance both the generalization and robustness of LMs. We show that our method significantly improves the generalization and robustness of LMs compared to other existing methods.
arXiv Detail & Related papers (2025-03-19T13:50:36Z) - Distributional Vision-Language Alignment by Cauchy-Schwarz Divergence [83.15764564701706]
We propose a novel framework that performs vision-language alignment by integrating Cauchy-Schwarz divergence with mutual information. We find that the CS divergence seamlessly addresses the InfoNCE alignment-uniformity conflict and serves complementary roles with InfoNCE. Experiments on text-to-image generation and cross-modality retrieval tasks demonstrate the effectiveness of our method on vision-language alignment.
arXiv Detail & Related papers (2025-02-24T10:29:15Z) - A Simple and Generalist Approach for Panoptic Segmentation [57.94892855772925]
We propose a simple generalist framework based on a deep encoder and shallow decoder architecture with per-pixel prediction. We show that performance gaps are due to imbalance during training and propose a novel method for reducing it. Our method achieves panoptic quality (PQ) of 55.1 on the challenging MS-COCO dataset.
arXiv Detail & Related papers (2024-08-29T13:02:12Z) - DualKanbaFormer: An Efficient Selective Sparse Framework for Multimodal Aspect-based Sentiment Analysis [0.6187939267100836]
We introduce DualKanbaFormer, a novel framework that leverages parallel Textual and Visual KanbaFormer modules for robust multimodal analysis. Our approach incorporates Aspect-Driven Sparse Attention (ADSA) to balance coarse-grained aggregation and fine-grained selection for aspect-focused precision. We replace traditional feed-forward networks and normalization with Kolmogorov-Arnold Networks (KANs) and Dynamic Tanh (DyT) to enhance non-linear expressivity and inference stability.
arXiv Detail & Related papers (2024-08-27T19:33:15Z) - USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text
Retrieval [115.28586222748478]
Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality.
Existing approaches typically suffer from two major limitations.
arXiv Detail & Related papers (2023-01-17T12:42:58Z) - DRFLM: Distributionally Robust Federated Learning with Inter-client Noise via Local Mixup [58.894901088797376]
Federated learning has emerged as a promising approach for training a global model using data from multiple organizations without leaking their raw data.
We propose a general framework to solve the above two challenges simultaneously.
We provide comprehensive theoretical analysis including robustness analysis, convergence analysis, and generalization ability.
arXiv Detail & Related papers (2022-04-16T08:08:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.