Mind the Gap! Static and Interactive Evaluations of Large Audio Models
- URL: http://arxiv.org/abs/2502.15919v1
- Date: Fri, 21 Feb 2025 20:29:02 GMT
- Title: Mind the Gap! Static and Interactive Evaluations of Large Audio Models
- Authors: Minzhi Li, William Barr Held, Michael J Ryan, Kunat Pipatanakul, Potsawee Manakul, Hao Zhu, Diyi Yang
- Abstract summary: Large Audio Models (LAMs) are designed to power voice-native experiences. This study introduces an interactive approach to evaluating LAMs and collects 7,500 LAM interactions from 484 participants.
- Score: 55.87220295533817
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As AI chatbots become ubiquitous, voice interaction presents a compelling way to enable rapid, high-bandwidth communication for both semantic and social signals. This has driven research into Large Audio Models (LAMs) to power voice-native experiences. However, aligning LAM development with user goals requires a clear understanding of user needs and preferences to establish reliable progress metrics. This study addresses these challenges by introducing an interactive approach to evaluate LAMs and collecting 7,500 LAM interactions from 484 participants. Through topic modeling of user queries, we identify primary use cases for audio interfaces. We then analyze user preference rankings and qualitative feedback to determine which models best align with user needs. Finally, we evaluate how static benchmarks predict interactive performance: our analysis reveals that no individual benchmark correlates strongly with interactive results ($\tau \leq 0.33$ for all benchmarks). While combining multiple coarse-grained features yields modest predictive power ($R^2 = 0.30$), only two out of twenty datasets, on spoken question answering and age prediction, show significantly positive correlations. This suggests a clear need to develop LAM evaluations that better correlate with user preferences.
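To make the headline statistics concrete, here is a minimal, hypothetical sketch of this style of analysis in Python. The scores and the second feature are placeholder values, not the paper's data, and the paper's actual feature set and ranking procedure are not reproduced here.

```python
import numpy as np
from scipy.stats import kendalltau
from sklearn.linear_model import LinearRegression

# Hypothetical per-model scores: one static benchmark vs. the same models'
# interactive preference scores (placeholder values, not the paper's data).
static_scores = np.array([0.61, 0.55, 0.72, 0.48, 0.66])
interactive_scores = np.array([0.20, 0.35, 0.90, 0.15, 0.40])

# Rank correlation between a single benchmark and interactive results;
# the paper finds tau <= 0.33 for every individual benchmark.
tau, p_value = kendalltau(static_scores, interactive_scores)
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.2f})")

# Combining several coarse-grained features in a linear model yields only
# modest predictive power (the paper reports R^2 = 0.30).
rng = np.random.default_rng(0)
features = np.column_stack([static_scores, rng.random(5)])  # placeholder 2nd feature
fit = LinearRegression().fit(features, interactive_scores)
print(f"R^2 = {fit.score(features, interactive_scores):.2f}")
```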
Related papers
- Multi-granularity Interest Retrieval and Refinement Network for Long-Term User Behavior Modeling in CTR Prediction [68.90783662117936]
Click-through Rate (CTR) prediction is crucial for online personalization platforms. Recent advancements have shown that modeling rich user behaviors can significantly improve the performance of CTR prediction. We propose the Multi-granularity Interest Retrieval and Refinement Network (MIRRN).
arXiv Detail & Related papers (2024-11-22T15:29:05Z)
- Speechworthy Instruction-tuned Language Models [71.8586707840169]
We show that both prompting and preference learning increase the speech-suitability of popular instruction-tuned LLMs.
We share lexical, syntactical, and qualitative analyses to showcase how each method contributes to improving the speech-suitability of generated responses.
arXiv Detail & Related papers (2024-09-23T02:34:42Z)
- Automated Speaking Assessment of Conversation Tests with Novel Graph-based Modeling on Spoken Response Coherence [11.217656140423207]
ASAC aims to evaluate the overall speaking proficiency of an L2 speaker in a setting where an interlocutor interacts with one or more candidates. We propose a hierarchical graph model that aptly incorporates both broad inter-response interactions and nuanced semantic information. Extensive experimental results on the NICT-JLE benchmark dataset suggest that our proposed modeling approach can yield considerable improvements in prediction accuracy.
arXiv Detail & Related papers (2024-09-11T07:24:07Z)
- AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension [95.8442896569132]
We introduce AIR-Bench, the first benchmark to evaluate the ability of Large Audio-Language Models (LALMs) to understand various types of audio signals and interact with humans in textual format.
Results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation.
arXiv Detail & Related papers (2024-02-12T15:41:22Z)
- TETRIS: Towards Exploring the Robustness of Interactive Segmentation [39.1981941213761]
We propose a methodology for finding extreme user inputs via direct optimization, in a white-box adversarial attack on the interactive segmentation model (see the sketch below).
We report the results of an extensive evaluation of dozens of models.
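As a rough, generic illustration of white-box input optimization (not necessarily the paper's exact procedure, which targets user inputs to a segmentation model), the following PyTorch sketch directly optimizes a continuous input to degrade a model's prediction; `model`, `target_mask`, and the loss choice are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def find_extreme_input(model, x, target_mask, steps=50, lr=1e-2):
    """Gradient-ascent search for an input that maximizes segmentation loss."""
    x_adv = x.clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([x_adv], lr=lr)
    for _ in range(steps):
        logits = model(x_adv)  # white-box access: gradients flow through the model
        # Negate the loss so that minimizing it performs gradient ascent.
        loss = -F.binary_cross_entropy_with_logits(logits, target_mask)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return x_adv.detach()
```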
arXiv Detail & Related papers (2024-02-09T01:36:21Z)
- Temporal Interest Network for User Response Prediction [33.95334897033641]
We propose a Temporal Interest Network (TIN) to simultaneously capture the semantic-temporal correlation between behaviors and the target.
TIN has been successfully deployed in production since October 2023, serving the WeChat Moments traffic.
arXiv Detail & Related papers (2023-08-15T05:48:44Z)
- Improving Open-Domain Dialogue Evaluation with a Causal Inference Model [8.625569782672663]
Explicit satisfaction ratings can be elicited from users, but users often do not provide ratings when asked.
Post-hoc ratings by experts are an alternative, but these can be both expensive and complex to collect.
Here, we explore the creation of automated methods for predicting both expert and user ratings of open-domain dialogues.
arXiv Detail & Related papers (2023-01-31T02:31:42Z)
- Perceptual Score: What Data Modalities Does Your Model Perceive? [73.75255606437808]
We introduce the perceptual score, a metric that assesses the degree to which a model relies on the different subsets of the input features.
We find that recent, more accurate multi-modal models for visual question-answering tend to perceive the visual data less than their predecessors.
Using the perceptual score also helps to analyze model biases by decomposing the score into data subset contributions.
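This summary does not reproduce the metric's exact definition; a common way to quantify modality reliance, sketched below, is the accuracy drop when one modality's features are permuted across the batch. The function and data names here are hypothetical, not the paper's API.

```python
import numpy as np

def modality_reliance(predict, image_feats, text_feats, labels, seed=0):
    """Accuracy drop when one modality (here, images) is shuffled across examples."""
    rng = np.random.default_rng(seed)
    base_acc = np.mean(predict(image_feats, text_feats) == labels)
    shuffled = image_feats[rng.permutation(len(image_feats))]
    perm_acc = np.mean(predict(shuffled, text_feats) == labels)
    return base_acc - perm_acc  # larger drop => heavier reliance on the visual input
```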
arXiv Detail & Related papers (2021-10-27T12:19:56Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- Large-scale Hybrid Approach for Predicting User Satisfaction with Conversational Agents [28.668681892786264]
Measuring user satisfaction is a challenging task and a critical component in developing large-scale conversational agent systems.
Approaches based on human annotation are easier to control but hard to scale.
A novel alternative is to collect users' direct feedback via a feedback elicitation system embedded in the conversational agent.
arXiv Detail & Related papers (2020-05-29T16:29:09Z)