WESR: Scaling and Evaluating Word-level Event-Speech Recognition
- URL: http://arxiv.org/abs/2601.04508v1
- Date: Thu, 08 Jan 2026 02:23:21 GMT
- Title: WESR: Scaling and Evaluating Word-level Event-Speech Recognition
- Authors: Chenchen Yang, Kexin Huang, Liwei Fan, Qian Tu, Botian Jiang, Dong Zhang, Linqi Yin, Shimin Li, Zhaoye Fei, Qinyuan Cheng, Xipeng Qiu
- Abstract summary: Speech conveys not only linguistic information but also rich non-verbal vocal events such as laughing and crying. We develop a refined taxonomy of 21 vocal events, with a new categorization into discrete (standalone) versus continuous (mixed with speech) types. Based on the refined taxonomy, we introduce WESR-Bench, an expert-annotated evaluation set (900+ utterances) with a novel position-aware protocol.
- Score: 59.21814194620928
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech conveys not only linguistic information but also rich non-verbal vocal events such as laughing and crying. While semantic transcription is well-studied, the precise localization of non-verbal events remains a critical yet under-explored challenge. Current methods suffer from insufficient task definitions with limited category coverage and ambiguous temporal granularity. They also lack standardized evaluation frameworks, hindering the development of downstream applications. To bridge this gap, we first develop a refined taxonomy of 21 vocal events, with a new categorization into discrete (standalone) versus continuous (mixed with speech) types. Based on the refined taxonomy, we introduce WESR-Bench, an expert-annotated evaluation set (900+ utterances) with a novel position-aware protocol that disentangles ASR errors from event detection, enabling precise localization measurement for both discrete and continuous events. We also build a strong baseline by constructing a 1,700+ hour corpus and training specialized models that surpass both open-source audio-language models and commercial APIs while preserving ASR quality. We anticipate that WESR will serve as a foundational resource for future research in modeling rich, real-world auditory scenes.
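The position-aware protocol is only summarized above; the sketch below illustrates the core idea under stated assumptions. It assumes events appear as inline tags such as `<laugh>` in both reference and hypothesis transcripts (the tag format, position tolerance, and scoring details are illustrative, not WESR-Bench's actual specification): strip the tags to compute WER on words alone, then match events by type and word position.

```python
import re

TAG = re.compile(r"<(\w+)>")  # hypothetical inline event tag, e.g. "<laugh>"

def split_stream(transcript):
    """Separate a tagged transcript into plain words and a list of
    (event_type, word_position) pairs."""
    words, events = [], []
    for tok in transcript.split():
        m = TAG.fullmatch(tok)
        if m:
            events.append((m.group(1), len(words)))
        else:
            words.append(tok)
    return words, events

def edit_distance(a, b):
    """Word-level Levenshtein distance (for WER on the text stream only)."""
    row = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, row[0] = row[0], i
        for j, y in enumerate(b, 1):
            prev, row[j] = row[j], min(row[j] + 1, row[j - 1] + 1, prev + (x != y))
    return row[-1]

def position_aware_score(ref, hyp, tol=1):
    """Score ASR and event localization separately: WER ignores event tags,
    and a predicted event counts only if its type matches a reference event
    whose word position is within `tol`."""
    ref_words, ref_events = split_stream(ref)
    hyp_words, hyp_events = split_stream(hyp)
    wer = edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)
    matched, tp = set(), 0
    for etype, pos in hyp_events:
        for k, (rtype, rpos) in enumerate(ref_events):
            if k not in matched and rtype == etype and abs(rpos - pos) <= tol:
                matched.add(k)
                tp += 1
                break
    precision = tp / max(len(hyp_events), 1)
    recall = tp / max(len(ref_events), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return wer, f1

print(position_aware_score("we won <laugh> that was close",
                           "we one <laugh> that was close"))  # ASR error, event correct
```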
Related papers
- PROFASR-BENCH: A Benchmark for Context-Conditioned ASR in High-Stakes Professional Speech [0.0]
ProfASR-Bench is a professional-speech evaluation suite for high-stakes applications across finance, medicine, law, and technology. Each example pairs a natural-language prompt with an entity-rich target utterance, enabling controlled measurement of context-conditioned recognition.
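As a rough picture of what one benchmark item might look like, here is a minimal sketch; the field names and the entity-recall probe are assumptions, not ProfASR-Bench's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class ProfASRExample:
    """One item as described in the summary: a natural-language context
    prompt paired with an entity-rich target utterance. Illustrative only."""
    prompt: str      # context given to the ASR system, e.g. a meeting agenda
    reference: str   # ground-truth transcript of the utterance
    entities: list = field(default_factory=list)  # domain terms to recognize

def entity_recall(example, hypothesis):
    """Fraction of annotated entities recovered in the ASR hypothesis;
    comparing runs with and without the prompt isolates the context effect."""
    hyp = hypothesis.lower()
    hits = sum(e.lower() in hyp for e in example.entities)
    return hits / max(len(example.entities), 1)
```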
arXiv Detail & Related papers (2025-12-29T18:43:23Z)
- Detect Any Sound: Open-Vocabulary Sound Event Detection with Multi-Modal Queries [18.147981850263708]
We propose a query-based framework for open-vocabulary SED guided by multi-modal queries. DASM formulates SED as a frame-level retrieval task, where audio features are matched against query vectors from text or audio prompts. DASM effectively balances localization accuracy with generalization to novel classes, outperforming CLAP-based methods in the open-vocabulary setting.
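A minimal sketch of the frame-level retrieval formulation described above, assuming per-frame audio embeddings and query embeddings from unspecified pretrained encoders (the threshold and shapes are illustrative):

```python
import numpy as np

def detect_events(frame_emb, query_emb, threshold=0.5):
    """Score every audio frame against every query vector (from a text or
    audio prompt) by cosine similarity; frames above the threshold count as
    activations of that query's event class.
    Shapes: frame_emb [T, D], query_emb [Q, D]."""
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    sim = f @ q.T                 # [T, Q] cosine similarities
    return sim > threshold        # boolean frame-level activations

# toy usage: 100 frames, 2 queries, 16-dim embeddings
activations = detect_events(np.random.randn(100, 16), np.random.randn(2, 16))
print(activations.shape)  # (100, 2)
```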
arXiv Detail & Related papers (2025-07-22T08:24:01Z)
- CLAIR-A: Leveraging Large Language Models to Judge Audio Captions [73.51087998971418]
Evaluating machine-generated audio captions is a complex task that requires considering diverse factors. We propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models. In our evaluations, CLAIR-A better predicts human judgements of quality compared to traditional metrics.
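A hedged sketch of the LLM-as-judge recipe the summary describes; the prompt template, JSON reply format, and 0-100 scale are assumptions rather than CLAIR-A's published template, and `llm` stands for any text-completion callable:

```python
import json

PROMPT = """You are evaluating an audio caption.
Reference: "{ref}"
Candidate: "{cand}"
Reply with JSON: {{"score": <0-100>, "reason": "<one sentence>"}}"""

def llm_judge_score(ref, cand, llm):
    """Zero-shot scoring in the spirit of the summary above: ask the model
    for a numeric quality score plus a justification, then parse the reply."""
    reply = llm(PROMPT.format(ref=ref, cand=cand))
    result = json.loads(reply)
    return result["score"] / 100.0, result["reason"]
```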
arXiv Detail & Related papers (2024-09-19T17:59:52Z)
- Contrastive and Consistency Learning for Neural Noisy-Channel Model in Spoken Language Understanding [1.07288078404291]
We propose a natural language understanding approach based on Automatic Speech Recognition (ASR).
We improve a noisy-channel model to handle transcription inconsistencies caused by ASR errors.
Experiments on four benchmark datasets show that Contrastive and Consistency Learning (CCL) outperforms existing methods.
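One plausible reading of "contrastive and consistency learning" over paired clean/ASR transcripts, sketched below with PyTorch; the loss composition, temperature, and equal weighting are assumptions, not the paper's reported configuration:

```python
import torch
import torch.nn.functional as F

def ccl_loss(clean_emb, asr_emb, clean_logits, asr_logits, tau=0.07):
    """Pull embeddings of a clean transcript and its ASR transcript together
    (InfoNCE over the batch, matched pairs on the diagonal) while keeping the
    classifier's predictions consistent across the two views (symmetric KL)."""
    c = F.normalize(clean_emb, dim=1)
    a = F.normalize(asr_emb, dim=1)
    logits = c @ a.t() / tau                      # [B, B] similarity matrix
    targets = torch.arange(c.size(0))             # i-th clean matches i-th ASR
    contrastive = F.cross_entropy(logits, targets)
    p = F.log_softmax(clean_logits, dim=1)
    q = F.log_softmax(asr_logits, dim=1)
    consistency = 0.5 * (F.kl_div(q, p, log_target=True, reduction="batchmean")
                         + F.kl_div(p, q, log_target=True, reduction="batchmean"))
    return contrastive + consistency
```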
arXiv Detail & Related papers (2024-05-23T23:10:23Z)
- Double Mixture: Towards Continual Event Detection from Speech [60.33088725100812]
Speech event detection is crucial for multimedia retrieval, involving the tagging of both semantic and acoustic events.
This paper tackles two primary challenges in speech event detection: the continual integration of new events without forgetting previous ones, and the disentanglement of semantic from acoustic events.
We propose a novel method, 'Double Mixture,' which merges speech expertise with robust memory mechanisms to enhance adaptability and prevent forgetting.
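The summary names a memory mechanism but not its internals; the sketch below shows the generic rehearsal-memory ingredient (reservoir sampling plus batch mixing) that such continual-learning setups commonly use, and is not the actual Double Mixture architecture:

```python
import random

class RehearsalMemory:
    """Episodic memory for continual event detection: keep a bounded,
    uniform sample of past-event examples and mix them into each
    new-event training batch to reduce forgetting."""
    def __init__(self, capacity=500):
        self.buffer, self.capacity, self.seen = [], capacity, 0

    def add(self, example):
        # reservoir sampling keeps a uniform sample over everything seen
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(example)
        elif random.random() < self.capacity / self.seen:
            self.buffer[random.randrange(self.capacity)] = example

    def mix(self, new_batch, ratio=0.5):
        """Return the new batch augmented with replayed old examples."""
        k = min(int(len(new_batch) * ratio), len(self.buffer))
        return new_batch + random.sample(self.buffer, k)
```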
arXiv Detail & Related papers (2024-04-20T06:32:00Z)
- Leveraging Language Model Capabilities for Sound Event Detection [10.792576135806623]
We propose an end-to-end framework for understanding audio features while simultaneously generating sound events and their temporal locations.
Specifically, we employ pretrained acoustic models to capture discriminative features across different categories and language models for autoregressive text generation.
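A schematic of the encoder-plus-LM recipe the summary describes; `encoder` and `lm` stand for arbitrary pretrained modules, and the "label onset offset" output format is an assumption for illustration:

```python
import torch.nn as nn

class AudioToEventLM(nn.Module):
    """Sketch: a pretrained acoustic encoder maps audio to feature vectors,
    which are projected into the language model's embedding space as a
    prefix; the LM then autoregressively generates event strings such as
    "dog_bark 1.20 2.75" (label, onset, offset in seconds)."""
    def __init__(self, encoder, lm, enc_dim, lm_dim):
        super().__init__()
        self.encoder, self.lm = encoder, lm
        self.bridge = nn.Linear(enc_dim, lm_dim)  # acoustic -> LM embedding space

    def forward(self, audio, target_tokens):
        """Teacher forcing: condition the LM on acoustic prefix embeddings.
        The exact LM call depends on the library; shown schematically."""
        prefix = self.bridge(self.encoder(audio))  # [B, T, lm_dim]
        return self.lm(prefix, target_tokens)
```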
arXiv Detail & Related papers (2023-08-22T15:59:06Z)
- SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks [88.4408774253634]
Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community.
There are not nearly as many SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers.
Recent work has begun to introduce such benchmarks for several tasks.
arXiv Detail & Related papers (2022-12-20T18:39:59Z)
- Investigating self-supervised, weakly supervised and fully supervised training approaches for multi-domain automatic speech recognition: a study on Bangladeshi Bangla [4.869409466908974]
Speech recognition systems still suffer from robustness and generalizability issues due to domain shift.
In this study, we investigate the robustness of state-of-the-art transfer learning approaches such as self-supervised wav2vec 2.0 and weakly supervised Whisper.
We also demonstrate the significance of domain selection while building a corpus by assessing these models on a novel multi-domain Bangladeshi Bangla ASR benchmark.
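A minimal robustness probe in the spirit of the study, using Hugging Face `transformers` and `jiwer`; the checkpoint names are real public models but not necessarily the ones evaluated in the paper, and the per-domain test pairs are placeholders:

```python
from transformers import pipeline
from jiwer import wer

def domain_wer(checkpoint, pairs):
    """WER of one pretrained checkpoint on one domain's test set, where
    `pairs` is a list of (audio_path, reference_text) tuples. Running this
    per model and per domain yields the kind of comparison described above."""
    asr = pipeline("automatic-speech-recognition", model=checkpoint)
    hyps = [asr(audio)["text"] for audio, _ in pairs]
    refs = [ref for _, ref in pairs]
    return wer(refs, hyps)

# e.g. compare a self-supervised and a weakly supervised checkpoint:
# domain_wer("facebook/wav2vec2-large-960h", broadcast_pairs)
# domain_wer("openai/whisper-small", broadcast_pairs)
```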
arXiv Detail & Related papers (2022-10-24T02:18:03Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
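Of the three categories named above, the contrastive one is the easiest to show in a few lines; below is a generic InfoNCE-style objective in the spirit of wav2vec 2.0 (shapes and temperature are illustrative, not any specific paper's settings):

```python
import torch
import torch.nn.functional as F

def info_nce(context, target, negatives, tau=0.1):
    """Contrastive SSL objective: make the context vector more similar to
    the true (masked) target frame than to sampled negative frames.
    Shapes: context/target [B, D], negatives [B, K, D]."""
    pos = F.cosine_similarity(context, target, dim=-1) / tau                  # [B]
    neg = F.cosine_similarity(context.unsqueeze(1), negatives, dim=-1) / tau  # [B, K]
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)                        # [B, 1+K]
    labels = torch.zeros(context.size(0), dtype=torch.long)                   # positive at index 0
    return F.cross_entropy(logits, labels)
```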
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM that operates on linguistic units, including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
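A minimal sketch of an LSTM language model over sub-word linguistic units, matching the setup described above in spirit; vocabulary and layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class PhonemeLM(nn.Module):
    """Minimal LSTM language model over sub-word unit IDs
    (phonemes or syllables)."""
    def __init__(self, n_units=64, emb=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(n_units, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_units)

    def forward(self, tokens):                # tokens: [B, T] unit IDs
        h, _ = self.lstm(self.embed(tokens))
        return self.head(h)                   # next-unit logits, [B, T, n_units]

tokens = torch.randint(0, 64, (2, 10))        # toy batch of phoneme sequences
logits = PhonemeLM()(tokens)
# next-unit prediction: targets are the inputs shifted by one position
loss = nn.functional.cross_entropy(logits[:, :-1].reshape(-1, 64),
                                   tokens[:, 1:].reshape(-1))
```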
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Fast and Robust Unsupervised Contextual Biasing for Speech Recognition [16.557586847398778]
We propose an alternative approach that does not require an explicit contextual language model.
We derive the bias score for every word in the system vocabulary from the training corpus.
We show significant improvement in recognition accuracy when the relevant context is available.
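The summary leaves the scoring function unspecified; one simple way to derive a per-word bias score from training text is a log-odds ratio between context-relevant and background frequencies, sketched below (the paper's actual formulation may differ):

```python
import math
from collections import Counter

def bias_scores(context_corpus, background_corpus, floor=1e-9):
    """Per-word bias score as the log-odds of a word's relative frequency
    in context-relevant text versus background text; positive scores mark
    words worth boosting when that context is active."""
    ctx, bg = Counter(context_corpus), Counter(background_corpus)
    n_ctx, n_bg = sum(ctx.values()), sum(bg.values())
    vocab = set(ctx) | set(bg)
    return {w: math.log((ctx[w] / n_ctx + floor) / (bg[w] / n_bg + floor))
            for w in vocab}

# usage: add weight * scores[word] to the decoder's word score in beam search
scores = bias_scores("send the biopsy report".split(),
                     "send the mail today".split())
```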
arXiv Detail & Related papers (2020-05-04T17:29:59Z)