PROFASR-BENCH: A Benchmark for Context-Conditioned ASR in High-Stakes Professional Speech
- URL: http://arxiv.org/abs/2512.23686v1
- Date: Mon, 29 Dec 2025 18:43:23 GMT
- Title: PROFASR-BENCH: A Benchmark for Context-Conditioned ASR in High-Stakes Professional Speech
- Authors: Deepak Babu Piskala
- Abstract summary: ProfASR-Bench is a professional-talk evaluation suite for high-stakes applications across finance, medicine, legal, and technology. Each example pairs a natural-language prompt with an entity-rich target utterance, enabling controlled measurement of context-conditioned recognition.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic Speech Recognition (ASR) in professional settings faces challenges that existing benchmarks underplay: dense domain terminology, formal register variation, and near-zero tolerance for critical entity errors. We present ProfASR-Bench, a professional-talk evaluation suite for high-stakes applications across finance, medicine, legal, and technology. Each example pairs a natural-language prompt (domain cue and/or speaker profile) with an entity-rich target utterance, enabling controlled measurement of context-conditioned recognition. The corpus supports conventional ASR metrics alongside entity-aware scores and slice-wise reporting by accent and gender. Using representative families Whisper (encoder-decoder ASR) and Qwen-Omni (audio language models) under matched no-context, profile, domain+profile, oracle, and adversarial conditions, we find a consistent pattern: lightweight textual context produces little to no change in average word error rate (WER), even with oracle prompts, and adversarial prompts do not reliably degrade performance. We term this the context-utilization gap (CUG): current systems are nominally promptable yet underuse readily available side information. ProfASR-Bench provides a standardized context ladder, entity- and slice-aware reporting with confidence intervals, and a reproducible testbed for comparing fusion strategies across model families. Dataset: https://huggingface.co/datasets/prdeepakbabu/ProfASR-Bench Code: https://github.com/prdeepakbabu/ProfASR-Bench
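The abstract pairs conventional WER with entity-aware scores. A minimal sketch of both metrics, assuming a word-level edit distance for WER and a simple "fraction of reference entities missing from the hypothesis" for the entity score; the function names and toy entities are illustrative, not taken from the ProfASR-Bench codebase:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    r, h = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(r)][len(h)] / len(r)


def entity_error_rate(entities: list[str], hypothesis: str) -> float:
    """Fraction of reference entities absent from the hypothesis."""
    hyp = hypothesis.lower()
    missed = sum(1 for e in entities if e.lower() not in hyp)
    return missed / len(entities)


# Toy example: one substituted word out of four, one missed entity of two.
print(wer("revenue at Pfizer rose", "revenue at visor rose"))             # 0.25
print(entity_error_rate(["Pfizer", "revenue"], "revenue at visor rose"))  # 0.5
```

This illustrates why the paper reports both numbers: a single botched entity ("Pfizer" → "visor") moves average WER only slightly while destroying the critical-entity score.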
Related papers
- AQAScore: Evaluating Semantic Alignment in Text-to-Audio Generation via Audio Question Answering [97.52852990265136]
We introduce AQAScore, a backbone-agnostic evaluation framework that leverages the reasoning capabilities of audio-aware large language models. We evaluate AQAScore across multiple benchmarks, including human-rated relevance, pairwise comparison, and compositional reasoning tasks.
arXiv Detail & Related papers (2026-01-21T07:35:36Z)
- WESR: Scaling and Evaluating Word-level Event-Speech Recognition [59.21814194620928]
Speech conveys not only linguistic information but also rich non-verbal vocal events such as laughing and crying. We develop a refined taxonomy of 21 vocal events, with a new categorization into discrete (standalone) versus continuous (mixed with speech) types. Based on the refined taxonomy, we introduce WESR-Bench, an expert-annotated evaluation set (900+ utterances) with a novel position-aware protocol.
arXiv Detail & Related papers (2026-01-08T02:23:21Z)
- Influence Guided Context Selection for Effective Retrieval-Augmented Generation [23.188397777606095]
Retrieval-Augmented Generation (RAG) addresses large language model (LLM) hallucinations by grounding responses in external knowledge. Existing approaches attempt to improve performance through context selection based on predefined context quality assessment metrics. We reconceptualize context quality assessment as an inference-time data valuation problem and introduce the Contextual Influence Value (CI value). This novel metric quantifies context quality by measuring the performance degradation when removing each context from the list.
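The CI value as summarized above is a leave-one-out quantity: score the model with the full context list, then again with each context removed, and attribute the drop to that context. A minimal sketch under that reading; `score_fn` and the toy metric are illustrative stand-ins, not the paper's actual scorer:

```python
def contextual_influence(contexts, score_fn):
    """Leave-one-out influence: how much the score drops when each
    context is removed from the list (higher = more valuable context)."""
    full_score = score_fn(contexts)
    return [full_score - score_fn(contexts[:i] + contexts[i + 1:])
            for i in range(len(contexts))]


# Toy stand-in for a real answer-quality metric: total evidence length.
toy_score = lambda cs: sum(len(c) for c in cs)
print(contextual_influence(["long passage", "hint"], toy_score))  # [12, 4]
```

Contexts with near-zero (or negative) influence can then be pruned before generation, which is the selection step the summary describes.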
arXiv Detail & Related papers (2025-09-21T07:19:09Z)
- AHELM: A Holistic Evaluation of Audio-Language Models [78.20477815156484]
Multimodal audio-language models (ALMs) take interleaved audio and text as input and output text. AHELM is a benchmark that aggregates various datasets -- including 2 new synthetic audio-text datasets called PARADE and CoRe-Bench. We also standardize the prompts, inference parameters, and evaluation metrics to ensure equitable comparisons across models.
arXiv Detail & Related papers (2025-08-29T07:40:39Z)
- Conflict-Aware Soft Prompting for Retrieval-Augmented Generation [13.671410389511498]
Retrieval-augmented generation (RAG) enhances the capabilities of large language models (LLMs) by incorporating external knowledge into their input prompts. RAG often fails to resolve the conflict between incorrect external context and correct parametric knowledge, known as context-memory conflict. We introduce Conflict-Aware REtrieval-Augmented Generation (CARE), consisting of a context assessor and a base LLM. CARE effectively mitigates context-memory conflicts, leading to an average performance gain of 5.0% on QA and fact-checking benchmarks.
arXiv Detail & Related papers (2025-08-21T05:36:29Z)
- ContextASR-Bench: A Massive Contextual Speech Recognition Benchmark [28.28891500803133]
We propose ContextASR-Bench to assess the linguistic competence of Automatic Speech Recognition systems. It encompasses up to 40,000 data entries with more than 300,000 named entities across over 10 domains. Extensive evaluation shows LALMs outperform conventional ASR models by a large margin thanks to the strong world knowledge and context modeling of LLMs.
arXiv Detail & Related papers (2025-07-08T07:21:20Z)
- Benchmarking Akan ASR Models Across Domain-Specific Datasets: A Comparative Evaluation of Performance, Scalability, and Adaptability [0.0]
This study benchmarks seven Akan automatic speech recognition (ASR) models built on transformer architectures. It shows distinct error behaviors between the Whisper and Wav2Vec2 architectures. These findings highlight the need for targeted domain adaptation techniques, adaptive routing strategies, and multilingual training frameworks.
arXiv Detail & Related papers (2025-07-03T08:01:26Z)
- PICASO: Permutation-Invariant Context Composition with State Space Models [98.91198288025117]
State Space Models (SSMs) offer a promising solution by allowing a database of contexts to be mapped onto fixed-dimensional states. We propose a simple mathematical relation derived from SSM dynamics to compose multiple states into one that efficiently approximates the effect of concatenating raw context tokens. We evaluate our resulting method on WikiText and MSMARCO in both zero-shot and fine-tuned settings, and show that we can match the strongest performing baseline while enjoying on average a 5.4x speedup.
arXiv Detail & Related papers (2025-02-24T19:48:00Z)
- A Reality Check on Context Utilisation for Retrieval-Augmented Generation [44.54803681476863]
We introduce DRUID (Dataset of Retrieved Unreliable, Insufficient and Difficult-to-understand contexts) with real-world queries and contexts manually annotated for stance. The dataset is based on the task of automated claim verification, for which automated retrieval of real-world evidence is crucial. We show that synthetic datasets exaggerate context characteristics rare in real retrieved data, which leads to inflated context utilisation results.
arXiv Detail & Related papers (2024-12-22T14:16:38Z)
- SFR-RAG: Towards Contextually Faithful LLMs [57.666165819196486]
Retrieval Augmented Generation (RAG) is a paradigm that integrates external contextual information with large language models (LLMs) to enhance factual accuracy and relevance.
We introduce SFR-RAG, a small LLM that is instruction-tuned with an emphasis on context-grounded generation and hallucination minimization.
We also present ConBench, a new evaluation framework compiling multiple popular and diverse RAG benchmarks.
arXiv Detail & Related papers (2024-09-16T01:08:18Z)
- Context-aware Fine-tuning of Self-supervised Speech Models [56.95389222319555]
We study the use of context, i.e., surrounding segments, during fine-tuning.
We propose a new approach called context-aware fine-tuning.
We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks.
arXiv Detail & Related papers (2022-12-16T15:46:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.