Locality enhanced dynamic biasing and sampling strategies for contextual
ASR
- URL: http://arxiv.org/abs/2401.13146v1
- Date: Tue, 23 Jan 2024 23:46:01 GMT
- Title: Locality enhanced dynamic biasing and sampling strategies for contextual
ASR
- Authors: Md Asif Jalal, Pablo Peso Parada, George Pavlidis, Vasileios
Moschopoulos, Karthikeyan Saravanan, Chrysovalantis-Giorgos Kontoulis, Jisi
Zhang, Anastasios Drosou, Gil Ho Lee, Jungin Lee, Seokyeong Jung
- Abstract summary: Contextual biasing (CB) modules bias the ASR model towards such contextually-relevant phrases.
In this work, we first analyse different sampling strategies to provide insights into the training of CB for ASR.
Second, we introduce a neighbourhood attention (NA) mechanism that localizes self-attention (SA) to the nearest neighbouring frames.
- Score: 7.640373723875947
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic Speech Recognition (ASR) still faces challenges when recognizing
time-variant rare phrases. Contextual biasing (CB) modules bias the ASR model
towards such contextually-relevant phrases. During training, a list of biasing
phrases is selected from a large pool of phrases following a sampling
strategy. In this work, we first analyse different sampling strategies to
provide insights into the training of CB for ASR, using correlation plots between
the bias embeddings at various training stages. Second, we introduce a
neighbourhood attention (NA) mechanism that localizes self-attention (SA) to the nearest
neighbouring frames to further refine the CB output. The results show that the
proposed approach provides, on average, a 25.84% relative WER improvement on
the LibriSpeech sets and rare-word evaluation compared to the baseline.
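The localization idea behind neighbourhood attention can be illustrated with a minimal NumPy sketch: each frame attends only to frames within a small window around it, rather than over the full sequence. This is a toy single-head version with no learned projections; the window radius and the scaled dot-product form are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def neighbourhood_attention(x, radius=2):
    """Toy neighbourhood attention (NA): each frame attends only to
    frames within `radius` positions of itself, instead of the whole
    sequence. `x` is a (T, d) array of frame embeddings."""
    T, d = x.shape
    out = np.zeros_like(x)
    for t in range(T):
        lo, hi = max(0, t - radius), min(T, t + radius + 1)
        neigh = x[lo:hi]                      # (w, d) local window
        scores = neigh @ x[t] / np.sqrt(d)    # scaled dot-product scores
        w = np.exp(scores - scores.max())
        w /= w.sum()                          # softmax over the window
        out[t] = w @ neigh                    # weighted sum of neighbours
    return out

frames = np.random.default_rng(0).standard_normal((10, 8))
y = neighbourhood_attention(frames, radius=2)
print(y.shape)  # (10, 8)
```

With radius 2, each output frame is a convex combination of at most five input frames, which is the locality constraint NA imposes on standard self-attention.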
Related papers
- Enhancing the Robustness of Contextual ASR to Varying Biasing Information Volumes Through Purified Semantic Correlation Joint Modeling [63.755562174967274]
Cross-attention is affected by variations in biasing information volume. We propose a purified semantic correlation joint modeling (PSC-Joint) approach. PSC-Joint achieves average relative F1 score improvements of up to 21.34% on AISHELL-1 and 28.46% on KeSpeech.
arXiv Detail & Related papers (2025-09-07T03:46:59Z) - AURORA: Augmented Understanding via Structured Reasoning and Reinforcement Learning for Reference Audio-Visual Segmentation [113.75682363364004]
AURORA is a framework designed to enhance genuine reasoning and language comprehension in reference audio-visual segmentation. AURORA achieves state-of-the-art performance on Ref-AVS benchmarks and generalizes effectively to unreferenced segmentation.
arXiv Detail & Related papers (2025-08-04T07:47:38Z) - Exploring the Integration of Speech Separation and Recognition with
Self-Supervised Learning Representation [83.36685075570232]
This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end.
We explore multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model.
A proposed integration using TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate on the reverberant WHAMR! test set.
arXiv Detail & Related papers (2023-07-23T05:39:39Z) - Cross-utterance ASR Rescoring with Graph-based Label Propagation [14.669201156515891]
We propose a novel approach for ASR N-best hypothesis rescoring with graph-based label propagation.
In contrast to conventional neural language model (LM) based ASR rescoring/reranking models, our approach focuses on acoustic information.
arXiv Detail & Related papers (2023-03-27T12:08:05Z) - A Comparative Study on Speaker-attributed Automatic Speech Recognition
in Multi-party Meetings [53.120885867427305]
Three approaches are evaluated for speaker-attributed automatic speech recognition (SA-ASR) in a meeting scenario.
The WD-SOT approach achieves a 10.7% relative reduction in averaged speaker-dependent character error rate (SD-CER).
The TS-ASR approach also outperforms the FD-SOT approach and brings 16.5% relative average SD-CER reduction.
arXiv Detail & Related papers (2022-03-31T06:39:14Z) - End-to-end contextual asr based on posterior distribution adaptation for
hybrid ctc/attention system [61.148549738631814]
End-to-end (E2E) speech recognition architectures assemble all components of traditional speech recognition system into a single model.
Although this simplifies the ASR system, it introduces a contextual ASR drawback: the E2E model performs worse on utterances containing infrequent proper nouns.
We propose to add a contextual bias attention (CBA) module to the attention-based encoder-decoder (AED) model to improve its ability to recognize contextual phrases.
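The core of such a bias attention module is a cross-attention between encoder frames and embeddings of the bias phrases. The following is a minimal NumPy sketch of that idea only, not the CBA module from the paper: a single head with no learned query/key/value projections, and a simple residual combination that is an illustrative assumption.

```python
import numpy as np

def bias_cross_attention(enc, bias_emb):
    """Toy contextual-bias cross-attention: encoder frames (T, d) attend
    over a set of bias-phrase embeddings (P, d); the attended bias
    context is added back to the encoder output."""
    T, d = enc.shape
    scores = enc @ bias_emb.T / np.sqrt(d)           # (T, P) relevance
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                # softmax over phrases
    context = w @ bias_emb                           # (T, d) bias context
    return enc + context                             # residual combination

enc = np.random.default_rng(1).standard_normal((6, 4))
phrases = np.random.default_rng(2).standard_normal((3, 4))
out = bias_cross_attention(enc, phrases)
print(out.shape)  # (6, 4)
```

Each frame thus receives a phrase-weighted context vector, which is how biasing steers decoding towards the listed phrases.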
arXiv Detail & Related papers (2022-02-18T03:26:02Z) - Small Changes Make Big Differences: Improving Multi-turn Response
Selection in Dialogue Systems via Fine-Grained Contrastive Learning [27.914380392295815]
Retrieve-based dialogue response selection aims to find a proper response from a candidate set given a multi-turn context.
We propose a novel Fine-Grained Contrastive (FGC) learning method for the response selection task based on PLMs.
arXiv Detail & Related papers (2021-11-19T11:07:07Z) - Learning to Ask Conversational Questions by Optimizing Levenshtein
Distance [83.53855889592734]
We introduce a Reinforcement Iterative Sequence Editing (RISE) framework that optimizes the minimum Levenshtein distance (MLD) through explicit editing actions.
RISE is able to pay attention to tokens that are related to conversational characteristics.
Experimental results on two benchmark datasets show that RISE significantly outperforms state-of-the-art methods.
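The Levenshtein distance that RISE optimizes is the classic edit distance: the minimum number of insertions, deletions, and substitutions needed to turn one sequence into another, computable by dynamic programming. A short reference implementation:

```python
def levenshtein(a, b):
    """Edit distance between sequences `a` and `b`: the minimum number
    of single-token insertions, deletions, and substitutions needed to
    transform `a` into `b`, via the standard two-row DP."""
    prev = list(range(len(b) + 1))        # distance from "" to b[:j]
    for i, ca in enumerate(a, 1):
        cur = [i]                         # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

For example, "kitten" becomes "sitting" with two substitutions (k→s, e→i) and one insertion (g), hence a distance of 3.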
arXiv Detail & Related papers (2021-06-30T08:44:19Z) - Cross-sentence Neural Language Models for Conversational Speech
Recognition [17.317583079824423]
We propose an effective cross-sentence neural LM approach that reranks the ASR N-best hypotheses of an upcoming sentence.
We also explore extracting task-specific global topical information from the cross-sentence history.
arXiv Detail & Related papers (2021-06-13T05:30:16Z) - A bandit approach to curriculum generation for automatic speech
recognition [7.008190762572486]
We present an approach to mitigate the lack of training data by employing Automated Curriculum Learning.
The goal of the approach is to optimize the training sequence of mini-batches ranked by the level of difficulty.
We test our approach on a truly low-resource language and show that the bandit framework yields a clear improvement over the baseline transfer-learning model.
arXiv Detail & Related papers (2021-02-06T20:32:10Z) - Self-supervised Text-independent Speaker Verification using Prototypical
Momentum Contrastive Learning [58.14807331265752]
We show that better speaker embeddings can be learned by momentum contrastive learning.
We generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled.
arXiv Detail & Related papers (2020-12-13T23:23:39Z) - Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.