Leveraging Cross-Utterance Context For ASR Decoding
- URL: http://arxiv.org/abs/2306.16903v1
- Date: Thu, 29 Jun 2023 12:48:25 GMT
- Title: Leveraging Cross-Utterance Context For ASR Decoding
- Authors: Robert Flynn and Anton Ragni
- Abstract summary: Cross-utterance information has been shown to be beneficial during second-pass re-scoring. We investigate the incorporation of long-context transformer LMs for cross-utterance decoding of acoustic models via beam search.
- Score: 6.033324057680156
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While external language models (LMs) are often incorporated into the decoding
stage of automated speech recognition systems, these models usually operate
with limited context. Cross-utterance information has been shown to be
beneficial during second-pass re-scoring; however, this limits the hypothesis
space based on the local information available to the first-pass LM. In this
work, we investigate the incorporation of long-context transformer LMs for
cross-utterance decoding of acoustic models via beam search, and compare
against results from n-best rescoring. Results demonstrate that beam search
allows for an improved use of cross-utterance context. When evaluating on the
long-format dataset AMI, results show a 0.7% and 0.3% absolute reduction in
word error rate on the dev and test sets compared to the single-utterance
setting, with improvements when including up to 500 tokens of prior context.
Evaluations are also provided for Tedlium-1, with less significant improvements
of around 0.1% absolute.
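To make the comparison in the abstract concrete, the sketch below contrasts the two decoding setups: shallow-fusion beam search in which the external LM is conditioned on tokens from previously decoded utterances, and n-best rescoring, where the hypothesis space is fixed by the first pass. This is an illustrative sketch rather than the authors' released code; all names (`am_log_probs`, `lm_log_prob`, `lm_weight`, `max_context_tokens`) are assumptions, and the per-frame expansion is deliberately simplified.

```python
# Minimal sketch (assumed, not the paper's implementation) of cross-utterance
# LM fusion: (1) beam search with shallow fusion, (2) n-best rescoring.
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List

@dataclass
class Beam:
    tokens: List[str] = field(default_factory=list)  # hypothesis tokens for the current utterance
    score: float = 0.0                                # accumulated log-probability

def cross_utterance_beam_search(
    am_log_probs: Iterable[Dict[str, float]],        # per-frame token -> acoustic log-prob (assumed interface)
    lm_log_prob: Callable[[List[str], str], float],   # log P(token | context tokens) from the external LM
    prior_context: List[str],                         # tokens from previously decoded utterances
    beam_size: int = 8,
    lm_weight: float = 0.5,
    max_context_tokens: int = 500,
) -> Beam:
    context = prior_context[-max_context_tokens:]     # truncate the cross-utterance context window
    beams = [Beam()]
    for frame in am_log_probs:
        candidates = []
        for beam in beams:
            for token, am_lp in frame.items():
                # The LM scores the next token given prior utterances plus the current hypothesis.
                lm_lp = lm_log_prob(context + beam.tokens, token)
                candidates.append(Beam(beam.tokens + [token],
                                       beam.score + am_lp + lm_weight * lm_lp))
        # Keep only the best `beam_size` hypotheses.
        beams = sorted(candidates, key=lambda b: b.score, reverse=True)[:beam_size]
    return beams[0]

def nbest_rescore(
    nbest: List[Beam],                                # first-pass hypotheses carrying acoustic scores only
    lm_log_prob: Callable[[List[str], str], float],
    prior_context: List[str],
    lm_weight: float = 0.5,
    max_context_tokens: int = 500,
) -> Beam:
    # Unlike beam search, the hypothesis space here is fixed by the first pass.
    context = prior_context[-max_context_tokens:]
    def rescored(h: Beam) -> float:
        lm_total = sum(lm_log_prob(context + h.tokens[:i], t)
                       for i, t in enumerate(h.tokens))
        return h.score + lm_weight * lm_total
    return max(nbest, key=rescored)
```

The design difference the abstract highlights is visible here: in beam search the cross-utterance context influences which hypotheses survive at every step, whereas rescoring can only re-rank whatever the first pass already produced.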
Related papers
- Device-Directed Speech Detection for Follow-up Conversations Using Large Language Models [16.920823078873095]
Follow-up conversations with virtual assistants (VAs) enable a user to seamlessly interact with a VA without the need to repeatedly invoke it using a keyword.
We show on the real-world dataset of follow-up conversations that this approach yields large gains due to the joint modeling of the previous speech context and ASR uncertainty.
arXiv Detail & Related papers (2024-10-28T19:43:43Z)
- CLAIR-A: Leveraging Large Language Models to Judge Audio Captions [73.51087998971418]
Evaluating machine-generated audio captions is a complex task that requires considering diverse factors.
We propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models.
In our evaluations, CLAIR-A better predicts human judgements of quality compared to traditional metrics.
arXiv Detail & Related papers (2024-09-19T17:59:52Z)
- It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition [70.77292069313154]
Large language models (LLMs) can be successfully used for generative error correction (GER) on top of the automatic speech recognition (ASR) output.
In this work, we aim to overcome such a limitation by infusing acoustic information before generating the predicted transcription, through a novel late fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF).
arXiv Detail & Related papers (2024-02-08T07:21:45Z)
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
LLMs with a reasonable prompt and their generative capability can even correct tokens that are missing from the N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments [49.38965743465124]
This paper introduces a streaming Transformer-Transducer that jointly generates automatic speech recognition (ASR) and speech translation (ST) outputs using a single decoder.
Experiments in monolingual and multilingual settings demonstrate that our approach achieves the best quality-latency balance.
arXiv Detail & Related papers (2023-07-07T02:26:18Z)
- Towards Effective and Compact Contextual Representation for Conformer Transducer Speech Recognition Systems [39.90886684053726]
This paper aims to derive a suitable compact representation of the most relevant history contexts.
Experiments on the 1000-hr Gigaspeech corpus demonstrate that the proposed contextualized streaming Conformer-Transducers outperform the baseline.
arXiv Detail & Related papers (2023-06-23T05:55:19Z)
- A Reference-less Quality Metric for Automatic Speech Recognition via Contrastive-Learning of a Multi-Language Model with Self-Supervision [0.20999222360659603]
This work proposes a referenceless quality metric, which allows comparing the performance of different ASR models on a speech dataset without ground truth transcriptions.
To estimate the quality of ASR hypotheses, a pre-trained language model (LM) is fine-tuned with contrastive learning in a self-supervised learning manner.
The proposed referenceless metric obtains a much higher correlation with WER scores and their ranks than the perplexity metric from the state-of-the-art multilingual LM in all experiments.
arXiv Detail & Related papers (2023-06-21T21:33:39Z)
- Strategies for improving low resource speech to text translation relying on pre-trained ASR models [59.90106959717875]
This paper presents techniques and findings for improving the performance of low-resource speech to text translation (ST).
We conducted experiments on both simulated and real low-resource setups, for the language pairs English-Portuguese and Tamasheq-French, respectively.
arXiv Detail & Related papers (2023-05-31T21:58:07Z)
- Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers [56.56220390953412]
We extend our prior work by introducing the Conformer architecture to further improve the accuracy.
We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance.
arXiv Detail & Related papers (2021-04-19T16:18:00Z)
- Cross-Utterance Language Models with Acoustic Error Sampling [1.376408511310322]
A cross-utterance LM (CULM) is proposed to augment the input to a standard long short-term memory (LSTM) LM.
An acoustic error sampling technique is proposed to reduce the mismatch between training and test time.
Experiments performed on both the AMI and Switchboard datasets show that CULMs outperform the LSTM LM baseline in WER.
arXiv Detail & Related papers (2020-08-19T17:40:11Z)
- Multimodal Semi-supervised Learning Framework for Punctuation Prediction in Conversational Speech [17.602098162338137]
We explore a multimodal semi-supervised learning approach for punctuation prediction.
We learn representations from large amounts of unlabelled audio and text data.
When trained on 1 hour of speech and text data, the proposed model achieved a 9-18% absolute improvement over the baseline model.
arXiv Detail & Related papers (2020-08-03T08:13:09Z)