CIF-based Collaborative Decoding for End-to-end Contextual Speech Recognition
- URL: http://arxiv.org/abs/2012.09466v2
- Date: Thu, 18 Feb 2021 07:42:44 GMT
- Title: CIF-based Collaborative Decoding for End-to-end Contextual Speech Recognition
- Authors: Minglun Han and Linhao Dong and Shiyu Zhou and Bo Xu
- Abstract summary: We propose a continuous integrate-and-fire (CIF) based model that supports contextual biasing in a more controllable fashion.
An extra context processing network is introduced to extract contextual embeddings, integrate acoustically relevant context information and decode the contextual output distribution.
Our method brings relative character error rate (CER) reduction of 8.83%/21.13% and relative named entity character error rate (NE-CER) reduction of 40.14%/51.50% when compared with a strong baseline.
- Score: 14.815422751109061
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end (E2E) models have achieved promising results on multiple speech
recognition benchmarks and have shown the potential to become mainstream.
However, the unified structure and the E2E training hamper injecting contextual
information into them for contextual biasing. Though contextual LAS (CLAS)
gives an excellent all-neural solution, the degree of biasing to given context
information is not explicitly controllable. In this paper, we focus on
incorporating context information into the continuous integrate-and-fire (CIF)
based model that supports contextual biasing in a more controllable fashion.
Specifically, an extra context processing network is introduced to extract
contextual embeddings, integrate acoustically relevant context information and
decode the contextual output distribution, thus forming a collaborative
decoding with the decoder of the CIF-based model. Evaluated on the named entity
rich evaluation sets of HKUST/AISHELL-2, our method brings relative character
error rate (CER) reduction of 8.83%/21.13% and relative named entity character
error rate (NE-CER) reduction of 40.14%/51.50% when compared with a strong
baseline. Moreover, it preserves performance on the original evaluation sets
without degradation.
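The collaborative decoding described in the abstract, where the CIF decoder's output distribution and the context processing network's distribution are combined with an explicitly controllable degree of biasing, can be sketched roughly as follows. All names, the interpolation weight `lam`, and the log-linear fusion rule are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of CIF-style collaborative decoding (assumed fusion rule).
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a list of scores.
    m = max(logits)
    z = math.log(sum(math.exp(x - m) for x in logits))
    return [x - m - z for x in logits]

def collaborative_decode(decoder_logits, context_logits, lam=0.3):
    """Interpolate the decoder's log-distribution with the context
    processing network's; `lam` controls the degree of biasing."""
    log_p_dec = log_softmax(decoder_logits)
    log_p_ctx = log_softmax(context_logits)
    fused = [(1.0 - lam) * d + lam * c for d, c in zip(log_p_dec, log_p_ctx)]
    return max(range(len(fused)), key=fused.__getitem__)

# Toy example: one decoding step over a 4-token vocabulary.
dec = [2.0, 1.0, 0.5, 0.1]   # decoder prefers token 0
ctx = [0.0, 0.0, 4.0, 0.0]   # context network biases toward token 2
print(collaborative_decode(dec, ctx, lam=0.0))  # no biasing -> 0
print(collaborative_decode(dec, ctx, lam=0.9))  # strong biasing -> 2
```

Exposing a single interpolation weight is one simple way to make the degree of biasing controllable, which is the property the abstract emphasizes over CLAS-style all-neural biasing.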
Related papers
- Deep CLAS: Deep Contextual Listen, Attend and Spell [18.716477027977525]
Contextual-LAS (CLAS) has been shown effective in improving Automatic Speech Recognition of rare words.
In this work, we propose deep CLAS to use contextual information better.
arXiv Detail & Related papers (2024-09-26T07:40:03Z)
- Improving Neural Biasing for Contextual Speech Recognition by Early Context Injection and Text Perturbation [27.057810339120664]
We propose two techniques to improve context-aware ASR models.
On LibriSpeech, our techniques together reduce the rare word error rate by 60% and 25% relative to no biasing and shallow fusion, respectively.
On SPGISpeech and a real-world dataset ConEC, our techniques also yield good improvements over the baselines.
arXiv Detail & Related papers (2024-07-14T19:32:33Z)
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- Contextualization Distillation from Large Language Model for Knowledge Graph Completion [51.126166442122546]
We introduce the Contextualization Distillation strategy, a plug-in-and-play approach compatible with both discriminative and generative KGC frameworks.
Our method begins by instructing large language models to transform compact, structural triplets into context-rich segments.
Comprehensive evaluations across diverse datasets and KGC techniques highlight the efficacy and adaptability of our approach.
arXiv Detail & Related papers (2024-01-28T08:56:49Z)
- Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction [50.62245481416744]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world.
We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique.
By further elaborating the robustness metric, a model is judged to be robust if its performance remains consistently accurate across each entire clique.
arXiv Detail & Related papers (2023-05-23T12:05:09Z)
- Robust Acoustic and Semantic Contextual Biasing in Neural Transducers for Speech Recognition [14.744220870243932]
We propose to use lightweight character representations to encode fine-grained pronunciation features to improve contextual biasing.
We further integrate pretrained neural language model (NLM) based encoders to encode the utterance's semantic context.
Experiments using a Conformer Transducer model on the Librispeech dataset show a 4.62% - 9.26% relative WER improvement on different biasing list sizes.
arXiv Detail & Related papers (2023-05-09T08:51:44Z)
- Two Stage Contextual Word Filtering for Context Bias in Unified Streaming and Non-streaming Transducer [17.835882045443896]
It is difficult for an E2E ASR system to recognize words, such as named entities, that appear infrequently in the training data.
We propose an efficient approach to obtain a high-quality contextual list for a unified streaming/non-streaming E2E model.
arXiv Detail & Related papers (2023-01-17T07:29:26Z)
- Contextual information integration for stance detection via cross-attention [59.662413798388485]
Stance detection deals with identifying an author's stance towards a target.
Most existing stance detection models are limited because they do not consider relevant contextual information.
We propose an approach to integrate contextual information as text.
arXiv Detail & Related papers (2022-11-03T15:04:29Z)
- A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflating evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z)
- End-to-end contextual asr based on posterior distribution adaptation for hybrid ctc/attention system [61.148549738631814]
End-to-end (E2E) speech recognition architectures assemble all components of a traditional speech recognition system into a single model.
Although this simplifies the ASR system, it introduces a contextual ASR drawback: the E2E model performs worse on utterances containing infrequent proper nouns.
We propose to add a contextual bias attention (CBA) module to the attention-based encoder-decoder (AED) model to improve its ability to recognize contextual phrases.
arXiv Detail & Related papers (2022-02-18T03:26:02Z)
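The contextual bias attention idea in the last entry, attending from a decoder state over embeddings of the bias phrases and folding the resulting context vector back into the state, can be sketched as plain dot-product attention. The shapes, names, and the additive fusion below are assumptions for illustration, not that paper's actual implementation.

```python
# Illustrative dot-product bias attention in the spirit of a CBA module.
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def bias_attention(decoder_state, phrase_embeddings):
    """Attend from the decoder state over bias-phrase embeddings and
    return (decoder state + context vector, attention weights)."""
    scores = [sum(q * k for q, k in zip(decoder_state, emb))
              for emb in phrase_embeddings]
    weights = softmax(scores)
    dim = len(decoder_state)
    context = [sum(w * emb[i] for w, emb in zip(weights, phrase_embeddings))
               for i in range(dim)]
    return [s + c for s, c in zip(decoder_state, context)], weights

state = [1.0, 0.0]
phrases = [[1.0, 0.0],   # phrase acoustically similar to the current state
           [0.0, 1.0]]   # unrelated phrase
biased_state, weights = bias_attention(state, phrases)
print([round(w, 3) for w in weights])  # the similar phrase gets more weight
```

The attention weights show how a bias phrase that matches the current decoder state pulls the output distribution toward that phrase, which is the mechanism such modules rely on to recover infrequent proper nouns.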
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.