Contextual Biasing of Language Models for Speech Recognition in
Goal-Oriented Conversational Agents
- URL: http://arxiv.org/abs/2103.10325v2
- Date: Fri, 19 Mar 2021 00:44:17 GMT
- Title: Contextual Biasing of Language Models for Speech Recognition in
Goal-Oriented Conversational Agents
- Authors: Ashish Shenoy, Sravan Bodapati, Katrin Kirchhoff
- Abstract summary: Goal-oriented conversational interfaces are designed to accomplish specific tasks.
We propose a new architecture that utilizes context embeddings derived from BERT on sample utterances provided during inference time.
Our experiments show a word error rate (WER) relative reduction of 7% over non-contextual utterance-level NLM rescorers on goal-oriented audio datasets.
- Score: 11.193867567895353
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Goal-oriented conversational interfaces are designed to accomplish specific
tasks and typically have interactions that tend to span multiple turns adhering
to a pre-defined structure and a goal. However, conventional neural language
models (NLM) in Automatic Speech Recognition (ASR) systems are mostly trained
sentence-wise with limited context. In this paper, we explore different ways to
incorporate context into a LSTM based NLM in order to model long range
dependencies and improve speech recognition. Specifically, we use context carry
over across multiple turns and use lexical contextual cues such as system
dialog act from Natural Language Understanding (NLU) models and the user
provided structure of the chatbot. We also propose a new architecture that
utilizes context embeddings derived from BERT on sample utterances provided
during inference time. Our experiments show a word error rate (WER) relative
reduction of 7% over non-contextual utterance-level NLM rescorers on
goal-oriented audio datasets.
Related papers
- Intent-Aware Dialogue Generation and Multi-Task Contrastive Learning for Multi-Turn Intent Classification [6.459396785817196]
Chain-of-Intent generates intent-driven conversations through self-play.
MINT-CL is a framework for multi-turn intent classification using multi-task contrastive learning.
We release MINT-E, a multilingual, intent-aware multi-turn e-commerce dialogue corpus.
arXiv Detail & Related papers (2024-11-21T15:59:29Z) - Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs)
We present a simple yet effective automatic process for creating speech-text pair data.
Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z) - Neural paraphrasing by automatically crawled and aligned sentence pairs [11.95795974003684]
The main obstacle toward neural-network-based paraphrasing is the lack of large datasets with aligned pairs of sentences and paraphrases.
We present a method for the automatic generation of large aligned corpora, that is based on the assumption that news and blog websites talk about the same events using different narrative styles.
We propose a similarity search procedure with linguistic constraints that, given a reference sentence, is able to locate the most similar candidate paraphrases out from millions of indexed sentences.
arXiv Detail & Related papers (2024-02-16T10:40:38Z) - Integrating Self-supervised Speech Model with Pseudo Word-level Targets
from Visually-grounded Speech Model [57.78191634042409]
We propose Pseudo-Word HuBERT (PW-HuBERT), a framework that integrates pseudo word-level targets into the training process.
Our experimental results on four spoken language understanding (SLU) benchmarks suggest the superiority of our model in capturing semantic information.
arXiv Detail & Related papers (2024-02-08T16:55:21Z) - Generative Context-aware Fine-tuning of Self-supervised Speech Models [54.389711404209415]
We study the use of generative large language models (LLM) generated context information.
We propose an approach to distill the generated information during fine-tuning of self-supervised speech models.
We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks: automatic speech recognition, named entity recognition, and sentiment analysis.
arXiv Detail & Related papers (2023-12-15T15:46:02Z) - Improved Contextual Recognition In Automatic Speech Recognition Systems
By Semantic Lattice Rescoring [4.819085609772069]
We propose a novel approach for enhancing contextual recognition within ASR systems via semantic lattice processing.
Our solution consists of using Hidden Markov Models and Gaussian Mixture Models (HMM-GMM) along with Deep Neural Networks (DNN) models for better accuracy.
We demonstrate the effectiveness of our proposed framework on the LibriSpeech dataset with empirical analyses.
arXiv Detail & Related papers (2023-10-14T23:16:05Z) - A Vector Quantized Approach for Text to Speech Synthesis on Real-World
Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
Recent Text-to-Speech architecture is designed for multiple code generation and monotonic alignment.
We show thatRecent Text-to-Speech architecture outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z) - Channel-aware Decoupling Network for Multi-turn Dialogue Comprehension [81.47133615169203]
We propose compositional learning for holistic interaction across utterances beyond the sequential contextualization from PrLMs.
We employ domain-adaptive training strategies to help the model adapt to the dialogue domains.
Experimental results show that our method substantially boosts the strong PrLM baselines in four public benchmark datasets.
arXiv Detail & Related papers (2023-01-10T13:18:25Z) - Towards Language Modelling in the Speech Domain Using Sub-word
Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z) - Towards Relevance and Sequence Modeling in Language Recognition [39.547398348702025]
We propose a neural network framework utilizing short-sequence information in language recognition.
A new model is proposed for incorporating relevance in language recognition, where parts of speech data are weighted more based on their relevance for the language recognition task.
Experiments are performed using the language recognition task in NIST LRE 2017 Challenge using clean, noisy and multi-speaker speech data.
arXiv Detail & Related papers (2020-04-02T18:31:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.