ASR Adaptation for E-commerce Chatbots using Cross-Utterance Context and
Multi-Task Language Modeling
- URL: http://arxiv.org/abs/2106.09532v1
- Date: Tue, 15 Jun 2021 21:27:34 GMT
- Title: ASR Adaptation for E-commerce Chatbots using Cross-Utterance Context and
Multi-Task Language Modeling
- Authors: Ashish Shenoy, Sravan Bodapati, Katrin Kirchhoff
- Abstract summary: Cross-utterance contextual cues play an important role in disambiguating domain-specific content words from speech.
In this paper, we investigate various techniques to improve contextualization, content-word robustness, and domain adaptation of a Transformer-XL neural language model (NLM).
Our best-performing NLM rescorer yields a 19.2% content WER reduction on an e-commerce audio test set and a 6.4% slot labeling F1 improvement.
- Score: 11.193867567895353
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic Speech Recognition (ASR) robustness toward slot entities is
critical in e-commerce voice assistants that involve monetary transactions and
purchases. Along with effective domain adaptation, it is intuitive that
cross-utterance contextual cues play an important role in disambiguating
domain-specific content words from speech. In this paper, we investigate
various techniques to improve the contextualization, content-word robustness,
and domain adaptation of a Transformer-XL neural language model (NLM) used to
rescore ASR N-best hypotheses. To improve contextualization, we utilize
turn-level dialogue acts along with cross-utterance context carry-over.
Additionally, to adapt our domain-general NLM towards e-commerce on the fly, we
use embeddings derived from a masked LM finetuned on in-domain data. Finally,
to improve robustness towards in-domain content words, we propose a multi-task
model that jointly performs content word detection and language modeling.
Compared to a non-contextual LSTM LM baseline, our best-performing NLM rescorer
yields a 19.2% content WER reduction on an e-commerce audio test set and a 6.4%
improvement in slot labeling F1.
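To make the second-pass setup concrete, the sketch below shows N-best rescoring by log-linear interpolation of a first-pass ASR score with an external LM score. This is a minimal illustration with made-up scores and a toy stand-in for the Transformer-XL NLM, not the paper's implementation; the interpolation weight is an assumed hyperparameter.

```python
def rescore_nbest(nbest, lm_score_fn, lm_weight=0.5):
    """Re-rank N-best hypotheses with an external LM.

    nbest: list of (hypothesis_text, asr_log_score) pairs.
    lm_score_fn: maps a hypothesis string to a log-probability.
    Returns hypotheses sorted by the interpolated score, best first.
    """
    rescored = []
    for text, asr_score in nbest:
        combined = (1 - lm_weight) * asr_score + lm_weight * lm_score_fn(text)
        rescored.append((text, combined))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

# Toy stand-in for a domain-adapted NLM: it favors the in-domain content word.
def toy_lm_score(text):
    return 0.0 if "airpods" in text else -5.0

nbest = [("add air pods to cart", -10.0), ("add airpods to cart", -11.0)]
best = rescore_nbest(nbest, toy_lm_score)[0][0]
print(best)  # the domain-aware LM promotes "add airpods to cart"
```

In practice the LM weight is tuned on a development set; a content-word-aware LM shifts mass toward in-domain entities, which is what drives the content WER reduction reported above.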
Related papers
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL).
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves performance comparable to SOTA while being nearly 220 times faster in computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- Unified Language-driven Zero-shot Domain Adaptation [55.64088594551629]
Unified Language-driven Zero-shot Domain Adaptation (ULDA) is a novel task setting.
It enables a single model to adapt to diverse target domains without explicit domain-ID knowledge.
arXiv Detail & Related papers (2024-04-10T16:44:11Z)
- Retrieval Augmented End-to-End Spoken Dialog Models [20.896330994089283]
We apply SLM to speech dialog applications where the dialog states are inferred directly from the audio signal.
Inspired by the RAG (retrieval-augmented generation) paradigm, we propose a retrieval augmented SLM (ReSLM) that overcomes this weakness.
We evaluated ReSLM on the speech MultiWOZ task (DSTC-11 challenge) and found that retrieval augmentation boosts model performance.
arXiv Detail & Related papers (2024-02-02T18:23:09Z)
- Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue [71.15186328127409]
We propose the Paralinguistics-enhanced Generative Pretrained Transformer (ParalinGPT).
The model takes the conversational context of text, speech embeddings, and paralinguistic attributes as input prompts within a serialized multitasking framework.
We utilize the Switchboard-1 corpus, including its sentiment labels as the paralinguistic attribute, as our spoken dialogue dataset.
arXiv Detail & Related papers (2023-12-23T18:14:56Z)
- Improved Contextual Recognition In Automatic Speech Recognition Systems By Semantic Lattice Rescoring [4.819085609772069]
We propose a novel approach for enhancing contextual recognition within ASR systems via semantic lattice processing.
Our solution uses Hidden Markov Model-Gaussian Mixture Model (HMM-GMM) systems along with deep neural network (DNN) models for better accuracy.
We demonstrate the effectiveness of our proposed framework on the LibriSpeech dataset with empirical analyses.
arXiv Detail & Related papers (2023-10-14T23:16:05Z)
- Channel-aware Decoupling Network for Multi-turn Dialogue Comprehension [81.47133615169203]
We propose compositional learning for holistic interaction across utterances, beyond the sequential contextualization from pre-trained language models (PrLMs).
We employ domain-adaptive training strategies to help the model adapt to the dialogue domains.
Experimental results show that our method substantially boosts the strong PrLM baselines in four public benchmark datasets.
arXiv Detail & Related papers (2023-01-10T13:18:25Z)
- "What's The Context?": Long Context NLM Adaptation for ASR Rescoring in Conversational Agents [13.586996848831543]
We investigate various techniques to incorporate turn-based context history into both recurrent (LSTM) and Transformer-XL based NLMs.
For recurrent NLMs, we explore a context carry-over mechanism and feature-based augmentation.
We adapt our contextual NLM towards user-provided speech patterns on the fly by leveraging encodings from a large pre-trained masked language model.
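The carry-over idea above can be illustrated with a toy cache LM, where words from earlier turns are carried into a cache that biases later word probabilities. This is an assumed, deliberately simplified stand-in (a uniform base LM plus a unigram cache), not the recurrent-state carry-over used in the paper.

```python
from collections import Counter

class CacheLM:
    """Toy unigram LM whose probabilities are biased by a context cache
    carried over from previous utterances in the dialogue."""

    def __init__(self, vocab, cache_weight=0.5):
        self.vocab = vocab
        self.cache = Counter()            # persists across utterances
        self.cache_weight = cache_weight  # assumed interpolation weight

    def prob(self, word):
        base = 1.0 / len(self.vocab)      # uniform base LM
        total = sum(self.cache.values())
        cache_p = self.cache[word] / total if total else 0.0
        return (1 - self.cache_weight) * base + self.cache_weight * cache_p

    def carry_over(self, utterance):
        # After each turn, fold its words into the carried context.
        self.cache.update(utterance.split())

vocab = ["order", "airpods", "track", "my", "cancel"]
lm = CacheLM(vocab)
p_before = lm.prob("airpods")
lm.carry_over("order airpods")    # previous turn's words enter the cache
p_after = lm.prob("airpods")
print(p_before < p_after)  # True: prior context raises an in-context word
```

The same principle motivates recurrent-state carry-over: information from earlier turns shifts probability mass toward contextually likely content words during rescoring.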
arXiv Detail & Related papers (2021-04-21T00:15:21Z)
- Meta-Context Transformers for Domain-Specific Response Generation [4.377737808397113]
We present DSRNet, a transformer-based model for dialogue response generation by reinforcing domain-specific attributes.
We study the use of DSRNet in a multi-turn multi-interlocutor environment for domain-specific response generation.
Our model shows significant improvement over the state-of-the-art for multi-turn dialogue systems supported by better BLEU and semantic similarity (BertScore) scores.
arXiv Detail & Related papers (2020-10-12T09:49:27Z)
- Referring Image Segmentation via Cross-Modal Progressive Comprehension [94.70482302324704]
Referring image segmentation aims to segment the foreground masks of the entities that match the description given in a natural language expression.
Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities.
We propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address this challenging task.
arXiv Detail & Related papers (2020-10-01T16:02:30Z)
- Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)
- Contextual RNN-T For Open Domain ASR [41.83409885125617]
End-to-end (E2E) systems for automatic speech recognition (ASR) blend the individual components of a traditional hybrid ASR system into a single neural network.
This has some nice advantages, but it limits the system to being trained using only paired audio and text.
Because of this, E2E models tend to have difficulties with correctly recognizing rare words that are not frequently seen during training, such as entity names.
We propose modifications to the RNN-T model that allow the model to utilize additional metadata text with the objective of improving performance on these named entity words.
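As a rough illustration of using metadata to help with rare entity words, the toy below applies a shallow-fusion-style score bonus for hypothesis words found in a user's metadata list. This is a swapped-in stand-in for exposition only; the paper modifies the RNN-T model itself rather than post-hoc rescoring, and the bonus value and example entities are made up.

```python
def bias_hypotheses(nbest, metadata_words, bonus=2.0):
    """Add a fixed log-score bonus per hypothesis word that appears in
    the metadata word set, then re-rank (best first).

    nbest: list of (hypothesis_text, log_score) pairs.
    """
    biased = []
    for text, score in nbest:
        hits = sum(1 for word in text.split() if word in metadata_words)
        biased.append((text, score + bonus * hits))
    return sorted(biased, key=lambda pair: pair[1], reverse=True)

# A rare entity name drawn from hypothetical user metadata (e.g. contacts).
metadata = {"kirchhoff"}
nbest = [("call kirchhoff", -8.0), ("call kirk off", -7.0)]
print(bias_hypotheses(nbest, metadata)[0][0])  # "call kirchhoff"
```

The point is the same as in the paper: side information about likely entities can overturn the acoustic model's preference for more frequent but wrong word sequences.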
arXiv Detail & Related papers (2020-06-04T04:37:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.