Tie Your Embeddings Down: Cross-Modal Latent Spaces for End-to-end
Spoken Language Understanding
- URL: http://arxiv.org/abs/2011.09044v2
- Date: Thu, 15 Apr 2021 16:38:29 GMT
- Authors: Bhuvan Agrawal, Markus Müller, Martin Radfar, Samridhi Choudhary,
Athanasios Mouchtaris, Siegfried Kunzmann
- Abstract summary: We treat an E2E system as a multi-modal model, with audio and text functioning as its two modalities.
We propose using different multi-modal losses to explicitly guide the acoustic embeddings closer to the text embeddings.
We train the CMLS model on two publicly available E2E datasets across different cross-modal losses, and show that our proposed triplet loss function achieves the best performance.
- Score: 14.752834813510702
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end (E2E) spoken language understanding (SLU) systems can infer the
semantics of a spoken utterance directly from an audio signal. However,
training an E2E system remains a challenge, largely due to the scarcity of
paired audio-semantics data. In this paper, we treat an E2E system as a
multi-modal model, with audio and text functioning as its two modalities, and
use a cross-modal latent space (CMLS) architecture, where a shared latent space
is learned between the 'acoustic' and 'text' embeddings. We propose using
different multi-modal losses to explicitly guide the acoustic embeddings to be
closer to the text embeddings, obtained from a semantically powerful
pre-trained BERT model. We train the CMLS model on two publicly available E2E
datasets, across different cross-modal losses and show that our proposed
triplet loss function achieves the best performance. It yields relative
improvements of 1.4% and 4%, respectively, over an E2E model without a
cross-modal space, and of 0.7% and 1% over a previously published CMLS model
using $L_2$ loss. The gains are higher for a smaller, more
complicated E2E dataset, demonstrating the efficacy of using an efficient
cross-modal loss function, especially when there is limited E2E training data
available.
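The cross-modal objective lends itself to a compact illustration. Below is a minimal PyTorch-style sketch of the two losses the abstract compares, assuming a pooled acoustic embedding from the speech encoder and a paired text embedding (e.g. a projected BERT [CLS] vector) that already live in the shared space; all names are illustrative assumptions, not the authors' code.

    import torch
    import torch.nn.functional as F

    def cmls_losses(acoustic_emb, text_emb, margin=0.5):
        # acoustic_emb, text_emb: (B, D) tensors; row i of each comes from
        # the same utterance (paired audio and transcript). Assumes B > 1.
        # L2 loss: pull each acoustic embedding toward its paired text embedding.
        l2_loss = F.mse_loss(acoustic_emb, text_emb)

        # Triplet loss: anchor = acoustic embedding, positive = its paired text
        # embedding, negatives = text embeddings of other utterances in the batch.
        pos_dist = (acoustic_emb - text_emb).pow(2).sum(dim=1)        # (B,)
        all_dist = torch.cdist(acoustic_emb, text_emb).pow(2)         # (B, B)
        B = acoustic_emb.size(0)
        neg_mask = ~torch.eye(B, dtype=torch.bool, device=all_dist.device)
        neg_dist = all_dist[neg_mask].view(B, B - 1)                  # (B, B-1)
        triplet_loss = F.relu(pos_dist.unsqueeze(1) - neg_dist + margin).mean()
        return l2_loss, triplet_loss

In a CMLS setup, whichever cross-modal term is chosen would be added, with a tunable weight, to the usual SLU classification loss; the margin, weight, and negative-sampling strategy are hyperparameters.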
Related papers
- Cross-Speaker Encoding Network for Multi-Talker Speech Recognition [74.97576062152709]
The Cross-Speaker Encoding (CSE) network addresses limitations of SIMO models by aggregating cross-speaker representations.
The network is integrated with SOT to leverage the advantages of both SIMO and SISO.
arXiv Detail & Related papers (2024-01-08T16:37:45Z) - Leveraging Large Text Corpora for End-to-End Speech Summarization [58.673480990374635]
End-to-end speech summarization (E2E SSum) is a technique to directly generate summary sentences from speech.
We present two novel methods that leverage a large amount of external text summarization data for E2E SSum training.
arXiv Detail & Related papers (2023-03-02T05:19:49Z) - JEIT: Joint End-to-End Model and Internal Language Model Training for
Speech Recognition [63.38229762589485]
We propose a joint end-to-end (E2E) model and internal language model (ILM) training method to inject large-scale unpaired text into the ILM.
With 100B unpaired sentences, JEIT/CJJT improves rare-word recognition accuracy by up to 16.4% over a model trained without unpaired text.
arXiv Detail & Related papers (2023-02-16T21:07:38Z) - End-to-End Speech to Intent Prediction to improve E-commerce Customer
Support Voicebot in Hindi and English [0.0]
We discuss an end-to-end (E2E) speech-to-intent (S2I) model for a customer support voicebot task in a bilingual setting.
We show how we can solve E2E intent classification by leveraging a pre-trained automatic speech recognition (ASR) model with slight modification and fine-tuning on small annotated datasets.
arXiv Detail & Related papers (2022-10-26T18:29:44Z) - Contextual Density Ratio for Language Model Biasing of Sequence to
Sequence ASR Systems [2.4909170697740963]
We propose a contextual density ratio approach for both training a context-aware E2E model and adapting the language model to named entities.
Our proposed technique achieves a relative improvement of up to 46.5% on named entities over an E2E baseline, without degrading the overall recognition accuracy on the whole test set (a generic density-ratio score combination is sketched after this list).
arXiv Detail & Related papers (2022-06-29T13:12:46Z) - Consistent Training and Decoding For End-to-end Speech Recognition Using
Lattice-free MMI [67.13999010060057]
We propose a novel approach to integrate the LF-MMI criterion into E2E ASR frameworks in both the training and decoding stages.
Experiments suggest that the introduction of the LF-MMI criterion consistently leads to significant performance improvements.
arXiv Detail & Related papers (2021-12-05T07:30:17Z) - Have best of both worlds: two-pass hybrid and E2E cascading framework
for speech recognition [71.30167252138048]
Hybrid and end-to-end (E2E) systems have different error patterns in the speech recognition results.
This paper proposes a two-pass hybrid and E2E cascading (HEC) framework to combine the hybrid and E2E models.
We show that the proposed system achieves 8-10% relative word error rate reduction with respect to each individual system.
arXiv Detail & Related papers (2021-10-10T20:11:38Z) - Internal Language Model Adaptation with Text-Only Data for End-to-End
Speech Recognition [80.32546870220979]
We propose an internal LM adaptation (ILMA) of the E2E model using text-only data.
ILMA enables a fast text-only adaptation of the E2E model without increasing the run-time computational cost.
In experiments with transformer transducer models trained on 30K hours of speech, ILMA achieves up to 34.9% relative word error rate reduction.
arXiv Detail & Related papers (2021-10-06T23:03:29Z)
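As referenced in the Contextual Density Ratio entry above, the general idea of density-ratio language model biasing is simple to write down. Here is a minimal sketch, assuming per-hypothesis log-probabilities are already available from the E2E model and two external LMs; the function and weight names are assumptions for illustration, not the paper's implementation (in particular, the contextual gating to named-entity spans is the paper's contribution and is not shown).

    def density_ratio_score(logp_e2e, logp_source_lm, logp_target_lm,
                            lam_source=0.3, lam_target=0.3):
        # logp_e2e:       log P(y | x) from the E2E ASR model
        # logp_source_lm: log P(y) under an LM trained on the E2E training text
        # logp_target_lm: log P(y) under an LM trained on target-domain text
        #
        # The source-LM term approximates the E2E model's implicit language
        # prior and is subtracted; the target LM is added to bias decoding.
        return logp_e2e - lam_source * logp_source_lm + lam_target * logp_target_lm

During beam search, each partial hypothesis would be rescored with a combination like this; the interpolation weights are tuned on held-out data.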