Improving Transducer-Based Spoken Language Understanding with Self-Conditioned CTC and Knowledge Transfer
- URL: http://arxiv.org/abs/2501.01936v1
- Date: Fri, 03 Jan 2025 18:19:12 GMT
- Title: Improving Transducer-Based Spoken Language Understanding with Self-Conditioned CTC and Knowledge Transfer
- Authors: Vishal Sunder, Eric Fosler-Lussier,
- Abstract summary: We propose to improve end-to-end (E2E) spoken language understand (SLU) in an RNN transducer model (RNN-T)
Our proposed model is akin to an E2E differentiable cascaded model which performs ASR and SLU sequentially.
- Score: 11.362681035467121
- License:
- Abstract: In this paper, we propose to improve end-to-end (E2E) spoken language understand (SLU) in an RNN transducer model (RNN-T) by incorporating a joint self-conditioned CTC automatic speech recognition (ASR) objective. Our proposed model is akin to an E2E differentiable cascaded model which performs ASR and SLU sequentially and we ensure that the SLU task is conditioned on the ASR task by having CTC self conditioning. This novel joint modeling of ASR and SLU improves SLU performance significantly over just using SLU optimization. We further improve the performance by aligning the acoustic embeddings of this model with the semantically richer BERT model. Our proposed knowledge transfer strategy makes use of a bag-of-entity prediction layer on the aligned embeddings and the output of this is used to condition the RNN-T based SLU decoding. These techniques show significant improvement over several strong baselines and can perform at par with large models like Whisper with significantly fewer parameters.
Related papers
- CA-SSLR: Condition-Aware Self-Supervised Learning Representation for Generalized Speech Processing [27.828675312638296]
We introduce Condition-Aware Self-Supervised Learning Representation (CA-SSLR)
CA-S SLR improves the model's capabilities and demonstrates its generality on unseen tasks.
Experiments show that CA-S SLR reduces the number of trainable parameters, mitigates overfitting, and excels in under-resourced and unseen tasks.
arXiv Detail & Related papers (2024-12-05T18:51:10Z) - Towards ASR Robust Spoken Language Understanding Through In-Context
Learning With Word Confusion Networks [68.79880423713597]
We introduce a method that utilizes the ASR system's lattice output instead of relying solely on the top hypothesis.
Our in-context learning experiments, covering spoken question answering and intent classification, underline the LLM's resilience to noisy speech transcripts.
arXiv Detail & Related papers (2024-01-05T17:58:10Z) - Modality Confidence Aware Training for Robust End-to-End Spoken Language
Understanding [18.616202196061966]
End-to-end (E2E) spoken language understanding (SLU) systems that generate a semantic parse from speech have become more promising recently.
This approach uses a single model that utilizes audio and text representations from pre-trained speech recognition models (ASR)
We propose a novel E2E SLU system that enhances robustness to ASR errors by fusing audio and text representations based on the estimated modality confidence of ASR hypotheses.
arXiv Detail & Related papers (2023-07-22T17:47:31Z) - Token-level Sequence Labeling for Spoken Language Understanding using
Compositional End-to-End Models [94.30953696090758]
We build compositional end-to-end spoken language understanding systems.
By relying on intermediate decoders trained for ASR, our end-to-end systems transform the input modality from speech to token-level representations.
Our models outperform both cascaded and direct end-to-end models on a labeling task of named entity recognition.
arXiv Detail & Related papers (2022-10-27T19:33:18Z) - Deliberation Model for On-Device Spoken Language Understanding [69.5587671262691]
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU)
We show that our approach can significantly reduce the degradation when moving from natural speech to synthetic speech training.
arXiv Detail & Related papers (2022-04-04T23:48:01Z) - Attention-based Multi-hypothesis Fusion for Speech Summarization [83.04957603852571]
Speech summarization can be achieved by combining automatic speech recognition (ASR) and text summarization (TS)
ASR errors directly affect the quality of the output summary in the cascade approach.
We propose a cascade speech summarization model that is robust to ASR errors and that exploits multiple hypotheses generated by ASR to attenuate the effect of ASR errors on the summary.
arXiv Detail & Related papers (2021-11-16T03:00:29Z) - N-Best ASR Transformer: Enhancing SLU Performance using Multiple ASR
Hypotheses [0.0]
Spoken Language Understanding (SLU) parses speech into semantic structures like dialog acts and slots.
We show that our approach significantly outperforms the prior state-of-the-art when subjected to the low data regime.
arXiv Detail & Related papers (2021-06-11T17:29:00Z) - RNN Transducer Models For Spoken Language Understanding [49.07149742835825]
We show how RNN-T SLU models can be developed starting from pre-trained automatic speech recognition systems.
In settings where real audio data is not available, artificially synthesized speech is used to successfully adapt various SLU models.
arXiv Detail & Related papers (2021-04-08T15:35:22Z) - Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU)
We show that the error rates of off the shelf ASR and following LU systems can be reduced significantly by 14% relative with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.