Improved Long-Form Speech Recognition by Jointly Modeling the Primary
and Non-primary Speakers
- URL: http://arxiv.org/abs/2312.11123v1
- Date: Mon, 18 Dec 2023 11:47:39 GMT
- Authors: Guru Prakash Arumugam, Shuo-yiin Chang, Tara N. Sainath, Rohit
Prabhavalkar, Quan Wang, Shaan Bijwadia
- Abstract summary: We introduce a novel technique to simultaneously model different groups of speakers in the audio along with the standard transcript tokens.
Speakers are grouped as primary and non-primary, which connects the application domains and alleviates the long-form deletion problem.
This improved model requires no additional training data and incurs no additional training or inference cost.
- Score: 35.32552447347255
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: ASR models often suffer from a long-form deletion problem where the model
predicts sequential blanks instead of words when transcribing a lengthy audio
(in the order of minutes or hours). From the perspective of a user or
downstream system consuming the ASR results, this behavior can be perceived as
the model "being stuck", and potentially make the product hard to use. One of
the culprits for long-form deletion is training-test data mismatch, which can
happen even when the model is trained on diverse and large-scale data collected
from multiple application domains. In this work, we introduce a novel technique
to simultaneously model different groups of speakers in the audio along with
the standard transcript tokens. Speakers are grouped as primary and
non-primary, which connects the application domains and significantly
alleviates the long-form deletion problem. This improved model requires no
additional training data and incurs no additional training or inference cost.
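The abstract's core idea, jointly modeling speaker groups alongside transcript tokens, can be illustrated with a minimal sketch. The token names (`<primary>`, `<non-primary>`) and the segment format below are illustrative assumptions, not the paper's exact scheme; the sketch only shows how speaker-group tags might be interleaved with transcript tokens to form an augmented ASR target sequence.

```python
# Hypothetical sketch: augment the ASR target sequence with speaker-group
# tokens so the model jointly predicts transcript tokens and whether the
# current speech comes from the primary or a non-primary speaker.

PRIMARY, NON_PRIMARY = "<primary>", "<non-primary>"

def build_target_sequence(segments):
    """Interleave speaker-group tags with transcript tokens.

    `segments` is a list of (speaker_group, words) pairs in time order,
    where speaker_group is "primary" or "non_primary".
    """
    tokens = []
    for group, words in segments:
        # Emit a group tag at each speaker change, then the words spoken.
        tokens.append(PRIMARY if group == "primary" else NON_PRIMARY)
        tokens.extend(words.split())
    return tokens

segments = [
    ("primary", "turn left at the next intersection"),
    ("non_primary", "are we there yet"),
    ("primary", "not yet"),
]
target = build_target_sequence(segments)
```

Because the speaker-group tags are just extra tokens in the target vocabulary, a model trained on such sequences needs no architectural changes and no extra inference passes, which is consistent with the abstract's claim of no additional training or inference cost.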
Related papers
- Truncated Consistency Models [57.50243901368328]
Training consistency models requires learning to map all intermediate points along PF ODE trajectories to their corresponding endpoints.
We empirically find that this training paradigm limits the one-step generation performance of consistency models.
We propose a new parameterization of the consistency function and a two-stage training procedure that prevents the truncated-time training from collapsing to a trivial solution.
arXiv Detail & Related papers (2024-10-18T22:38:08Z)
- DAISY: Data Adaptive Self-Supervised Early Exit for Speech Representation Models [55.608981341747246]
We introduce Data Adaptive Self-Supervised Early Exit (DAISY), an approach that decides when to exit based on the self-supervised loss.
Our analysis on the adaptivity of DAISY shows that the model exits early (using fewer layers) on clean data while exits late (using more layers) on noisy data.
arXiv Detail & Related papers (2024-06-08T12:58:13Z)
- Zero-shot Retrieval: Augmenting Pre-trained Models with Search Engines [83.65380507372483]
Large pre-trained models can dramatically reduce the amount of task-specific data required to solve a problem, but they often fail to capture domain-specific nuances out of the box.
This paper shows how to leverage recent advances in NLP and multi-modal learning to augment a pre-trained model with search engine retrieval.
arXiv Detail & Related papers (2023-11-29T05:33:28Z)
- Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks [55.36987468073152]
This paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism.
The DG-SCT module incorporates trainable cross-modal interaction layers into pre-trained audio-visual encoders.
Our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA.
arXiv Detail & Related papers (2023-11-09T05:24:20Z)
- Incomplete Utterance Rewriting as Sequential Greedy Tagging [0.0]
We introduce speaker-aware embedding to model speaker variation.
Our model achieves the best results on all nine restoration scores, with other metric scores comparable to previous state-of-the-art models.
arXiv Detail & Related papers (2023-07-08T04:05:04Z)
- Continual Learning for On-Device Speech Recognition using Disentangled Conformers [54.32320258055716]
We introduce a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks.
We propose a novel compute-efficient continual learning algorithm called DisentangledCL.
Our experiments show that the DisConformer models significantly outperform baselines on general ASR.
arXiv Detail & Related papers (2022-12-02T18:58:51Z)
- Detect Hate Speech in Unseen Domains using Multi-Task Learning: A Case Study of Political Public Figures [7.52579126252489]
We propose a new Multi-task Learning pipeline that utilizes MTL to train simultaneously across multiple hate speech datasets.
We show strong results when examining generalization error in train-test splits and substantial improvements when predicting on previously unseen datasets.
We also assemble a novel dataset, dubbed PubFigs, focusing on the problematic speech of American Public Political Figures.
arXiv Detail & Related papers (2022-08-22T21:13:38Z)
- Directed Speech Separation for Automatic Speech Recognition of Long Form Conversational Speech [10.291482850329892]
We propose a speaker conditioned separator trained on speaker embeddings extracted directly from the mixed signal.
We achieve significant improvements in word error rate (WER) on real conversational data without the need for an additional re-stitching step.
arXiv Detail & Related papers (2021-12-10T23:07:48Z)
- Generative Text Modeling through Short Run Inference [47.73892773331617]
The present work proposes a short-run dynamics for inference: it initializes from the prior distribution of the latent variable and then runs a small number of Langevin dynamics steps guided by its posterior distribution.
We show that the models trained with short run dynamics more accurately model the data, compared to strong language model and VAE baselines, and exhibit no sign of posterior collapse.
arXiv Detail & Related papers (2021-05-27T09:14:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.