MultiQT: Multimodal Learning for Real-Time Question Tracking in Speech
- URL: http://arxiv.org/abs/2005.00812v2
- Date: Tue, 12 May 2020 17:42:42 GMT
- Title: MultiQT: Multimodal Learning for Real-Time Question Tracking in Speech
- Authors: Jakob D. Havtorn, Jan Latko, Joakim Edin, Lasse Borgholt, Lars Maaløe,
Lorenzo Belgrano, Nicolai F. Jacobsen, Regitze Sdun, Željko Agić
- Abstract summary: We propose a novel approach to real-time sequence labeling in speech.
Our model treats speech and its own textual representation as two separate modalities or views.
We show significant gains of jointly learning from the two modalities when compared to text or audio only.
- Score: 4.384576489684272
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We address a challenging and practical task of labeling questions in speech
in real time during telephone calls to emergency medical services in English,
which embeds within a broader decision support system for emergency
call-takers. We propose a novel multimodal approach to real-time sequence
labeling in speech. Our model treats speech and its own textual representation
as two separate modalities or views, as it jointly learns from streamed audio
and its noisy transcription into text via automatic speech recognition. Our
results show significant gains of jointly learning from the two modalities when
compared to text or audio only, under adverse noise and limited volume of
training data. The results generalize to medical symptoms detection where we
observe a similar pattern of improvements with multimodal learning.
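The abstract describes the approach only at a high level: an audio stream and its noisy ASR transcript are treated as two views and fused for per-time-step labeling. As a minimal illustrative sketch only (the layer choices, sizes, concatenation-based fusion, and the assumption that the two views are pre-aligned to a shared time axis are assumptions for illustration, not details from the paper), such a two-branch tagger could look like this in PyTorch:
```python
import torch
import torch.nn as nn

class TwoModalityTagger(nn.Module):
    """Illustrative two-branch tagger: one branch encodes audio frames, the other
    encodes (noisy) ASR token embeddings; fused features are classified per time
    step, e.g. question vs. not-question."""

    def __init__(self, n_mels=80, vocab_size=5000, hidden=256, n_labels=2):
        super().__init__()
        # Audio view: strided 1-D convolutions over log-mel frames.
        self.audio_enc = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
        )
        # Text view: ASR token embeddings + a unidirectional GRU, so the branch
        # stays causal and usable in a streaming setting.
        self.embed = nn.Embedding(vocab_size, hidden)
        self.text_enc = nn.GRU(hidden, hidden, batch_first=True)
        # Fusion by concatenation, then a per-step classifier.
        self.classifier = nn.Linear(2 * hidden, n_labels)

    def forward(self, mels, tokens):
        # mels: (batch, n_mels, T_audio); tokens: (batch, T_text)
        a = self.audio_enc(mels).transpose(1, 2)   # (batch, T_a, hidden)
        t, _ = self.text_enc(self.embed(tokens))   # (batch, T_text, hidden)
        # Crude stand-in for proper time alignment of the two views: truncate
        # both to a common length before concatenating.
        n = min(a.size(1), t.size(1))
        fused = torch.cat([a[:, :n], t[:, :n]], dim=-1)
        return self.classifier(fused)              # (batch, n, n_labels) logits

# Example: a 4-second clip (100 mel frames/s) with a 30-token ASR hypothesis.
model = TwoModalityTagger()
logits = model(torch.randn(1, 80, 400), torch.randint(0, 5000, (1, 30)))
```
In a streaming deployment the convolutional and recurrent branches would be run incrementally as new frames and ASR tokens arrive; here a whole clip is processed at once for simplicity.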
Related papers
- IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities [55.11130688075417]
We introduce IntrinsicVoice, an LLM designed with intrinsic real-time voice interaction capabilities.
Our novel architecture, GroupFormer, can reduce speech sequences to lengths comparable to text sequences.
We construct a multi-turn speech-to-speech dialogue dataset named method-500k, which includes nearly 500k turns of speech-to-speech dialogue.
arXiv Detail & Related papers (2024-10-09T05:04:31Z) - Learning Speech Representation From Contrastive Token-Acoustic
Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z) - MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech
Recognition [75.12948999653338]
We propose a novel multi-task encoder-decoder pre-training framework (MMSpeech) for Mandarin automatic speech recognition (ASR)
We employ a multi-task learning framework including five self-supervised and supervised tasks with speech and text data.
Experiments on AISHELL-1 show that our proposed method achieves state-of-the-art performance, with a more than 40% relative improvement compared with other pre-training methods.
arXiv Detail & Related papers (2022-11-29T13:16:09Z) - ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual
Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z) - Multi-Modal Multi-Correlation Learning for Audio-Visual Speech
Separation [38.75352529988137]
We propose a multi-modal multi-correlation learning framework targeting at the task of audio-visual speech separation.
We define two key correlations which are: (1) identity correlation (between timbre and facial attributes); (2) phonetic correlation.
For implementation, a contrastive learning or adversarial training approach is applied to maximize these two correlations.
arXiv Detail & Related papers (2022-07-04T04:53:39Z) - Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z) - MAESTRO: Matched Speech Text Representations through Modality Matching [35.566604806335626]
Maestro is a self-supervised training method to unify representations learnt from speech and text modalities.
We establish a new state-of-the-art (SOTA) on VoxPopuli multilingual ASR with an 11% relative reduction in Word Error Rate (WER).
We establish a new state-of-the-art (SOTA) on CoVoST 2 with an improvement of 2.8 BLEU averaged over 21 languages.
arXiv Detail & Related papers (2022-04-07T12:48:16Z) - Silent versus modal multi-speaker speech recognition from ultrasound and
video [43.919073642794324]
We investigate multi-speaker speech recognition from ultrasound images of the tongue and video images of the lips.
We train our systems on imaging data from modal speech, and evaluate on matched test sets of two speaking modes: silent and modal speech.
We observe that silent speech recognition from imaging data underperforms compared to modal speech recognition, likely due to a speaking-mode mismatch between training and testing.
arXiv Detail & Related papers (2021-02-27T21:34:48Z) - Multi-view Temporal Alignment for Non-parallel Articulatory-to-Acoustic
Speech Synthesis [59.623780036359655]
Articulatory-to-acoustic (A2A) synthesis refers to the generation of audible speech from captured movement of the speech articulators.
This technique has numerous applications, such as restoring oral communication to people who can no longer speak due to illness or injury.
We propose a solution to this problem based on the theory of multi-view learning.
arXiv Detail & Related papers (2020-12-30T15:09:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.