Cascaded Cross-Modal Transformer for Request and Complaint Detection
- URL: http://arxiv.org/abs/2307.15097v1
- Date: Thu, 27 Jul 2023 13:45:42 GMT
- Title: Cascaded Cross-Modal Transformer for Request and Complaint Detection
- Authors: Nicolae-Catalin Ristea and Radu Tudor Ionescu
- Abstract summary: We propose a novel cascaded cross-modal transformer (CCMT) that combines speech and text transcripts to detect customer requests and complaints in phone conversations.
Our approach leverages a multimodal paradigm by transcribing the speech using automatic speech recognition (ASR) models and translating the transcripts into different languages.
We apply our system to the Requests Sub-Challenge of the ACM Multimedia 2023 Computational Paralinguistics Challenge, reaching unweighted average recalls (UAR) of 65.41% and 85.87% for the complaint and request classes, respectively.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We propose a novel cascaded cross-modal transformer (CCMT) that combines
speech and text transcripts to detect customer requests and complaints in phone
conversations. Our approach leverages a multimodal paradigm by transcribing the
speech using automatic speech recognition (ASR) models and translating the
transcripts into different languages. Subsequently, we combine
language-specific BERT-based models with Wav2Vec2.0 audio features in a novel
cascaded cross-attention transformer model. We apply our system to the Requests
Sub-Challenge of the ACM Multimedia 2023 Computational Paralinguistics
Challenge, reaching unweighted average recalls (UAR) of 65.41% and 85.87% for
the complaint and request classes, respectively.
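The paper's central architectural idea is a cascade of cross-attention blocks in which text embeddings from language-specific BERT models attend to Wav2Vec2.0 audio features. The abstract gives no pseudocode, so the PyTorch sketch below is only one plausible reading of such a fusion block; the dimensions, depth, pooling, class head, and the use of a single text stream (rather than one per translation language) are all assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """One cascaded block: text tokens query Wav2Vec2.0 audio frames."""
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, text, audio):
        # cross-attention: queries come from text, keys/values from audio
        attended, _ = self.attn(query=text, key=audio, value=audio)
        x = self.norm1(text + attended)
        return self.norm2(x + self.ffn(x))

class CascadedFusionClassifier(nn.Module):
    """Stack cross-modal blocks, mean-pool, then classify (toy request/complaint head)."""
    def __init__(self, dim: int = 768, depth: int = 2, num_classes: int = 2):
        super().__init__()
        self.blocks = nn.ModuleList([CrossModalBlock(dim) for _ in range(depth)])
        self.head = nn.Linear(dim, num_classes)

    def forward(self, text_feats, audio_feats):
        x = text_feats
        for block in self.blocks:          # cascade: refined text re-attends to audio
            x = block(x, audio_feats)
        return self.head(x.mean(dim=1))    # pool over the token axis

# toy shapes: batch of 4, 16 BERT tokens and 50 audio frames, both projected to 768-d
logits = CascadedFusionClassifier()(torch.randn(4, 16, 768), torch.randn(4, 50, 768))
```

For context on the reported numbers: unweighted average recall (UAR) is the per-class recall averaged with equal weight over classes, which is why it is the standard metric of the Computational Paralinguistics (ComParE) challenge series, whose datasets are typically class-imbalanced.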
Related papers
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z)
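The "quantized" component of VQ-CTAP above implies snapping continuous frame-level features onto a learned codebook. The sketch below is a generic VQ-VAE-style quantization step with a commitment loss and straight-through gradient; it illustrates the general technique, not the authors' transcoder.

```python
import torch
import torch.nn.functional as F

def vector_quantize(frames: torch.Tensor, codebook: torch.Tensor):
    """Snap each frame to its nearest codebook entry (generic VQ-VAE-style step).

    frames:   (batch, time, dim) continuous frame-level features
    codebook: (num_codes, dim) learned code vectors
    """
    # pairwise distances between every frame and every code vector
    dists = torch.cdist(frames, codebook.unsqueeze(0).expand(frames.size(0), -1, -1))
    codes = dists.argmin(dim=-1)                       # (batch, time) discrete ids
    quantized = codebook[codes]                        # nearest code vectors
    commit_loss = F.mse_loss(frames, quantized.detach())
    # straight-through estimator: forward uses codes, gradients bypass the argmin
    quantized = frames + (quantized - frames).detach()
    return quantized, codes, commit_loss

q, ids, loss = vector_quantize(torch.randn(2, 100, 256), torch.randn(512, 256))
```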
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation
We introduce a novel model framework, TransVIP, that leverages diverse datasets in a cascade fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
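The concrete design point in the TransVIP summary is a pair of separate encoders, one for voice characteristics and one for isochrony (timing), conditioning the decoder. The sketch below only illustrates the shape of that idea; every module choice here (GRUs, a single transformer decoder layer, mel and duration inputs) is an assumption, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DualConditionDecoder(nn.Module):
    """Illustration: condition a decoder on separate voice and timing encodings."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.voice_enc = nn.GRU(80, dim, batch_first=True)   # summarizes mel frames
        self.timing_enc = nn.GRU(1, dim, batch_first=True)   # encodes a duration signal
        self.decoder = nn.TransformerDecoderLayer(dim, nhead=4, batch_first=True)

    def forward(self, tgt_emb, mels, durations):
        _, voice = self.voice_enc(mels)            # (1, batch, dim) voice summary
        timing, _ = self.timing_enc(durations)     # (batch, time, dim) timing stream
        memory = torch.cat([voice.transpose(0, 1), timing], dim=1)
        return self.decoder(tgt_emb, memory)       # decoder attends to both streams

out = DualConditionDecoder()(torch.randn(2, 20, 256),
                             torch.randn(2, 120, 80),
                             torch.randn(2, 120, 1))
```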
- Cascaded Cross-Modal Transformer for Audio-Textual Classification
We propose to harness the inherent value of multimodal representations by transcribing speech using automatic speech recognition (ASR) models.
We thus obtain an audio-textual (multimodal) representation for each data sample.
We were declared the winning solution in the Requests Sub-Challenge of the ACM Multimedia 2023 Computational Paralinguistics Challenge.
arXiv Detail & Related papers (2024-01-15T10:18:08Z)
- Conversational Speech Recognition by Learning Audio-textual Cross-modal Contextual Representation
We introduce a novel conversational ASR system, extending the Conformer encoder-decoder model with cross-modal conversational representation.
Our approach combines pre-trained speech and text models through a specialized encoder and a modal-level mask input.
By introducing both cross-modal and conversational representations into the decoder, our model retains context over longer sentences without information loss.
arXiv Detail & Related papers (2023-10-22T11:57:33Z)
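The "modal-level mask input" in the entry above plausibly means dropping an entire modality for some training samples so the model stays robust when one stream is missing. That reading is an assumption; the routine below is a generic version of it.

```python
import torch

def modal_level_mask(audio: torch.Tensor, text: torch.Tensor, p_drop: float = 0.3):
    """Zero out one entire modality for some samples (generic sketch).

    audio: (batch, t_audio, dim), text: (batch, t_text, dim)
    """
    batch = audio.size(0)
    drop_audio = torch.rand(batch) < p_drop / 2
    drop_text = (torch.rand(batch) < p_drop / 2) & ~drop_audio   # never drop both
    audio = audio * ~drop_audio.view(-1, 1, 1)   # bool mask broadcasts over time/dim
    text = text * ~drop_text.view(-1, 1, 1)
    return audio, text

a, t = modal_level_mask(torch.randn(8, 100, 256), torch.randn(8, 30, 256))
```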
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
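Aligning phoneme and speech frames in a joint space with two encoders, as CTAP describes, is naturally trained with a symmetric InfoNCE (CLIP-style) objective over matched pairs. The loss below is that generic objective, assumed from the summary rather than taken from the paper.

```python
import torch
import torch.nn.functional as F

def ctap_contrastive_loss(speech_emb, phoneme_emb, temperature: float = 0.07):
    """Symmetric InfoNCE: matched speech/phoneme frames attract, mismatched repel.

    speech_emb, phoneme_emb: (n_pairs, dim); row i of each is a matched pair.
    """
    s = F.normalize(speech_emb, dim=-1)
    p = F.normalize(phoneme_emb, dim=-1)
    logits = s @ p.t() / temperature              # (n_pairs, n_pairs) cosine sims
    targets = torch.arange(s.size(0))             # positives sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

loss = ctap_contrastive_loss(torch.randn(64, 256), torch.randn(64, 256))
```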
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
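The pretraining step described above, randomly masking both the spectrogram and the phoneme sequence before reconstruction, can be sketched as follows; the mask ratio, frame-level (rather than span-level) masking, and zero-fill choices are illustrative assumptions.

```python
import torch

def mask_spectrogram_and_phonemes(spec, phonemes, mask_id: int = 0, ratio: float = 0.15):
    """Randomly mask spectrogram frames and phoneme tokens for joint pretraining.

    spec:     (batch, frames, n_mels) float spectrogram
    phonemes: (batch, length) integer phoneme ids
    """
    spec_mask = torch.rand(spec.shape[:2]) < ratio
    spec = spec.masked_fill(spec_mask.unsqueeze(-1), 0.0)     # zero out masked frames
    phon_mask = torch.rand(phonemes.shape) < ratio
    phonemes = phonemes.masked_fill(phon_mask, mask_id)       # swap in the [MASK] id
    return spec, phonemes, spec_mask, phon_mask

s, p, sm, pm = mask_spectrogram_and_phonemes(torch.randn(2, 200, 80),
                                             torch.randint(1, 50, (2, 40)))
```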
- On Prosody Modeling for ASR+TTS based Voice Conversion
In voice conversion, an approach showing promising results in the latest voice conversion challenge (VCC) 2020 is to first use an automatic speech recognition (ASR) model to transcribe the source speech into the underlying linguistic contents.
Such a paradigm, referred to as ASR+TTS, overlooks the modeling of prosody, which plays an important role in speech naturalness and conversion similarity.
We propose to directly predict prosody from the linguistic representation in a target-speaker-dependent manner, referred to as target text prediction (TTP).
arXiv Detail & Related papers (2021-07-20T13:30:23Z)
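TTP, as summarized above, regresses prosody from linguistic features in a target-speaker-dependent way. The toy module below conditions an MLP on a learned speaker embedding and predicts an F0 contour; using F0 as the prosody feature, and all shapes, are assumptions made here for illustration.

```python
import torch
import torch.nn as nn

class ProsodyPredictor(nn.Module):
    """Toy TTP-style module: linguistic features + target speaker -> F0 contour."""
    def __init__(self, ling_dim: int = 256, n_speakers: int = 10, dim: int = 128):
        super().__init__()
        self.spk = nn.Embedding(n_speakers, dim)
        self.net = nn.Sequential(nn.Linear(ling_dim + dim, dim),
                                 nn.ReLU(),
                                 nn.Linear(dim, 1))

    def forward(self, ling_feats, speaker_id):
        # broadcast the target-speaker embedding across all time steps
        spk = self.spk(speaker_id).unsqueeze(1).expand(-1, ling_feats.size(1), -1)
        return self.net(torch.cat([ling_feats, spk], dim=-1)).squeeze(-1)

f0 = ProsodyPredictor()(torch.randn(2, 50, 256), torch.tensor([0, 3]))
```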
- The Sequence-to-Sequence Baseline for the Voice Conversion Challenge 2020: Cascading ASR and TTS
This paper presents the sequence-to-sequence (seq2seq) baseline system for the voice conversion challenge (VCC) 2020.
We consider a naive approach for voice conversion (VC), which is to first transcribe the input speech with an automatic speech recognition (ASR) model.
We revisit this method under a sequence-to-sequence (seq2seq) framework by utilizing ESPnet, an open-source end-to-end speech processing toolkit.
arXiv Detail & Related papers (2020-10-06T02:27:38Z)
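The "naive approach" above is a literal cascade: transcribe with ASR, then re-synthesize with a target-speaker TTS. The sketch below shows only that control flow, with both models abstracted behind callables; the actual baseline is built on ESPnet, which is not reproduced here.

```python
from typing import Callable
import torch

def cascade_vc(source_wav: torch.Tensor,
               asr: Callable[[torch.Tensor], str],
               tts: Callable[[str], torch.Tensor]) -> torch.Tensor:
    """Voice conversion as a two-stage pipeline (control flow only)."""
    text = asr(source_wav)   # keep the linguistic content, drop speaker identity
    return tts(text)         # re-synthesize in the target speaker's voice

# dummy stand-ins so the sketch runs end to end; a real system would plug in
# trained ASR and TTS models here
converted = cascade_vc(torch.randn(16000),
                       asr=lambda wav: "hello world",
                       tts=lambda text: torch.randn(16000))
```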
- Investigation of Speaker-adaptation methods in Transformer based ASR
This paper explores different ways of incorporating speaker information at the encoder input while training a transformer-based model to improve its speech recognition performance.
We present speaker information as a speaker embedding for each speaker.
We obtain improvements in the word error rate over the baseline through our approach of integrating speaker embeddings into the model.
arXiv Detail & Related papers (2020-08-07T16:09:03Z)
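One common way to present speaker information at the encoder input, as the entry above describes, is to add a learned speaker embedding to every acoustic frame before the encoder stack. The sketch below implements that common variant; the paper investigates several incorporation methods, and this is not claimed to be its exact configuration.

```python
import torch
import torch.nn as nn

class SpeakerAwareEncoder(nn.Module):
    """Add a learned speaker embedding to every frame before the encoder stack."""
    def __init__(self, feat_dim: int = 80, dim: int = 256, n_speakers: int = 100):
        super().__init__()
        self.proj = nn.Linear(feat_dim, dim)
        self.spk = nn.Embedding(n_speakers, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, frames, speaker_id):
        # the (batch, 1, dim) speaker embedding broadcasts over the time axis
        x = self.proj(frames) + self.spk(speaker_id).unsqueeze(1)
        return self.encoder(x)

enc = SpeakerAwareEncoder()(torch.randn(4, 120, 80), torch.tensor([1, 2, 3, 4]))
```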
- SkinAugment: Auto-Encoding Speaker Conversions for Automatic Speech Translation
We propose autoencoding speaker conversion for training data augmentation in automatic speech translation.
This technique directly transforms an audio sequence, resulting in audio synthesized to resemble another speaker's voice.
Our method compares favorably to SpecAugment on English-to-French and English-to-Romanian automatic speech translation (AST) tasks.
arXiv Detail & Related papers (2020-02-27T16:22:42Z)
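As augmentation, the technique above amounts to replacing some training utterances with speaker-converted copies while keeping their transcripts. The sketch below shows that loop with the converter stubbed out; the autoencoding converter itself is the paper's contribution and is not reproduced here.

```python
import random
from typing import Callable, List, Tuple
import torch

def augment_with_speaker_conversion(
        batch: List[Tuple[torch.Tensor, str]],
        convert: Callable[[torch.Tensor, int], torch.Tensor],
        n_speakers: int = 10,
        p: float = 0.5) -> List[Tuple[torch.Tensor, str]]:
    """Swap some utterances for speaker-converted copies; transcripts unchanged."""
    out = []
    for wav, text in batch:
        if random.random() < p:
            target = random.randrange(n_speakers)
            wav = convert(wav, target)     # synthesize the same content, new voice
        out.append((wav, text))
    return out

# dummy converter so the sketch runs; the real converter is a trained autoencoder
aug = augment_with_speaker_conversion([(torch.randn(16000), "bonjour")],
                                      convert=lambda wav, spk: torch.randn_like(wav))
```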