BeAts: Bengali Speech Acts Recognition using Multimodal Attention Fusion
- URL: http://arxiv.org/abs/2306.02680v1
- Date: Mon, 5 Jun 2023 08:12:17 GMT
- Title: BeAts: Bengali Speech Acts Recognition using Multimodal Attention Fusion
- Authors: Ahana Deb, Sayan Nag, Ayan Mahapatra, Soumitri Chattopadhyay, Aritra
Marik, Pijush Kanti Gayen, Shankha Sanyal, Archi Banerjee, Samir Karmakar
- Abstract summary: We develop a novel approach combining two models, wav2vec2.0 for audio and MarianMT for text translation, to predict speech acts.
We also show that our model BeAts ($\underline{\textbf{Be}}$ngali speech acts recognition using Multimodal $\underline{\textbf{At}}$tention Fu$\underline{\textbf{s}}$ion) significantly outperforms both a unimodal speech-only baseline and a simpler bimodal fusion of speech and text.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Spoken languages often utilise intonation, rhythm, intensity, and
structure to communicate intention, and the same words can be interpreted
differently depending on the rhythm with which they are uttered. These speech
acts provide the foundation of communication and are unique in expression to
each language. Recent attention-based models, which have demonstrated an
ability to learn powerful representations from multilingual datasets, perform
well on speech tasks and are well suited to modelling specific tasks in
low-resource languages.
Here, we develop a novel multimodal approach combining two models, wav2vec2.0
for audio and MarianMT for text translation, by using multimodal attention
fusion to predict speech acts in our prepared Bengali speech corpus. We also
show that our model BeAts ($\underline{\textbf{Be}}$ngali speech acts
recognition using Multimodal $\underline{\textbf{At}}$tention
Fu$\underline{\textbf{s}}$ion) significantly outperforms both the unimodal
baseline using only speech data and a simpler bimodal fusion using both speech
and text data. Project page: https://soumitri2001.github.io/BeAts
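A minimal PyTorch sketch of how such a fusion head could look is given below. The abstract fixes only the building blocks (wav2vec2.0 audio features, MarianMT text features, attention fusion over both); the hidden sizes, projection layers, fusion direction, head count, and the four-class output here are illustrative assumptions, not the published BeAts configuration.

```python
import torch
import torch.nn as nn

class AttentionFusionClassifier(nn.Module):
    """Hypothetical cross-modal attention fusion head.

    Audio features are assumed to come from a frozen wav2vec2.0
    encoder (hidden size 768 for the base model) and text features
    from a MarianMT encoder (hidden size 512); the exact dimensions,
    depth, and speech-act class count used by BeAts are assumptions.
    """

    def __init__(self, d_audio=768, d_text=512, d_model=256,
                 n_heads=4, n_classes=4):
        super().__init__()
        # Project both modalities into a shared space before fusing.
        self.audio_proj = nn.Linear(d_audio, d_model)
        self.text_proj = nn.Linear(d_text, d_model)
        # Text tokens attend over audio frames: one plausible fusion direction.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                batch_first=True)
        self.classifier = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, n_classes),
        )

    def forward(self, audio_feats, text_feats):
        a = self.audio_proj(audio_feats)  # (B, T_audio, d_model)
        t = self.text_proj(text_feats)    # (B, T_text, d_model)
        fused, _ = self.cross_attn(query=t, key=a, value=a)
        # Mean-pool the fused sequence and classify the speech act.
        return self.classifier(fused.mean(dim=1))

# Random stand-ins for the frozen encoder outputs; in practice these
# would be wav2vec2.0 last hidden states and MarianMT encoder states.
audio = torch.randn(2, 300, 768)  # 2 clips, 300 audio frames
text = torch.randn(2, 20, 512)    # 2 translated sentences, 20 tokens
logits = AttentionFusionClassifier()(audio, text)
print(logits.shape)  # torch.Size([2, 4])
```

A single cross-attention block like this is the simplest instance of the idea; that BeAts is reported to beat both the unimodal and the simpler bimodal baselines suggests the attention fusion itself, rather than either encoder alone, carries much of the benefit.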
Related papers
- Seamless: Multilingual Expressive and Streaming Speech Translation [71.12826355107889]
We introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion.
First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model: SeamlessM4T v2.
We bring major components from SeamlessExpressive and SeamlessStreaming together to form Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real-time.
arXiv Detail & Related papers (2023-12-08T17:18:42Z)
- Toward Joint Language Modeling for Speech Units and Text [89.32163954508489]
We explore joint language modeling for speech units and text.
We introduce automatic metrics to evaluate how well the joint LM mixes speech and text.
Our results show that by mixing speech units and text with our proposed mixing techniques, the joint LM improves over a speech-only baseline on SLU tasks.
arXiv Detail & Related papers (2023-10-12T20:53:39Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- Textless Speech-to-Speech Translation With Limited Parallel Data [51.3588490789084]
PFB is a framework for training textless S2ST models that require just dozens of hours of parallel speech data.
We train and evaluate our models for English-to-German, German-to-English and Marathi-to-English translation on three different domains.
arXiv Detail & Related papers (2023-05-24T17:59:05Z)
- MParrotTTS: Multilingual Multi-speaker Text to Speech Synthesis in Low Resource Setting [16.37243395952266]
MParrotTTS is a unified multilingual, multi-speaker text-to-speech (TTS) synthesis model.
It adapts to a new language with minimal supervised data and generalizes to languages not seen while training the self-supervised backbone.
We present extensive results on six languages in terms of speech naturalness and speaker similarity in parallel and cross-lingual synthesis.
arXiv Detail & Related papers (2023-05-19T13:43:36Z)
- SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities [39.07096632751864]
SpeechGPT is a large language model with intrinsic cross-modal conversational abilities.
We employ a three-stage training strategy that includes modality-adaptation pre-training, cross-modal instruction fine-tuning, and chain-of-modality instruction fine-tuning.
arXiv Detail & Related papers (2023-05-18T14:23:25Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
- mSLAM: Massively multilingual joint pre-training for speech and text [43.32334037420761]
mSLAM learns cross-lingual cross-modal representations of speech and text by pre-training jointly on large amounts of unlabeled speech and text in multiple languages.
We find that joint pre-training with text improves quality on speech translation, speech intent classification and speech language-ID.
arXiv Detail & Related papers (2022-02-03T02:26:40Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language directly into text in another language.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)