BeAts: Bengali Speech Acts Recognition using Multimodal Attention Fusion
- URL: http://arxiv.org/abs/2306.02680v1
- Date: Mon, 5 Jun 2023 08:12:17 GMT
- Title: BeAts: Bengali Speech Acts Recognition using Multimodal Attention Fusion
- Authors: Ahana Deb, Sayan Nag, Ayan Mahapatra, Soumitri Chattopadhyay, Aritra
Marik, Pijush Kanti Gayen, Shankha Sanyal, Archi Banerjee, Samir Karmakar
- Abstract summary: We develop a novel approach combining two models, wav2vec2.0 for audio and MarianMT for text translation, to predict speech acts.
We also show that our model BeAts ($\underline{\textbf{Be}}$ngali speech acts recognition using Multimodal $\underline{\textbf{At}}$tention Fu$\underline{\textbf{s}}$ion) significantly outperforms both a unimodal speech-only baseline and a simpler bimodal fusion of speech and text.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Spoken languages often utilise intonation, rhythm, intensity, and
structure to communicate intention, and the same words can be interpreted
differently depending on the rhythm with which they are uttered. These speech
acts provide the foundation of communication and are unique in expression to
each language. Recent attention-based models, which have demonstrated an
ability to learn powerful representations from multilingual datasets, perform
well on speech tasks and are well suited to modelling specific tasks in
low-resource languages.
Here, we develop a novel multimodal approach combining two models, wav2vec2.0
for audio and MarianMT for text translation, by using multimodal attention
fusion to predict speech acts in our prepared Bengali speech corpus. We also
show that our model BeAts ($\underline{\textbf{Be}}$ngali speech acts
recognition using Multimodal $\underline{\textbf{At}}$tention
Fu$\underline{\textbf{s}}$ion) significantly outperforms both the unimodal
baseline using only speech data and a simpler bimodal fusion using both speech
and text data. Project page: https://soumitri2001.github.io/BeAts
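A minimal PyTorch sketch of how such a fusion head could look is given below. The abstract fixes only the building blocks (wav2vec2.0 audio features, MarianMT text features, attention fusion over both); the hidden sizes, projection layers, fusion direction, head count, and the four-class output here are illustrative assumptions, not the published BeAts configuration.

```python
import torch
import torch.nn as nn

class AttentionFusionClassifier(nn.Module):
    """Hypothetical cross-modal attention fusion head.

    Audio features are assumed to come from a frozen wav2vec2.0
    encoder (hidden size 768 for the base model) and text features
    from a MarianMT encoder (hidden size 512); the exact dimensions,
    depth, and speech-act class count used by BeAts are assumptions.
    """

    def __init__(self, d_audio=768, d_text=512, d_model=256,
                 n_heads=4, n_classes=4):
        super().__init__()
        # Project both modalities into a shared space before fusing.
        self.audio_proj = nn.Linear(d_audio, d_model)
        self.text_proj = nn.Linear(d_text, d_model)
        # Text tokens attend over audio frames: one plausible fusion direction.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                batch_first=True)
        self.classifier = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, n_classes),
        )

    def forward(self, audio_feats, text_feats):
        a = self.audio_proj(audio_feats)  # (B, T_audio, d_model)
        t = self.text_proj(text_feats)    # (B, T_text, d_model)
        fused, _ = self.cross_attn(query=t, key=a, value=a)
        # Mean-pool the fused sequence and classify the speech act.
        return self.classifier(fused.mean(dim=1))

# Random stand-ins for the frozen encoder outputs; in practice these
# would be wav2vec2.0 last hidden states and MarianMT encoder states.
audio = torch.randn(2, 300, 768)  # 2 clips, 300 audio frames
text = torch.randn(2, 20, 512)    # 2 translated sentences, 20 tokens
logits = AttentionFusionClassifier()(audio, text)
print(logits.shape)  # torch.Size([2, 4])
```

A single cross-attention block like this is the simplest instance of the idea; that BeAts is reported to beat both the unimodal and the simpler bimodal baselines suggests the attention fusion itself, rather than either encoder alone, carries much of the benefit.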
Related papers
- Seamless: Multilingual Expressive and Streaming Speech Translation [71.12826355107889]
We introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion.
First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model: SeamlessM4T v2.
We bring major components from SeamlessExpressive and SeamlessStreaming together to form Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real-time.
arXiv Detail & Related papers (2023-12-08T17:18:42Z)
- Toward Joint Language Modeling for Speech Units and Text [89.32163954508489]
We explore joint language modeling for speech units and text.
We introduce automatic metrics to evaluate how well the joint LM mixes speech and text.
Our results show that by mixing speech units and text with our proposed mixing techniques, the joint LM improves over a speech-only baseline on SLU tasks.
arXiv Detail & Related papers (2023-10-12T20:53:39Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- Textless Speech-to-Speech Translation With Limited Parallel Data [51.3588490789084]
PFB is a framework for training textless S2ST models that require just dozens of hours of parallel speech data.
We train and evaluate our models for English-to-German, German-to-English and Marathi-to-English translation on three different domains.
arXiv Detail & Related papers (2023-05-24T17:59:05Z)
- MParrotTTS: Multilingual Multi-speaker Text to Speech Synthesis in Low Resource Setting [16.37243395952266]
MParrotTTS is a unified multilingual, multi-speaker text-to-speech (TTS) synthesis model.
It adapts to a new language with minimal supervised data and generalizes to languages not seen while training the self-supervised backbone.
We present extensive results on six languages in terms of speech naturalness and speaker similarity in parallel and cross-lingual synthesis.
arXiv Detail & Related papers (2023-05-19T13:43:36Z)
- SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities [39.07096632751864]
SpeechGPT is a large language model with intrinsic cross-modal conversational abilities.
We employ a three-stage training strategy that includes modality-adaptation pre-training, cross-modal instruction fine-tuning, and chain-of-modality instruction fine-tuning.
arXiv Detail & Related papers (2023-05-18T14:23:25Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
- mSLAM: Massively multilingual joint pre-training for speech and text [43.32334037420761]
mSLAM learns cross-lingual cross-modal representations of speech and text by pre-training jointly on large amounts of unlabeled speech and text in multiple languages.
We find that joint pre-training with text improves quality on speech translation, speech intent classification and speech language-ID.
arXiv Detail & Related papers (2022-02-03T02:26:40Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language directly into text in another language.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)