AMuSeD: An Attentive Deep Neural Network for Multimodal Sarcasm Detection Incorporating Bi-modal Data Augmentation
- URL: http://arxiv.org/abs/2412.10103v1
- Date: Fri, 13 Dec 2024 12:42:51 GMT
- Title: AMuSeD: An Attentive Deep Neural Network for Multimodal Sarcasm Detection Incorporating Bi-modal Data Augmentation
- Authors: Xiyuan Gao, Shubhi Bansal, Kushaan Gowda, Zhu Li, Shekhar Nayak, Nagendra Kumar, Matt Coler
- Abstract summary: We present AMuSeD (Attentive deep neural network for MUltimodal Sarcasm dEtection incorporating bi-modal Data augmentation).
This approach uses the Multimodal Sarcasm Detection Dataset (MUStARD) and introduces a two-phase bimodal data augmentation strategy.
The first phase generates varied text samples through Back Translation; the second refines a FastSpeech 2-based speech synthesis system, tailored specifically for sarcasm to retain sarcastic intonations.
- Abstract: Detecting sarcasm effectively requires a nuanced understanding of context, including vocal tones and facial expressions. The progression towards multimodal computational methods in sarcasm detection, however, faces challenges due to the scarcity of data. To address this, we present AMuSeD (Attentive deep neural network for MUltimodal Sarcasm dEtection incorporating bi-modal Data augmentation). This approach utilizes the Multimodal Sarcasm Detection Dataset (MUStARD) and introduces a two-phase bimodal data augmentation strategy. The first phase involves generating varied text samples through Back Translation from several secondary languages. The second phase involves the refinement of a FastSpeech 2-based speech synthesis system, tailored specifically for sarcasm to retain sarcastic intonations. Alongside a cloud-based Text-to-Speech (TTS) service, this Fine-tuned FastSpeech 2 system produces corresponding audio for the text augmentations. We also investigate various attention mechanisms for effectively merging text and audio data, finding self-attention to be the most efficient for bimodal integration. Our experiments reveal that this combined augmentation and attention approach achieves a significant F1-score of 81.0% in text-audio modalities, surpassing even models that use three modalities from the MUStARD dataset.
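The fusion finding above lends itself to a brief illustration. The following is a minimal, hypothetical sketch of self-attention over projected text and audio embeddings for binary sarcasm classification; the dimensions, module names, and mean pooling are assumptions, not the authors' released code.

```python
# Minimal sketch (assumptions, not the AMuSeD implementation): fuse
# pre-extracted utterance-level text and audio embeddings with self-attention
# and classify the utterance as sarcastic or not.
import torch
import torch.nn as nn

class SelfAttentionFusion(nn.Module):
    def __init__(self, text_dim=768, audio_dim=128, hidden=256, heads=4):
        super().__init__()
        # Project both modalities into a shared space before attention.
        self.text_proj = nn.Linear(text_dim, hidden)
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.classifier = nn.Linear(hidden, 2)  # sarcastic vs. non-sarcastic

    def forward(self, text_feat, audio_feat):
        # Treat the two modality embeddings as a two-token sequence.
        tokens = torch.stack(
            [self.text_proj(text_feat), self.audio_proj(audio_feat)], dim=1
        )
        fused, _ = self.attn(tokens, tokens, tokens)  # bimodal self-attention
        return self.classifier(fused.mean(dim=1))     # pool and classify

# Usage with random stand-in features for a batch of 8 utterances.
logits = SelfAttentionFusion()(torch.randn(8, 768), torch.randn(8, 128))
print(logits.shape)  # torch.Size([8, 2])
```

Self-attention over the joint text-audio representation is the integration strategy the abstract reports as most effective.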
Related papers
- VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning [64.56272011710735]
We propose a novel single-stage joint speech-text SFT approach based on low-rank adaptation (LoRA) of the large language model (LLM) backbone.
Compared to previous SpeechLMs with 7B or 13B parameters, our 3B model demonstrates superior performance across various speech benchmarks.
arXiv Detail & Related papers (2024-10-23T00:36:06Z)
- VyAnG-Net: A Novel Multi-Modal Sarcasm Recognition Model by Uncovering Visual, Acoustic and Glossary Features [13.922091192207718]
Sarcasm recognition aims to identify hidden sarcastic, criticizing, and metaphorical information embedded in everyday dialogue.
We propose a novel approach that combines a lightweight depth attention module with a self-regulated ConvNet to concentrate on the most crucial features of visual data.
We have also conducted a cross-dataset analysis to test the adaptability of VyAnG-Net with unseen samples of another dataset MUStARD++.
arXiv Detail & Related papers (2024-08-05T15:36:52Z)
- Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard Parameter Sharing [72.56219471145232]
We propose a ST/MT multi-tasking framework with hard parameter sharing.
Our method reduces the speech-text modality gap via a pre-processing stage.
We show that our framework improves attentional encoder-decoder, Connectionist Temporal Classification (CTC), transducer, and joint CTC/attention models by an average of +0.5 BLEU.
arXiv Detail & Related papers (2023-09-27T17:48:14Z)
- Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation [72.7915031238824]
Large diffusion models have been successful in text-to-audio (T2A) synthesis tasks.
They often suffer from common issues such as semantic misalignment and poor temporal consistency.
We propose Make-an-Audio 2, a latent diffusion-based T2A method that builds on the success of Make-an-Audio.
arXiv Detail & Related papers (2023-05-29T10:41:28Z)
- Speech-text based multi-modal training with bidirectional attention for improved speech recognition [26.47071418582507]
We propose to employ a novel bidirectional attention mechanism (BiAM) to jointly learn both ASR encoder (bottom layers) and text encoder with a multi-modal learning method.
BiAM facilitates feature sampling rate exchange, so that the quality of the features transformed from one modality can be measured in the space of the other.
Experimental results on the Librispeech corpus show up to 6.15% word error rate reduction (WERR) with paired data alone, and 9.23% WERR when additional unpaired text data is employed.
arXiv Detail & Related papers (2022-11-01T08:25:11Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, achieving a speedup of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Multimodal Learning using Optimal Transport for Sarcasm and Humor Detection [76.62550719834722]
We deal with multimodal sarcasm and humor detection from conversational videos and image-text pairs.
We propose a novel multimodal learning system, MuLOT, which utilizes self-attention to exploit intra-modal correspondence.
We test our approach for multimodal sarcasm and humor detection on three benchmark datasets.
arXiv Detail & Related papers (2021-10-21T07:51:56Z)
- Multi-Modal Sarcasm Detection Based on Contrastive Attention Mechanism [7.194040730138362]
We construct a Contrastive-Attention-based Sarcasm Detection (ConAttSD) model, which uses an inter-modality contrastive attention mechanism to extract contrastive features for an utterance.
Our experiments on MUStARD, a benchmark multi-modal sarcasm dataset, demonstrate the effectiveness of the proposed ConAttSD model.
arXiv Detail & Related papers (2021-09-30T14:17:51Z)
- FiLMing Multimodal Sarcasm Detection with Attention [0.7340017786387767]
Sarcasm detection identifies natural language expressions whose intended meaning differs from what their surface form implies.
We propose a novel architecture that uses the RoBERTa model with a co-attention layer on top to incorporate context incongruity between input text and image attributes.
Our results demonstrate that our proposed model outperforms the existing state-of-the-art method by 6.14% F1 score on the public Twitter multimodal detection dataset.
arXiv Detail & Related papers (2021-08-09T06:33:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.