Sarcasm in Sight and Sound: Benchmarking and Expansion to Improve
Multimodal Sarcasm Detection
- URL: http://arxiv.org/abs/2310.01430v1
- Date: Fri, 29 Sep 2023 07:00:41 GMT
- Title: Sarcasm in Sight and Sound: Benchmarking and Expansion to Improve
Multimodal Sarcasm Detection
- Authors: Swapnil Bhosale, Abhra Chaudhuri, Alex Lee Robert Williams, Divyank
Tiwari, Anjan Dutta, Xiatian Zhu, Pushpak Bhattacharyya, Diptesh Kanojia
- Abstract summary: We benchmark the MUStARD++ dataset with state-of-the-art language, speech, and visual encoders to fully utilize the multi-modal richness it has to offer.
We propose an extension, which we call MUStARD++ Balanced, and benchmark it with instances from the extension split across both train and test sets, achieving a further 2.4% macro-F1 boost.
- Score: 68.82684696740134
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The introduction of the MUStARD dataset, and its emotion recognition
extension MUStARD++, have identified sarcasm to be a multi-modal phenomenon --
expressed not only in natural language text, but also through manners of speech
(like tonality and intonation) and visual cues (facial expression). With this
work, we aim to perform a rigorous benchmarking of the MUStARD++ dataset by
considering state-of-the-art language, speech, and visual encoders, for fully
utilizing the totality of the multi-modal richness that it has to offer,
achieving a 2% improvement in macro-F1 over the existing benchmark.
Additionally, to address the imbalance in the 'sarcasm type' category in
MUStARD++, we propose an extension, which we call MUStARD++ Balanced,
benchmarking it with instances from the extension split across both train
and test sets, achieving a further 2.4% macro-F1 boost. The new clips were
taken from a novel source, the TV show House M.D., which adds to the diversity
of the dataset, and were manually annotated by multiple annotators with
substantial inter-annotator agreement in terms of Cohen's kappa and
Krippendorff's alpha. Our code, extended data, and SOTA benchmark models are
made public.
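As a rough illustration of the benchmarking setup described above, the sketch below late-fuses utterance-level text, speech, and visual embeddings into a binary sarcasm classifier and scores predictions with macro-F1, the metric the benchmark reports. The encoder dimensions, concatenation-based fusion, and layer sizes are illustrative assumptions, not the exact architecture evaluated in the paper.

```python
# Minimal late-fusion sketch for multimodal sarcasm detection.
# Dimensions and the fusion design are assumptions for illustration only.
import torch
import torch.nn as nn
from sklearn.metrics import f1_score

class LateFusionSarcasmClassifier(nn.Module):
    def __init__(self, text_dim=768, audio_dim=512, video_dim=512, hidden=256):
        super().__init__()
        # Project each modality's utterance-level embedding into a shared space.
        self.text_proj = nn.Linear(text_dim, hidden)
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.video_proj = nn.Linear(video_dim, hidden)
        self.classifier = nn.Sequential(
            nn.ReLU(), nn.Dropout(0.2), nn.Linear(3 * hidden, 2)
        )

    def forward(self, text_emb, audio_emb, video_emb):
        fused = torch.cat(
            [self.text_proj(text_emb),
             self.audio_proj(audio_emb),
             self.video_proj(video_emb)], dim=-1)
        return self.classifier(fused)  # logits for {non-sarcastic, sarcastic}

def macro_f1(y_true, y_pred):
    # Macro-F1 averages per-class F1 scores equally, so the minority class
    # (here, sarcastic utterances) is weighted the same as the majority class.
    return f1_score(y_true, y_pred, average="macro")
```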
Related papers
- Towards Explainable Bilingual Multimodal Misinformation Detection and Localization [64.37162720126194]
BiMi is a framework that jointly performs region-level localization, cross-modal and cross-lingual consistency detection, and natural language explanation for misinformation analysis. BiMiBench is a benchmark constructed by systematically editing real news images and subtitles. BiMi outperforms strong baselines by up to +8.9 in classification accuracy, +15.9 in localization accuracy, and +2.5 in explanation BERTScore.
arXiv Detail & Related papers (2025-06-28T15:43:06Z) - MMMORRF: Multimodal Multilingual Modularized Reciprocal Rank Fusion [44.45109614673675]
We create a search system that extracts text and features from both visual and audio modalities.
MMMORRF is both effective and efficient, demonstrating practicality in searching videos based on users' information needs.
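Reciprocal rank fusion, named in the MMMORRF title, merges ranked lists from per-modality retrievers by summing 1/(k + rank) over systems. The sketch below shows the standard formulation with the conventional constant k=60; MMMORRF's exact weighting and modality pipeline may differ.

```python
# Standard reciprocal rank fusion (RRF) over multiple ranked lists.
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """ranked_lists: iterable of lists of item ids, best-first."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Higher fused score = better consensus rank across modalities.
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse results from a text-based and a visual-based video retriever.
print(reciprocal_rank_fusion([["v3", "v1", "v7"], ["v1", "v7", "v2"]]))
```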
arXiv Detail & Related papers (2025-03-26T16:28:04Z) - Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features [57.34477506004105]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation.
We introduce novel methodologies and datasets to overcome these challenges.
We propose MhBART, an encoder-decoder model designed to emulate human writing style.
We also propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features.
arXiv Detail & Related papers (2024-12-17T08:47:41Z) - AMuSeD: An Attentive Deep Neural Network for Multimodal Sarcasm Detection Incorporating Bi-modal Data Augmentation [11.568176591294746]
We present AMuSeD (Attentive deep neural network for MUltimodal Sarcasm dEtection incorporating bi-modal Data augmentation).
This approach utilizes the Multimodal Sarcasm Detection dataset (MUStARD) and introduces a two-phase bimodal data augmentation strategy.
The second phase involves the refinement of a FastSpeech 2-based speech synthesis system, tailored specifically for sarcasm to retain sarcastic intonations.
arXiv Detail & Related papers (2024-12-13T12:42:51Z) - MoTE: Reconciling Generalization with Specialization for Visual-Language to Video Knowledge Transfer [20.261021985218648]
We present MoTE, a novel framework that enables generalization and specialization to be balanced in one unified model.
Our approach tunes a mixture of temporal experts to learn multiple task views with various degrees of data fitting.
We achieve a sound balance between zero-shot and close-set video recognition tasks and obtain state-of-the-art or competitive results on various datasets.
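A minimal sketch of the mixture-of-experts idea behind MoTE: a learned gate weights several temporal experts over pooled frame features. The pooling, routing, and absence of any training objective here are simplifying assumptions, not the paper's actual design.

```python
# Schematic mixture of temporal experts with a learned softmax gate.
import torch
import torch.nn as nn

class TemporalExpertMixture(nn.Module):
    def __init__(self, dim=512, num_experts=3):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x):                    # x: (batch, frames, dim)
        pooled = x.mean(dim=1)               # temporal average pooling
        weights = self.gate(pooled).softmax(dim=-1)            # (batch, experts)
        expert_out = torch.stack([e(pooled) for e in self.experts], dim=1)
        # Weighted combination of expert outputs.
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)
```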
arXiv Detail & Related papers (2024-10-14T15:00:55Z) - VyAnG-Net: A Novel Multi-Modal Sarcasm Recognition Model by Uncovering Visual, Acoustic and Glossary Features [13.922091192207718]
Sarcasm recognition aims to identify hidden sarcastic, criticizing, and metaphorical information embedded in everyday dialogue.
We propose a novel approach that combines a lightweight depth attention module with a self-regulated ConvNet to concentrate on the most crucial features of visual data.
We have also conducted a cross-dataset analysis to test the adaptability of VyAnG-Net with unseen samples of another dataset MUStARD++.
arXiv Detail & Related papers (2024-08-05T15:36:52Z) - Pink: Unveiling the Power of Referential Comprehension for Multi-modal
LLMs [49.88461345825586]
This paper proposes a new framework to enhance the fine-grained image understanding abilities of MLLMs.
We present a new method for constructing the instruction tuning dataset at a low cost by leveraging annotations in existing datasets.
We show that our model exhibits a 5.2% accuracy improvement over Qwen-VL and surpasses the accuracy of Kosmos-2 by 24.7%.
arXiv Detail & Related papers (2023-10-01T05:53:15Z) - Improving the Robustness of Summarization Systems with Dual Augmentation [68.53139002203118]
A robust summarization system should be able to capture the gist of the document, regardless of the specific word choices or noise in the input.
We first explore the summarization models' robustness against perturbations including word-level synonym substitution and noise.
We propose a SummAttacker, which is an efficient approach to generating adversarial samples based on language models.
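A rough sketch of LM-driven word substitution in the spirit of SummAttacker: mask a target word and replace it with a masked-language-model candidate to produce a perturbed input. The real attack selects vulnerable words and ranks candidates far more carefully; the model choice and helper below are assumptions.

```python
# Perturb a sentence by swapping one word for a masked-LM suggestion.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def perturb(sentence, target_word):
    masked = sentence.replace(target_word, fill_mask.tokenizer.mask_token, 1)
    for cand in fill_mask(masked):  # top LM replacements for the mask
        token = cand["token_str"].strip()
        if token.lower() != target_word.lower():
            return masked.replace(fill_mask.tokenizer.mask_token, token)
    return sentence  # fall back to the original if no substitute is found

print(perturb("The committee approved the new budget today.", "approved"))
```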
arXiv Detail & Related papers (2023-06-01T19:04:17Z) - FF2: A Feature Fusion Two-Stream Framework for Punctuation Restoration [27.14686854704104]
We propose a Feature Fusion two-stream framework (FF2) for punctuation restoration.
Specifically, one stream leverages a pre-trained language model to capture the semantic feature, while another auxiliary module captures the feature at hand.
Without additional data, the experimental results on the popular benchmark IWSLT demonstrate that FF2 achieves new SOTA performance.
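A minimal two-stream sketch of the FF2 idea: token features from a pre-trained language model are fused with the output of a lightweight auxiliary encoder before per-token punctuation tagging. The dimensions, the GRU auxiliary stream, and concatenation as the fusion operator are assumptions, not FF2's exact components.

```python
# Two-stream feature fusion for token-level punctuation restoration.
import torch
import torch.nn as nn

class TwoStreamPunctuator(nn.Module):
    def __init__(self, lm_dim=768, aux_dim=128, num_labels=4):
        super().__init__()
        # Stream 1: precomputed features from a pre-trained LM (passed in).
        # Stream 2: a lightweight auxiliary encoder over the same tokens.
        self.aux_encoder = nn.GRU(lm_dim, aux_dim, batch_first=True, bidirectional=True)
        self.tagger = nn.Linear(lm_dim + 2 * aux_dim, num_labels)

    def forward(self, lm_features):                  # (batch, seq, lm_dim)
        aux_out, _ = self.aux_encoder(lm_features)   # (batch, seq, 2*aux_dim)
        fused = torch.cat([lm_features, aux_out], dim=-1)
        # Per-token logits, e.g. {none, comma, period, question mark}.
        return self.tagger(fused)
```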
arXiv Detail & Related papers (2022-11-09T06:18:17Z) - MuRAG: Multimodal Retrieval-Augmented Generator for Open Question
Answering over Images and Text [58.655375327681774]
We propose the first Multimodal Retrieval-Augmented Transformer (MuRAG)
MuRAG accesses an external non-parametric multimodal memory to augment language generation.
Our results show that MuRAG achieves state-of-the-art accuracy, outperforming existing models by 10-20% absolute on both datasets.
arXiv Detail & Related papers (2022-10-06T13:58:03Z) - Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble [71.97020373520922]
Sign language is commonly used by deaf or mute people to communicate.
We propose a novel Multi-modal Framework with a Global Ensemble Model (GEM) for isolated Sign Language Recognition (SLR).
Our proposed SAM-SLR-v2 framework is exceedingly effective and achieves state-of-the-art performance with significant margins.
arXiv Detail & Related papers (2021-10-12T16:57:18Z) - See, Hear, Read: Leveraging Multimodality with Guided Attention for
Abstractive Text Summarization [14.881597737762316]
We introduce the first large-scale dataset for abstractive text summarization with videos of diverse duration, compiled from presentations in well-known academic conferences like NDSS, ICML, NeurIPS, etc.
We then propose a factorized multi-modal Transformer-based decoder-only language model, which inherently captures the intra-modal and inter-modal dynamics within various input modalities for the text summarization task.
arXiv Detail & Related papers (2021-05-20T08:56:33Z) - HERO: Hierarchical Encoder for Video+Language Omni-representation
Pre-training [75.55823420847759]
We present HERO, a novel framework for large-scale video+language omni-representation learning.
HERO encodes multimodal inputs in a hierarchical structure, where local context of a video frame is captured by a Cross-modal Transformer.
HERO is jointly trained on HowTo100M and large-scale TV datasets to gain deep understanding of complex social dynamics with multi-character interactions.
arXiv Detail & Related papers (2020-05-01T03:49:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.