Multimodal Learning using Optimal Transport for Sarcasm and Humor Detection
- URL: http://arxiv.org/abs/2110.10949v1
- Date: Thu, 21 Oct 2021 07:51:56 GMT
- Title: Multimodal Learning using Optimal Transport for Sarcasm and Humor Detection
- Authors: Shraman Pramanick, Aniket Roy and Vishal M. Patel
- Abstract summary: We deal with multimodal sarcasm and humor detection from conversational videos and image-text pairs.
We propose a novel multimodal learning system, MuLOT, which utilizes self-attention to exploit intra-modal correspondence.
We test our approach for multimodal sarcasm and humor detection on three benchmark datasets.
- Score: 76.62550719834722
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal learning is an emerging yet challenging research area. In this
paper, we deal with multimodal sarcasm and humor detection from conversational
videos and image-text pairs. Sarcasm is a fleeting action that is reflected
across the modalities, and detecting it is challenging because large datasets
are not available for this task in the literature. Therefore, we primarily focus on
resource-constrained training, where the number of training samples is limited.
To this end, we propose a novel multimodal learning system, MuLOT (Multimodal
Learning using Optimal Transport), which utilizes self-attention to exploit
intra-modal correspondence and optimal transport for cross-modal
correspondence. Finally, the modalities are combined with multimodal attention
fusion to capture the inter-dependencies across modalities. We test our
approach for multimodal sarcasm and humor detection on three benchmark datasets:
MUStARD (video, audio, text), UR-FUNNY (video, audio, text), and MST (image,
text), and obtain 2.1%, 1.54%, and 2.34% accuracy improvements, respectively,
over the state of the art.
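The abstract names optimal transport as the mechanism for cross-modal correspondence but gives no details here. As a rough illustration of the general idea only, and not MuLOT's actual implementation, the sketch below computes an entropy-regularised transport plan between two small sets of feature vectors with Sinkhorn iterations. The Sinkhorn solver, the squared Euclidean cost, the uniform weights, the toy features, and every name (`sinkhorn_plan`, `sq_dist`, `text_feats`, `image_feats`, `eps`) are assumptions made for this example.

```python
import math

def sq_dist(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def sinkhorn_plan(cost, a, b, eps=0.5, n_iters=200):
    """Entropy-regularised transport plan between histograms a and b.

    cost[i][j] is the cost of moving mass from source i to target j.
    eps is the regularisation strength (an illustrative default).
    """
    n, m = len(a), len(b)
    # Gibbs kernel derived from the cost matrix.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Rescale rows so the plan's row sums match a.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        # Rescale columns so the plan's column sums match b.
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy stand-ins for per-token text features and per-region image features.
text_feats = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
image_feats = [[0.1, 0.0], [0.9, 0.1]]

cost = [[sq_dist(t, r) for r in image_feats] for t in text_feats]
a = [1.0 / len(text_feats)] * len(text_feats)    # uniform mass on text items
b = [1.0 / len(image_feats)] * len(image_feats)  # uniform mass on image items

plan = sinkhorn_plan(cost, a, b)
ot_cost = sum(plan[i][j] * cost[i][j]
              for i in range(len(a)) for j in range(len(b)))

for row in plan:
    print(["%.3f" % p for p in row])
print("transport cost: %.4f" % ot_cost)
```

Each row of `plan` shows how the mass of one text item is spread over the image items, so larger entries indicate a stronger soft correspondence between the two modalities. A smaller `eps` makes the plan sharper but slows convergence.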
Related papers
- Multimodality Helps Few-Shot 3D Point Cloud Semantic Segmentation [61.91492500828508]
Few-shot 3D point cloud segmentation (FS-PCS) aims at generalizing models to segment novel categories with minimal support samples.
We introduce a cost-free multimodal FS-PCS setup, utilizing textual labels and the potentially available 2D image modality.
We propose a simple yet effective Test-time Adaptive Cross-modal Seg (TACC) technique to mitigate training bias.
arXiv Detail & Related papers (2024-10-29T19:28:41Z)
- Mamba-Enhanced Text-Audio-Video Alignment Network for Emotion Recognition in Conversations [15.748798247815298]
We propose a novel Mamba-enhanced Text-Audio-Video alignment network (MaTAV) for the Emotion Recognition in Conversations (ERC) task.
MaTAV has two advantages: it aligns unimodal features to ensure consistency across modalities, and it handles long input sequences to better capture contextual multimodal information.
arXiv Detail & Related papers (2024-09-08T23:09:22Z)
- Mixture-of-Prompt-Experts for Multi-modal Semantic Understanding [7.329728566839757]
We propose Mixture-of-Prompt-Experts with Block-Aware Prompt Fusion (MoPE-BAF).
MoPE-BAF is a novel multi-modal soft prompt framework based on the unified vision-language model (VLM).
arXiv Detail & Related papers (2024-03-17T19:12:26Z)
- Multi-modal Stance Detection: New Datasets and Model [56.97470987479277]
We study multi-modal stance detection for tweets consisting of texts and images.
We propose a simple yet effective Targeted Multi-modal Prompt Tuning framework (TMPT).
TMPT achieves state-of-the-art performance in multi-modal stance detection.
arXiv Detail & Related papers (2024-02-22T05:24:19Z)
- Read, Look or Listen? What's Needed for Solving a Multimodal Dataset [7.0430001782867]
We propose a two-step method to analyze multimodal datasets, which leverages a small seed of human annotation to map each multimodal instance to the modalities required to process it.
We apply our approach to TVQA, a video question-answering dataset, and discover that most questions can be answered using a single modality, without a substantial bias towards any specific modality.
We analyze MERLOT Reserve, finding that it struggles with image-based questions compared to text and audio, and also with auditory speaker identification.
arXiv Detail & Related papers (2023-07-06T08:02:45Z)
- On Robustness in Multimodal Learning [75.03719000820388]
Multimodal learning is defined as learning over multiple input modalities such as video, audio, and text.
We present a multimodal robustness framework to provide a systematic analysis of common multimodal representation learning methods.
arXiv Detail & Related papers (2023-04-10T05:02:07Z)
- Cross-modal Contrastive Learning for Multimodal Fake News Detection [10.760000041969139]
COOLANT is a cross-modal contrastive learning framework for multimodal fake news detection.
A cross-modal fusion module is developed to learn the cross-modality correlations.
An attention guidance module is implemented to help effectively and interpretably aggregate the aligned unimodal representations.
arXiv Detail & Related papers (2023-02-25T10:12:34Z)
- TLDW: Extreme Multimodal Summarisation of News Videos [76.50305095899958]
We introduce eXtreme Multimodal Summarisation with Multimodal Output (XMSMO) for the scenario of TL;DW - Too Long; Didn't Watch, akin to TL;DR.
XMSMO aims to summarise a video-document pair into a summary with an extremely short length, which consists of one cover frame as the visual summary and one sentence as the textual summary.
Our method is trained, without using reference summaries, by optimising the visual and textual coverage from the perspectives of the distance between the semantic distributions under optimal transport plans.
arXiv Detail & Related papers (2022-10-16T08:19:59Z)
- MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis [48.776247141839875]
We propose a novel framework, MISA, which projects each modality to two distinct subspaces.
The first subspace is modality-invariant, where the representations across modalities learn their commonalities and reduce the modality gap.
Our experiments on popular sentiment analysis benchmarks, MOSI and MOSEI, demonstrate significant gains over state-of-the-art models.
arXiv Detail & Related papers (2020-05-07T15:13:23Z)