Multi-modal Stance Detection: New Datasets and Model
- URL: http://arxiv.org/abs/2402.14298v3
- Date: Thu, 6 Jun 2024 15:17:58 GMT
- Title: Multi-modal Stance Detection: New Datasets and Model
- Authors: Bin Liang, Ang Li, Jingqian Zhao, Lin Gui, Min Yang, Yue Yu, Kam-Fai Wong, Ruifeng Xu,
- Abstract summary: We study multi-modal stance detection for tweets consisting of texts and images.
We propose a simple yet effective Targeted Multi-modal Prompt Tuning framework (TMPT)
TMPT achieves state-of-the-art performance in multi-modal stance detection.
- Score: 56.97470987479277
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stance detection is a challenging task that aims to identify public opinion from social media platforms with respect to specific targets. Previous work on stance detection largely focused on pure texts. In this paper, we study multi-modal stance detection for tweets consisting of texts and images, which are prevalent in today's fast-growing social media platforms where people often post multi-modal messages. To this end, we create five new multi-modal stance detection datasets of different domains based on Twitter, in which each example consists of a text and an image. In addition, we propose a simple yet effective Targeted Multi-modal Prompt Tuning framework (TMPT), where target information is leveraged to learn multi-modal stance features from textual and visual modalities. Experimental results on our five benchmark datasets show that the proposed TMPT achieves state-of-the-art performance in multi-modal stance detection.
Related papers
- Multimodal Multi-turn Conversation Stance Detection: A Challenge Dataset and Effective Model [9.413870182630362]
We introduce a new multimodal multi-turn conversational stance detection dataset (called MmMtCSD)
We propose a novel multimodal large language model stance detection framework (MLLM-SD), that learns joint stance representations from textual and visual modalities.
Experiments on MmMtCSD show state-of-the-art performance of our proposed MLLM-SD approach for multimodal stance detection.
arXiv Detail & Related papers (2024-09-01T03:16:30Z) - Detecting Misinformation in Multimedia Content through Cross-Modal Entity Consistency: A Dual Learning Approach [10.376378437321437]
We propose a Multimedia Misinformation Detection framework for detecting misinformation from video content by leveraging cross-modal entity consistency.
Our results demonstrate that MultiMD outperforms state-of-the-art baseline models.
arXiv Detail & Related papers (2024-08-16T16:14:36Z) - VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks [93.85005277463802]
VisualWebArena is a benchmark designed to assess the performance of multimodal web agents on realistic tasks.
To perform on this benchmark, agents need to accurately process image-text inputs, interpret natural language instructions, and execute actions on websites to accomplish user-defined objectives.
arXiv Detail & Related papers (2024-01-24T18:35:21Z) - Detecting and Grounding Multi-Modal Media Manipulation and Beyond [93.08116982163804]
We highlight a new research problem for multi-modal fake media, namely Detecting and Grounding Multi-Modal Media Manipulation (DGM4)
DGM4 aims to not only detect the authenticity of multi-modal media, but also ground the manipulated content.
We propose a novel HierArchical Multi-modal Manipulation rEasoning tRansformer (HAMMER) to fully capture the fine-grained interaction between different modalities.
arXiv Detail & Related papers (2023-09-25T15:05:46Z) - Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object
Detection [72.36017150922504]
We propose a multi-modal contextual knowledge distillation framework, MMC-Det, to transfer the learned contextual knowledge from a teacher fusion transformer to a student detector.
The diverse multi-modal masked language modeling is realized by an object divergence constraint upon traditional multi-modal masked language modeling (MLM)
arXiv Detail & Related papers (2023-08-30T08:33:13Z) - Read, Look or Listen? What's Needed for Solving a Multimodal Dataset [7.0430001782867]
We propose a two-step method to analyze multimodal datasets, which leverages a small seed of human annotation to map each multimodal instance to the modalities required to process it.
We apply our approach to TVQA, a video question-answering dataset, and discover that most questions can be answered using a single modality, without a substantial bias towards any specific modality.
We analyze the MERLOT Reserve, finding that it struggles with image-based questions compared to text and audio, but also with auditory speaker identification.
arXiv Detail & Related papers (2023-07-06T08:02:45Z) - Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z) - Detecting and Grounding Multi-Modal Media Manipulation [32.34908534582532]
We highlight a new research problem for multi-modal fake media, namely Detecting and Grounding Multi-Modal Media Manipulation (DGM4)
DGM4 aims to not only detect the authenticity of multi-modal media, but also ground the manipulated content.
We propose a novel HierArchical Multi-modal Manipulation rEasoning tRansformer (HAMMER) to fully capture the fine-grained interaction between different modalities.
arXiv Detail & Related papers (2023-04-05T16:20:40Z) - Multimodal Fake News Detection with Adaptive Unimodal Representation
Aggregation [28.564442206829625]
AURA is a multimodal fake news detection network with adaptive unimodal representation aggregation.
We perform coarse-level fake news detection and cross-modal cosistency learning according to the unimodal and multimodal representations.
Experiments on Weibo and Gossipcop prove that AURA can successfully beat several state-of-the-art FND schemes.
arXiv Detail & Related papers (2022-06-12T14:06:55Z) - Multimodal Categorization of Crisis Events in Social Media [81.07061295887172]
We present a new multimodal fusion method that leverages both images and texts as input.
In particular, we introduce a cross-attention module that can filter uninformative and misleading components from weak modalities.
We show that our method outperforms the unimodal approaches and strong multimodal baselines by a large margin on three crisis-related tasks.
arXiv Detail & Related papers (2020-04-10T06:31:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.