Multi-Modal Semantic Inconsistency Detection in Social Media News Posts
- URL: http://arxiv.org/abs/2105.12855v1
- Date: Wed, 26 May 2021 21:25:27 GMT
- Title: Multi-Modal Semantic Inconsistency Detection in Social Media News Posts
- Authors: Scott McCrae, Kehan Wang, Avideh Zakhor
- Abstract summary: We develop a multi-modal fusion framework to identify mismatches between videos and captions in social media posts.
To train and test our approach, we curate a new video-based dataset of 4,000 real-world Facebook news posts.
- Score: 1.160208922584163
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: As computer-generated content and deepfakes make steady improvements,
semantic approaches to multimedia forensics will become more important. In this
paper, we introduce a novel classification architecture for identifying
semantic inconsistencies between video appearance and text caption in social
media news posts. We develop a multi-modal fusion framework to identify
mismatches between videos and captions in social media posts by leveraging an
ensemble method based on textual analysis of the caption, automatic audio
transcription, semantic video analysis, object detection, named entity
consistency, and facial verification. To train and test our approach, we curate
a new video-based dataset of 4,000 real-world Facebook news posts for analysis.
Our multi-modal approach achieves 60.5% classification accuracy on random
mismatches between caption and appearance, compared to accuracy below 50% for
uni-modal models. Further ablation studies confirm the necessity of fusion
across modalities for correctly identifying semantic inconsistencies.
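The abstract above describes an ensemble that fuses caption text, automatic audio transcription, semantic video features, object detections, named entity consistency, and facial verification. Below is a minimal late-fusion sketch in PyTorch, assuming hypothetical per-modality encoders have already produced fixed-size embeddings; the modality dimensions, hidden size, and two-way output head are illustrative assumptions rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Concatenate per-modality embeddings and score caption/video consistency."""
    def __init__(self, dims):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(sum(dims.values()), 256),
            nn.ReLU(),
            nn.Linear(256, 2),  # class 0: consistent, class 1: mismatched
        )

    def forward(self, feats):
        # feats: dict of modality name -> (batch, dim) tensor; sort keys for a stable concat order
        x = torch.cat([feats[k] for k in sorted(feats)], dim=-1)
        return self.fuse(x)

# Hypothetical modality embedding sizes (assumptions, not taken from the paper).
dims = {"caption": 768, "transcript": 768, "video": 512,
        "objects": 300, "entities": 64, "faces": 128}
model = LateFusionClassifier(dims)
feats = {name: torch.randn(4, d) for name, d in dims.items()}
logits = model(feats)  # (4, 2) consistency logits for a batch of 4 posts
```

In practice each entry of `feats` would come from its own pretrained encoder (for example, a language model for the caption and transcript and a video backbone for the frames); the sketch only shows how a fused classifier could combine them.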
Related papers
- Towards a Robust Framework for Multimodal Hate Detection: A Study on Video vs. Image-based Content [7.5253808885104325]
Social media platforms enable the propagation of hateful content across different modalities.
Recent approaches have shown promise in handling individual modalities, but their effectiveness across different modality combinations remains unexplored.
This paper presents a systematic analysis of fusion-based approaches for multimodal hate detection, focusing on their performance across video and image-based content.
arXiv Detail & Related papers (2025-02-11T00:07:40Z)
- A New Hybrid Intelligent Approach for Multimodal Detection of Suspected Disinformation on TikTok [0.0]
This study introduces a hybrid framework that combines the computational power of deep learning with the interpretability of fuzzy logic to detect suspected disinformation in TikTok videos.
The methodology comprises two core components: a multimodal feature analyser that extracts and evaluates data from text, audio, and video; and a multimodal disinformation detector based on fuzzy logic.
arXiv Detail & Related papers (2025-02-09T12:37:48Z)
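To illustrate how a fuzzy-logic detector can sit on top of per-modality suspicion scores produced by deep models, here is a small, self-contained sketch; the membership functions, rule set, and output centroids are assumptions chosen for illustration and are not taken from the paper.

```python
def high(x):
    """Membership in the fuzzy set 'high suspicion' for a score x in [0, 1]."""
    return max(0.0, min(1.0, (x - 0.4) / 0.4))

def low(x):
    """Membership in the fuzzy set 'low suspicion'."""
    return 1.0 - high(x)

def fuzzy_disinfo_score(text_s, audio_s, video_s):
    # Rule 1: text high AND video high -> strongly suspected
    r1 = min(high(text_s), high(video_s))
    # Rule 2: audio high OR text high -> mildly suspected
    r2 = max(high(audio_s), high(text_s))
    # Rule 3: all modalities low -> not suspected
    r3 = min(low(text_s), low(audio_s), low(video_s))
    # Weighted-average defuzzification over rule activations
    strengths = [r1, r2, r3]
    centroids = [0.9, 0.6, 0.1]  # representative output level for each rule
    denom = sum(strengths) or 1.0
    return sum(s * c for s, c in zip(strengths, centroids)) / denom

print(fuzzy_disinfo_score(text_s=0.8, audio_s=0.3, video_s=0.7))  # ~0.73
```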
- Hierarchical Banzhaf Interaction for General Video-Language Representation Learning [60.44337740854767]
Multimodal representation learning plays an important role in the artificial intelligence domain.
We introduce a new approach that models video and text as game players using multivariate cooperative game theory.
We extend our original structure into a flexible encoder-decoder framework, enabling the model to adapt to various downstream tasks.
arXiv Detail & Related papers (2024-12-30T14:09:15Z)
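The Banzhaf interaction named in the title comes from cooperative game theory. The toy sketch below computes the pairwise Banzhaf interaction index between one video frame and one text token, with an assumed coalition value function (cosine similarity of pooled features) standing in for the paper's learned alignment objective; the player names and features are hypothetical.

```python
from itertools import combinations
import numpy as np

def banzhaf_interaction(i, j, players, value):
    """I(i,j) = 2^-(n-2) * sum over S subset of N\\{i,j} of [v(S+{i,j}) - v(S+{i}) - v(S+{j}) + v(S)]."""
    rest = [p for p in players if p not in (i, j)]
    total = 0.0
    for r in range(len(rest) + 1):
        for subset in combinations(rest, r):
            S = set(subset)
            total += (value(S | {i, j}) - value(S | {i})
                      - value(S | {j}) + value(S))
    return total / (2 ** len(rest))

# Toy game: players are frame/token ids; a coalition's value is the cosine similarity
# between its pooled video features and its pooled text features (0 if either side is empty).
rng = np.random.default_rng(0)
feats = {p: rng.normal(size=16) for p in ["frame0", "frame1", "tok_a", "tok_b"]}

def value(S):
    v = [feats[p] for p in S if p.startswith("frame")]
    t = [feats[p] for p in S if p.startswith("tok")]
    if not v or not t:
        return 0.0
    a, b = np.mean(v, axis=0), np.mean(t, axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(banzhaf_interaction("frame0", "tok_a", list(feats), value))
```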
- Detecting Misinformation in Multimedia Content through Cross-Modal Entity Consistency: A Dual Learning Approach [10.376378437321437]
We propose a Multimedia Misinformation Detection (MultiMD) framework for detecting misinformation from video content by leveraging cross-modal entity consistency.
Our results demonstrate that MultiMD outperforms state-of-the-art baseline models.
arXiv Detail & Related papers (2024-08-16T16:14:36Z)
- Borrowing Human Senses: Comment-Aware Self-Training for Social Media Multimodal Classification [5.960550152906609]
We capture hinting features from user comments, which are retrieved by jointly leveraging visual and linguistic similarity.
The classification tasks are explored via self-training in a teacher-student framework, motivated by the typically limited scale of labeled data.
The results show that our method further advances the performance of previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-27T08:59:55Z)
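As a rough illustration of the teacher-student self-training idea mentioned above (the general technique, not the paper's exact pipeline), the sketch below pseudo-labels unlabeled posts with a teacher and retrains a student on the combined set; the use of scikit-learn classifiers on pre-extracted features and the confidence threshold are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_labeled, y_labeled, X_unlabeled, confidence=0.9):
    """One round of pseudo-label self-training in a teacher-student setup."""
    teacher = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
    probs = teacher.predict_proba(X_unlabeled)
    keep = probs.max(axis=1) >= confidence                 # only confident pseudo-labels
    pseudo_y = teacher.classes_[probs.argmax(axis=1)][keep]
    X_combined = np.vstack([X_labeled, X_unlabeled[keep]])
    y_combined = np.concatenate([y_labeled, pseudo_y])
    student = LogisticRegression(max_iter=1000).fit(X_combined, y_combined)
    return student

# Toy usage with synthetic multimodal feature vectors.
rng = np.random.default_rng(0)
X_l, y_l = rng.normal(size=(50, 8)), rng.integers(0, 2, size=50)
X_u = rng.normal(size=(200, 8))
student = self_train(X_l, y_l, X_u, confidence=0.8)
```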
- UATVR: Uncertainty-Adaptive Text-Video Retrieval [90.8952122146241]
A common practice is to map text-video pairs into the same embedding space and craft cross-modal interactions with certain entities.
We propose an Uncertainty-Adaptive Text-Video Retrieval approach, termed UATVR, which models each look-up as a distribution matching procedure.
arXiv Detail & Related papers (2023-01-16T08:43:17Z)
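A minimal sketch of treating retrieval as distribution matching, inspired by the idea above but not the UATVR implementation: each text and video item is represented as a diagonal Gaussian in embedding space, and the retrieval score is the expected cosine similarity over sampled embeddings. The sample count and the scoring rule are assumptions.

```python
import torch
import torch.nn.functional as F

def sample_embeddings(mu, log_var, k=8):
    """Draw k stochastic embeddings per item from a diagonal Gaussian: (k, batch, dim)."""
    std = (0.5 * log_var).exp()
    eps = torch.randn(k, *mu.shape)
    return mu.unsqueeze(0) + eps * std.unsqueeze(0)

def distribution_match_score(text_mu, text_lv, vid_mu, vid_lv, k=8):
    """Expected cosine similarity between sampled text and video embeddings."""
    t = F.normalize(sample_embeddings(text_mu, text_lv, k), dim=-1)   # (k, Bt, d)
    v = F.normalize(sample_embeddings(vid_mu, vid_lv, k), dim=-1)     # (k, Bv, d)
    sims = torch.einsum("ktd,svd->ktsv", t, v)                        # all sample pairs
    return sims.mean(dim=(0, 2))                                      # (Bt, Bv)

# Toy usage: 3 captions against 5 videos with hypothetical 256-d embeddings.
text_mu, text_lv = torch.randn(3, 256), torch.zeros(3, 256)
vid_mu, vid_lv = torch.randn(5, 256), torch.zeros(5, 256)
print(distribution_match_score(text_mu, text_lv, vid_mu, vid_lv).shape)  # torch.Size([3, 5])
```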
- Learning to Model Multimodal Semantic Alignment for Story Visualization [58.16484259508973]
Story visualization aims to generate a sequence of images to narrate each sentence in a multi-sentence story.
Current works face the problem of semantic misalignment because of their fixed architecture and diversity of input modalities.
We explore the semantic alignment between text and image representations by learning to match their semantic levels in the GAN-based generative model.
arXiv Detail & Related papers (2022-11-14T11:41:44Z)
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text.
There is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
- Misinformation Detection in Social Media Video Posts [0.4724825031148411]
Short-form video shared on social media platforms has become a critical challenge for social media providers.
We develop methods to detect misinformation in social media posts, exploiting modalities such as video and text.
We collect 160,000 video posts from Twitter, and leverage self-supervised learning to learn expressive representations of joint visual and textual data.
arXiv Detail & Related papers (2022-02-15T20:14:54Z)
- Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos [55.52369116870822]
This paper focuses on tackling the problem of temporal language localization in videos.
It aims to identify the start and end points of a moment described by a natural language sentence in an untrimmed video.
arXiv Detail & Related papers (2021-10-12T14:59:25Z)
- NewsCLIPpings: Automatic Generation of Out-of-Context Multimodal Media [93.51739200834837]
We propose a dataset where both image and text are unmanipulated but mismatched.
We introduce several strategies for automatic retrieval of suitable images for the given captions.
Our large-scale automatically generated NewsCLIPpings dataset requires models to jointly analyze both modalities.
arXiv Detail & Related papers (2021-04-13T01:53:26Z)
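Both NewsCLIPpings and the random caption/appearance mismatches used in the main paper rely on constructing pairs that are unmanipulated yet mismatched. The sketch below shows two hypothetical ways to build such pairs from a list of posts: random swapping, and retrieving the most similar non-matching video as a hard negative; the field names and the cosine-similarity retrieval are assumptions, not the exact dataset construction.

```python
import random
import numpy as np

def random_mismatches(posts, seed=0):
    """Pair each video with a caption drawn from a different post (random mismatch)."""
    rng = random.Random(seed)
    out = []
    for i, post in enumerate(posts):
        j = rng.choice([k for k in range(len(posts)) if k != i])
        out.append({"video": post["video"], "caption": posts[j]["caption"], "label": "mismatch"})
    return out

def hard_mismatch(caption_emb, video_embs, true_idx):
    """Return the index of the most similar *other* video, as a hard negative."""
    sims = video_embs @ caption_emb / (
        np.linalg.norm(video_embs, axis=1) * np.linalg.norm(caption_emb) + 1e-8)
    sims[true_idx] = -np.inf
    return int(np.argmax(sims))

posts = [{"video": f"vid_{i}.mp4", "caption": f"caption {i}"} for i in range(4)]
print(random_mismatches(posts)[0])
```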