Multi-Modal Semantic Inconsistency Detection in Social Media News Posts
- URL: http://arxiv.org/abs/2105.12855v1
- Date: Wed, 26 May 2021 21:25:27 GMT
- Title: Multi-Modal Semantic Inconsistency Detection in Social Media News Posts
- Authors: Scott McCrae, Kehan Wang, Avideh Zakhor
- Abstract summary: We develop a multi-modal fusion framework to identify mismatches between videos and captions in social media posts.
To train and test our approach, we curate a new video-based dataset of 4,000 real-world Facebook news posts.
- Score: 1.160208922584163
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: As computer-generated content and deepfakes make steady improvements,
semantic approaches to multimedia forensics will become more important. In this
paper, we introduce a novel classification architecture for identifying
semantic inconsistencies between video appearance and text caption in social
media news posts. We develop a multi-modal fusion framework to identify
mismatches between videos and captions in social media posts by leveraging an
ensemble method based on textual analysis of the caption, automatic audio
transcription, semantic video analysis, object detection, named entity
consistency, and facial verification. To train and test our approach, we curate
a new video-based dataset of 4,000 real-world Facebook news posts for analysis.
Our multi-modal approach achieves 60.5% classification accuracy on random
mismatches between caption and appearance, compared to accuracy below 50% for
uni-modal models. Further ablation studies confirm the necessity of fusion
across modalities for correctly identifying semantic inconsistencies.
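For illustration, here is a minimal late-fusion sketch of the kind of classifier the abstract describes, assuming each modality (caption text, audio transcription, video, detected objects, named entities, faces) has already been encoded offline into a fixed-size embedding. The class name FusionClassifier, the embedding sizes, and the concatenate-then-MLP design are illustrative assumptions, not the authors' exact ensemble.

```python
# A minimal late-fusion sketch, assuming each modality has already been
# encoded into a fixed-size embedding. The class name, embedding sizes,
# and concatenate-then-MLP design are placeholders, not the paper's
# exact ensemble method.
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Binary classifier: is the caption consistent with the video?"""

    def __init__(self, dims, hidden=256):
        super().__init__()
        # dims: one embedding size per stream, e.g. caption text,
        # audio transcript, video, objects, named entities, faces.
        self.mlp = nn.Sequential(
            nn.Linear(sum(dims), hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, 2),  # pristine vs. mismatched
        )

    def forward(self, features):
        # features: list of (batch, dim_i) tensors, one per modality.
        fused = torch.cat(features, dim=-1)
        return self.mlp(fused)

# Toy usage with random stand-ins for the six modality embeddings.
dims = [768, 768, 512, 256, 64, 128]
model = FusionClassifier(dims)
feats = [torch.randn(4, d) for d in dims]
logits = model(feats)             # shape (4, 2)
pred = logits.argmax(dim=-1)      # 0 = consistent, 1 = mismatched
```

The abstract's ablation result (uni-modal accuracy below 50% versus 60.5% fused) is exactly what such a design probes: no single stream carries enough signal on its own.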
Related papers
- Detecting Misinformation in Multimedia Content through Cross-Modal Entity Consistency: A Dual Learning Approach [10.376378437321437]
We propose a Multimedia Misinformation Detection (MultiMD) framework for detecting misinformation from video content by leveraging cross-modal entity consistency.
Our results demonstrate that MultiMD outperforms state-of-the-art baseline models.
arXiv Detail & Related papers (2024-08-16T16:14:36Z)
- From Text to Pixels: A Context-Aware Semantic Synergy Solution for Infrared and Visible Image Fusion [66.33467192279514]
We introduce a text-guided multi-modality image fusion method that leverages the high-level semantics from textual descriptions to integrate semantics from infrared and visible images.
Our method not only produces visually superior fusion results but also achieves a higher detection mAP than existing methods, reaching state-of-the-art results.
arXiv Detail & Related papers (2023-12-31T08:13:47Z)
- Borrowing Human Senses: Comment-Aware Self-Training for Social Media Multimodal Classification [5.960550152906609]
We capture hinting features from user comments, which are retrieved via jointly leveraging visual and lingual similarity.
The classification tasks are explored via self-training in a teacher-student framework, motivated by the usually limited labeled data scales.
The results show that our method further advances the performance of previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-27T08:59:55Z)
- UATVR: Uncertainty-Adaptive Text-Video Retrieval [90.8952122146241]
A common practice is to transfer text-video pairs to the same embedding space and craft cross-modal interactions with certain entities.
We propose an Uncertainty-Adaptive Text-Video Retrieval approach, termed UATVR, which models each look-up as a distribution matching procedure.
arXiv Detail & Related papers (2023-01-16T08:43:17Z)
- Learning to Model Multimodal Semantic Alignment for Story Visualization [58.16484259508973]
Story visualization aims to generate a sequence of images to narrate each sentence in a multi-sentence story.
Current works face the problem of semantic misalignment because of their fixed architecture and diversity of input modalities.
We explore the semantic alignment between text and image representations by learning to match their semantic levels in the GAN-based generative model.
arXiv Detail & Related papers (2022-11-14T11:41:44Z)
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text.
There is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
- Misinformation Detection in Social Media Video Posts [0.4724825031148411]
The growing adoption of short-form video by social media platforms has made misinformation detection a critical challenge for providers.
We develop methods to detect misinformation in social media posts, exploiting modalities such as video and text.
We collect 160,000 video posts from Twitter, and leverage self-supervised learning to learn expressive representations of joint visual and textual data.
arXiv Detail & Related papers (2022-02-15T20:14:54Z)
- Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos [55.52369116870822]
This paper focuses on tackling the problem of temporal language localization in videos.
It aims to identify the start and end points of a moment described by a natural language sentence in an untrimmed video.
arXiv Detail & Related papers (2021-10-12T14:59:25Z)
- NewsCLIPpings: Automatic Generation of Out-of-Context Multimodal Media [93.51739200834837]
We propose a dataset where both image and text are unmanipulated but mismatched.
We introduce several strategies for automatic retrieval of suitable images for the given captions.
Our large-scale automatically generated NewsCLIPpings dataset requires models to jointly analyze both modalities (a retrieval-based sketch of this mismatch generation follows the list below).
arXiv Detail & Related papers (2021-04-13T01:53:26Z)
- MEG: Multi-Evidence GNN for Multimodal Semantic Forensics [28.12652559292884]
Fake news often involves semantic manipulations across modalities such as image, text, and location.
Recent research has centered the problem around images, calling it image repurposing.
We introduce a novel graph neural network based model for multimodal semantic forensics.
arXiv Detail & Related papers (2020-11-23T09:01:28Z)
- Preserving Semantic Neighborhoods for Robust Cross-modal Retrieval [41.505920288928365]
The abundance of multimodal data has inspired interest in cross-modal retrieval methods.
We propose novel within-modality losses which encourage semantic coherency in both the text and image subspaces.
Our method ensures that not only are paired images and texts close, but the expected image-image and text-text relationships are also observed.
arXiv Detail & Related papers (2020-07-16T20:32:54Z)
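As a companion to the NewsCLIPpings entry above, here is a minimal sketch of retrieval-based mismatch generation: for each caption, pick the most similar image from a different post, so the resulting pair is unmanipulated but out of context. The helper make_mismatches and the use of generic L2-normalized embeddings (e.g. CLIP features) are assumptions for illustration, not the NewsCLIPpings authors' code.

```python
# A minimal sketch of retrieval-based mismatch generation, assuming
# L2-normalized embeddings (e.g. CLIP features) for n aligned
# caption/image pairs. make_mismatches is a hypothetical helper.
import numpy as np

def make_mismatches(cap_emb: np.ndarray, img_emb: np.ndarray) -> list:
    """For caption i, return the index of the most similar image that
    does NOT belong to it, yielding a hard out-of-context pair."""
    sim = cap_emb @ img_emb.T        # (n, n) cosine similarities
    np.fill_diagonal(sim, -np.inf)   # forbid each caption's true image
    return sim.argmax(axis=1).tolist()

# Toy usage with random unit vectors standing in for real features.
rng = np.random.default_rng(0)
caps = rng.normal(size=(8, 512))
imgs = rng.normal(size=(8, 512))
caps /= np.linalg.norm(caps, axis=1, keepdims=True)
imgs /= np.linalg.norm(imgs, axis=1, keepdims=True)
swapped = make_mismatches(caps, imgs)  # swapped[i] != i for every i
```

Note the contrast with the main paper's evaluation, which uses random caption/appearance mismatches; the retrieval variant sketched here produces harder negatives by design.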
This list is automatically generated from the titles and abstracts of the papers on this site.