Multi-Modal Semantic Inconsistency Detection in Social Media News Posts
- URL: http://arxiv.org/abs/2105.12855v1
- Date: Wed, 26 May 2021 21:25:27 GMT
- Title: Multi-Modal Semantic Inconsistency Detection in Social Media News Posts
- Authors: Scott McCrae, Kehan Wang, Avideh Zakhor
- Abstract summary: We develop a multi-modal fusion framework to identify mismatches between videos and captions in social media posts.
To train and test our approach, we curate a new video-based dataset of 4,000 real-world Facebook news posts.
- Score: 1.160208922584163
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: As computer-generated content and deepfakes make steady improvements,
semantic approaches to multimedia forensics will become more important. In this
paper, we introduce a novel classification architecture for identifying
semantic inconsistencies between video appearance and text caption in social
media news posts. We develop a multi-modal fusion framework to identify
mismatches between videos and captions in social media posts by leveraging an
ensemble method based on textual analysis of the caption, automatic audio
transcription, semantic video analysis, object detection, named entity
consistency, and facial verification. To train and test our approach, we curate
a new video-based dataset of 4,000 real-world Facebook news posts for analysis.
Our multi-modal approach achieves 60.5% classification accuracy on random
mismatches between caption and appearance, compared to accuracy below 50% for
uni-modal models. Further ablation studies confirm the necessity of fusion
across modalities for correctly identifying semantic inconsistencies.
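The abstract above describes an ensemble that fuses caption text, automatic audio transcription, semantic video features, object detections, named entity consistency, and facial verification. Below is a minimal late-fusion sketch in PyTorch, assuming hypothetical per-modality encoders have already produced fixed-size embeddings; the modality dimensions, hidden size, and two-way output head are illustrative assumptions rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Concatenate per-modality embeddings and score caption/video consistency."""
    def __init__(self, dims):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(sum(dims.values()), 256),
            nn.ReLU(),
            nn.Linear(256, 2),  # class 0: consistent, class 1: mismatched
        )

    def forward(self, feats):
        # feats: dict of modality name -> (batch, dim) tensor; sort keys for a stable concat order
        x = torch.cat([feats[k] for k in sorted(feats)], dim=-1)
        return self.fuse(x)

# Hypothetical modality embedding sizes (assumptions, not taken from the paper).
dims = {"caption": 768, "transcript": 768, "video": 512,
        "objects": 300, "entities": 64, "faces": 128}
model = LateFusionClassifier(dims)
feats = {name: torch.randn(4, d) for name, d in dims.items()}
logits = model(feats)  # (4, 2) consistency logits for a batch of 4 posts
```

In practice each entry of `feats` would come from its own pretrained encoder (for example, a language model for the caption and transcript and a video backbone for the frames); the sketch only shows how a fused classifier could combine them.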
Related papers
- Towards a Robust Framework for Multimodal Hate Detection: A Study on Video vs. Image-based Content [7.5253808885104325]
Social media platforms enable the propagation of hateful content across different modalities.
Recent approaches have shown promise in handling individual modalities, but their effectiveness across different modality combinations remains unexplored.
This paper presents a systematic analysis of fusion-based approaches for multimodal hate detection, focusing on their performance across video and image-based content.
arXiv Detail & Related papers (2025-02-11T00:07:40Z)
- A New Hybrid Intelligent Approach for Multimodal Detection of Suspected Disinformation on TikTok [0.0]
This study introduces a hybrid framework that combines the computational power of deep learning with the interpretability of fuzzy logic to detect suspected disinformation in TikTok videos.
The methodology comprises two core components: a multimodal feature analyser that extracts and evaluates data from text, audio, and video; and a multimodal disinformation detector based on fuzzy logic.
arXiv Detail & Related papers (2025-02-09T12:37:48Z)
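To illustrate how a fuzzy-logic detector can sit on top of per-modality suspicion scores produced by deep models, here is a small, self-contained sketch; the membership functions, rule set, and output centroids are assumptions chosen for illustration and are not taken from the paper.

```python
def high(x):
    """Membership in the fuzzy set 'high suspicion' for a score x in [0, 1]."""
    return max(0.0, min(1.0, (x - 0.4) / 0.4))

def low(x):
    """Membership in the fuzzy set 'low suspicion'."""
    return 1.0 - high(x)

def fuzzy_disinfo_score(text_s, audio_s, video_s):
    # Rule 1: text high AND video high -> strongly suspected
    r1 = min(high(text_s), high(video_s))
    # Rule 2: audio high OR text high -> mildly suspected
    r2 = max(high(audio_s), high(text_s))
    # Rule 3: all modalities low -> not suspected
    r3 = min(low(text_s), low(audio_s), low(video_s))
    # Weighted-average defuzzification over rule activations
    strengths = [r1, r2, r3]
    centroids = [0.9, 0.6, 0.1]  # representative output level for each rule
    denom = sum(strengths) or 1.0
    return sum(s * c for s, c in zip(strengths, centroids)) / denom

print(fuzzy_disinfo_score(text_s=0.8, audio_s=0.3, video_s=0.7))  # ~0.73
```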
- Hierarchical Banzhaf Interaction for General Video-Language Representation Learning [60.44337740854767]
Multimodal representation learning plays an important role in the artificial intelligence domain.
We introduce a new approach that models video and text as game players using multivariate cooperative game theory.
We extend our original structure into a flexible encoder-decoder framework, enabling the model to adapt to various downstream tasks.
arXiv Detail & Related papers (2024-12-30T14:09:15Z)
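The Banzhaf interaction named in the title comes from cooperative game theory. The toy sketch below computes the pairwise Banzhaf interaction index between one video frame and one text token, with an assumed coalition value function (cosine similarity of pooled features) standing in for the paper's learned alignment objective; the player names and features are hypothetical.

```python
from itertools import combinations
import numpy as np

def banzhaf_interaction(i, j, players, value):
    """I(i,j) = 2^-(n-2) * sum over S subset of N\\{i,j} of [v(S+{i,j}) - v(S+{i}) - v(S+{j}) + v(S)]."""
    rest = [p for p in players if p not in (i, j)]
    total = 0.0
    for r in range(len(rest) + 1):
        for subset in combinations(rest, r):
            S = set(subset)
            total += (value(S | {i, j}) - value(S | {i})
                      - value(S | {j}) + value(S))
    return total / (2 ** len(rest))

# Toy game: players are frame/token ids; a coalition's value is the cosine similarity
# between its pooled video features and its pooled text features (0 if either side is empty).
rng = np.random.default_rng(0)
feats = {p: rng.normal(size=16) for p in ["frame0", "frame1", "tok_a", "tok_b"]}

def value(S):
    v = [feats[p] for p in S if p.startswith("frame")]
    t = [feats[p] for p in S if p.startswith("tok")]
    if not v or not t:
        return 0.0
    a, b = np.mean(v, axis=0), np.mean(t, axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(banzhaf_interaction("frame0", "tok_a", list(feats), value))
```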
- Detecting Misinformation in Multimedia Content through Cross-Modal Entity Consistency: A Dual Learning Approach [10.376378437321437]
We propose a Multimedia Misinformation Detection (MultiMD) framework for detecting misinformation from video content by leveraging cross-modal entity consistency.
Our results demonstrate that MultiMD outperforms state-of-the-art baseline models.
arXiv Detail & Related papers (2024-08-16T16:14:36Z)
- Borrowing Human Senses: Comment-Aware Self-Training for Social Media Multimodal Classification [5.960550152906609]
We capture hinting features from user comments, which are retrieved by jointly leveraging visual and linguistic similarity.
The classification tasks are explored via self-training in a teacher-student framework, motivated by the typically limited scale of labeled data.
The results show that our method further advances the performance of previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-27T08:59:55Z)
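As a rough illustration of the teacher-student self-training idea mentioned above (the general technique, not the paper's exact pipeline), the sketch below pseudo-labels unlabeled posts with a teacher and retrains a student on the combined set; the use of scikit-learn classifiers on pre-extracted features and the confidence threshold are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_labeled, y_labeled, X_unlabeled, confidence=0.9):
    """One round of pseudo-label self-training in a teacher-student setup."""
    teacher = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
    probs = teacher.predict_proba(X_unlabeled)
    keep = probs.max(axis=1) >= confidence                 # only confident pseudo-labels
    pseudo_y = teacher.classes_[probs.argmax(axis=1)][keep]
    X_combined = np.vstack([X_labeled, X_unlabeled[keep]])
    y_combined = np.concatenate([y_labeled, pseudo_y])
    student = LogisticRegression(max_iter=1000).fit(X_combined, y_combined)
    return student

# Toy usage with synthetic multimodal feature vectors.
rng = np.random.default_rng(0)
X_l, y_l = rng.normal(size=(50, 8)), rng.integers(0, 2, size=50)
X_u = rng.normal(size=(200, 8))
student = self_train(X_l, y_l, X_u, confidence=0.8)
```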
- UATVR: Uncertainty-Adaptive Text-Video Retrieval [90.8952122146241]
A common practice is to map text-video pairs into the same embedding space and craft cross-modal interactions with certain entities.
We propose an Uncertainty-Adaptive Text-Video Retrieval approach, termed UATVR, which models each look-up as a distribution matching procedure.
arXiv Detail & Related papers (2023-01-16T08:43:17Z)
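A minimal sketch of treating retrieval as distribution matching, inspired by the idea above but not the UATVR implementation: each text and video item is represented as a diagonal Gaussian in embedding space, and the retrieval score is the expected cosine similarity over sampled embeddings. The sample count and the scoring rule are assumptions.

```python
import torch
import torch.nn.functional as F

def sample_embeddings(mu, log_var, k=8):
    """Draw k stochastic embeddings per item from a diagonal Gaussian: (k, batch, dim)."""
    std = (0.5 * log_var).exp()
    eps = torch.randn(k, *mu.shape)
    return mu.unsqueeze(0) + eps * std.unsqueeze(0)

def distribution_match_score(text_mu, text_lv, vid_mu, vid_lv, k=8):
    """Expected cosine similarity between sampled text and video embeddings."""
    t = F.normalize(sample_embeddings(text_mu, text_lv, k), dim=-1)   # (k, Bt, d)
    v = F.normalize(sample_embeddings(vid_mu, vid_lv, k), dim=-1)     # (k, Bv, d)
    sims = torch.einsum("ktd,svd->ktsv", t, v)                        # all sample pairs
    return sims.mean(dim=(0, 2))                                      # (Bt, Bv)

# Toy usage: 3 captions against 5 videos with hypothetical 256-d embeddings.
text_mu, text_lv = torch.randn(3, 256), torch.zeros(3, 256)
vid_mu, vid_lv = torch.randn(5, 256), torch.zeros(5, 256)
print(distribution_match_score(text_mu, text_lv, vid_mu, vid_lv).shape)  # torch.Size([3, 5])
```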
- Learning to Model Multimodal Semantic Alignment for Story Visualization [58.16484259508973]
Story visualization aims to generate a sequence of images to narrate each sentence in a multi-sentence story.
Current works face the problem of semantic misalignment because of their fixed architecture and diversity of input modalities.
We explore the semantic alignment between text and image representations by learning to match their semantic levels in the GAN-based generative model.
arXiv Detail & Related papers (2022-11-14T11:41:44Z)
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text.
There is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
- Misinformation Detection in Social Media Video Posts [0.4724825031148411]
Short-form video shared on social media platforms has become a critical challenge for social media providers.
We develop methods to detect misinformation in social media posts, exploiting modalities such as video and text.
We collect 160,000 video posts from Twitter, and leverage self-supervised learning to learn expressive representations of joint visual and textual data.
arXiv Detail & Related papers (2022-02-15T20:14:54Z)
- Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos [55.52369116870822]
This paper focuses on tackling the problem of temporal language localization in videos.
It aims to identify the start and end points of a moment described by a natural language sentence in an untrimmed video.
arXiv Detail & Related papers (2021-10-12T14:59:25Z)
- NewsCLIPpings: Automatic Generation of Out-of-Context Multimodal Media [93.51739200834837]
We propose a dataset where both image and text are unmanipulated but mismatched.
We introduce several strategies for automatic retrieval of suitable images for the given captions.
Our large-scale automatically generated NewsCLIPpings dataset requires models to jointly analyze both modalities.
arXiv Detail & Related papers (2021-04-13T01:53:26Z)
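Both NewsCLIPpings and the random caption/appearance mismatches used in the main paper rely on constructing pairs that are unmanipulated yet mismatched. The sketch below shows two hypothetical ways to build such pairs from a list of posts: random swapping, and retrieving the most similar non-matching video as a hard negative; the field names and the cosine-similarity retrieval are assumptions, not the exact dataset construction.

```python
import random
import numpy as np

def random_mismatches(posts, seed=0):
    """Pair each video with a caption drawn from a different post (random mismatch)."""
    rng = random.Random(seed)
    out = []
    for i, post in enumerate(posts):
        j = rng.choice([k for k in range(len(posts)) if k != i])
        out.append({"video": post["video"], "caption": posts[j]["caption"], "label": "mismatch"})
    return out

def hard_mismatch(caption_emb, video_embs, true_idx):
    """Return the index of the most similar *other* video, as a hard negative."""
    sims = video_embs @ caption_emb / (
        np.linalg.norm(video_embs, axis=1) * np.linalg.norm(caption_emb) + 1e-8)
    sims[true_idx] = -np.inf
    return int(np.argmax(sims))

posts = [{"video": f"vid_{i}.mp4", "caption": f"caption {i}"} for i in range(4)]
print(random_mismatches(posts)[0])
```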