Multimodal Rumor Detection Enhanced by External Evidence and Forgery Features
- URL: http://arxiv.org/abs/2601.14954v1
- Date: Wed, 21 Jan 2026 12:53:18 GMT
- Title: Multimodal Rumor Detection Enhanced by External Evidence and Forgery Features
- Authors: Han Li, Hua Sun
- Abstract summary: Social media increasingly disseminates information through mixed image-text posts. Deep semantic mismatch rumors pose particular challenges and threaten online public opinion. Existing multimodal rumor detection methods suffer from limited feature extraction, noisy alignment, and inflexible fusion strategies. We propose a multimodal rumor detection model enhanced with external evidence and forgery features.
- Score: 21.522558828688343
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Social media increasingly disseminates information through mixed image-text posts, but rumors often exploit subtle inconsistencies and forged content, making detection based solely on post content difficult. Deep semantic mismatch rumors, which superficially align images and texts, pose particular challenges and threaten online public opinion. Existing multimodal rumor detection methods improve cross-modal modeling but suffer from limited feature extraction, noisy alignment, and inflexible fusion strategies, while ignoring the external factual evidence necessary for verifying complex rumors. To address these limitations, we propose a multimodal rumor detection model enhanced with external evidence and forgery features. The model uses a ResNet34 visual encoder, a BERT text encoder, and a forgery feature module that extracts frequency-domain traces and compression artifacts via the Fourier transform. BLIP-generated image descriptions bridge the image and text semantic spaces. A dual contrastive learning module computes contrastive losses between text-image and text-description pairs, improving the detection of semantic inconsistencies. A gated adaptive feature-scaling fusion mechanism dynamically adjusts multimodal fusion and reduces redundancy. Experiments on the Weibo and Twitter datasets demonstrate that our model outperforms mainstream baselines in macro accuracy, recall, and F1 score.
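Since the abstract names the model's concrete building blocks, a minimal PyTorch sketch of the three less standard ones may help make them concrete. Everything below is reconstructed from the abstract alone: the class names, layer sizes, the InfoNCE form of the dual contrastive loss, and the sigmoid gating are assumptions, not the authors' released implementation.

```python
# Hypothetical sketch reconstructed from the abstract alone; names,
# layer sizes, and loss form are assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ForgeryFeatureModule(nn.Module):
    """Extracts frequency-domain forgery cues from the FFT magnitude spectrum."""
    def __init__(self, out_dim: int = 256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, out_dim),
        )

    def forward(self, img: torch.Tensor) -> torch.Tensor:  # img: (B, 3, H, W)
        gray = img.mean(dim=1, keepdim=True)                # (B, 1, H, W)
        spec = torch.fft.fft2(gray)                         # 2-D Fourier transform
        # Centered log-magnitude spectrum; compression artifacts surface here.
        mag = torch.log1p(torch.abs(torch.fft.fftshift(spec, dim=(-2, -1))))
        return self.proj(mag)                               # (B, out_dim)

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE loss; matched pairs sit on the diagonal."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                        # (B, B) similarities
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

class GatedFusion(nn.Module):
    """Gated adaptive feature scaling over concatenated modality features."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(3 * dim, 3 * dim)

    def forward(self, text_f, img_f, forgery_f):
        cat = torch.cat([text_f, img_f, forgery_f], dim=-1)
        g = torch.sigmoid(self.gate(cat))   # per-dimension scales in (0, 1)
        return g * cat                      # down-weights redundant features

# Dual contrastive objective: text vs. image and text vs. BLIP caption.
# loss = info_nce(text_emb, image_emb) + info_nce(text_emb, caption_emb)
```

Under this reading, the gate lets the model down-weight whichever modality is redundant for a given post, which matches the abstract's claim that adaptive fusion reduces redundancy.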
Related papers
- Enhancing Weakly Supervised Multimodal Video Anomaly Detection through Text Guidance [10.079930398169205]
Text provides explicit semantic information that can enhance anomaly characterization and reduce false alarms. However, extracting effective text features is challenging because general-purpose language models fail to capture anomaly-specific nuances, and multimodal fusion often suffers from redundancy and imbalance.
arXiv Detail & Related papers (2026-02-11T05:44:30Z) - Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach [99.80480649258557]
DiTFuse is an instruction-driven framework that performs semantics-aware fusion within a single model. Experiments on public IVIF, MFF, and MEF benchmarks confirm superior quantitative and qualitative performance, sharper textures, and better semantic retention.
arXiv Detail & Related papers (2025-12-08T05:04:54Z) - Multimodal Feature Fusion Network with Text Difference Enhancement for Remote Sensing Change Detection [36.96267014127019]
MMChange is a multimodal RSCD method that combines image and text modalities to enhance accuracy and robustness. To overcome the semantic limitations of image features, we employ a vision language model (VLM) to generate semantic descriptions of bitemporal images. A Textual Difference Enhancement (TDE) module captures fine-grained semantic shifts, guiding the model toward meaningful changes.
arXiv Detail & Related papers (2025-09-04T07:39:18Z) - NOFT: Test-Time Noise Finetune via Information Bottleneck for Highly Correlated Asset Creation [70.96827354717459]
The diffusion model provides a strong tool for implementing text-to-image (T2I) and image-to-image (I2I) generation. We propose a noise-finetuning NOFT module, employed by Stable Diffusion, to generate highly correlated and diverse images.
arXiv Detail & Related papers (2025-05-18T05:09:47Z) - Text-DiFuse: An Interactive Multi-Modal Image Fusion Framework based on Text-modulated Diffusion Model [30.739879255847946]
Existing multi-modal image fusion methods fail to address the compound degradations present in source images.
This study proposes a novel interactive multi-modal image fusion framework based on the text-modulated diffusion model, called Text-DiFuse.
arXiv Detail & Related papers (2024-10-31T13:10:50Z) - Multimodal Unlearnable Examples: Protecting Data against Multimodal Contrastive Learning [53.766434746801366]
Multimodal contrastive learning (MCL) has shown remarkable advances in zero-shot classification by learning from millions of image-caption pairs crawled from the Internet.
Hackers may exploit image-text data for model training without authorization, potentially including personal and privacy-sensitive information.
Recent works propose generating unlearnable examples by adding imperceptible perturbations to training images to build shortcuts for protection.
We propose Multi-step Error Minimization (MEM), a novel optimization process for generating multimodal unlearnable examples.
arXiv Detail & Related papers (2024-07-23T09:00:52Z) - On the Multi-modal Vulnerability of Diffusion Models [56.08923332178462]
We propose MMP-Attack to manipulate the generation results of diffusion models by appending a specific suffix to the original prompt. Our goal is to induce diffusion models to generate a specific object while simultaneously eliminating the original object.
arXiv Detail & Related papers (2024-02-02T12:39:49Z) - Parents and Children: Distinguishing Multimodal DeepFakes from Natural Images [60.34381768479834]
Recent advancements in diffusion models have enabled the generation of realistic deepfakes from textual prompts in natural language.
We pioneer a systematic study on the detection of deepfakes generated by state-of-the-art diffusion models.
arXiv Detail & Related papers (2023-04-02T10:25:09Z) - Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift [50.64474103506595]
We investigate the robustness of 12 popular open-sourced image-text models under common perturbations on five tasks.
Character-level perturbations constitute the most severe distribution shift for text, and zoom blur is the most severe shift for image data.
arXiv Detail & Related papers (2022-12-15T18:52:03Z) - Multimodal Fake News Detection with Adaptive Unimodal Representation Aggregation [28.564442206829625]
AURA is a multimodal fake news detection network with adaptive unimodal representation aggregation.
We perform coarse-level fake news detection and cross-modal consistency learning according to the unimodal and multimodal representations.
Experiments on Weibo and Gossipcop show that AURA outperforms several state-of-the-art FND schemes.
arXiv Detail & Related papers (2022-06-12T14:06:55Z) - FiLMing Multimodal Sarcasm Detection with Attention [0.7340017786387767]
Sarcasm detection identifies natural language expressions whose intended meaning differs from their literal surface meaning.
We propose a novel architecture that uses the RoBERTa model with a co-attention layer on top to incorporate context incongruity between input text and image attributes.
Our results demonstrate that our proposed model outperforms the existing state-of-the-art method by 6.14% in F1 score on the public Twitter multimodal detection dataset.
arXiv Detail & Related papers (2021-08-09T06:33:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.