Enhancing Multimodal Misinformation Detection by Replaying the Whole Story from Image Modality Perspective
- URL: http://arxiv.org/abs/2511.06284v1
- Date: Sun, 09 Nov 2025 08:37:46 GMT
- Title: Enhancing Multimodal Misinformation Detection by Replaying the Whole Story from Image Modality Perspective
- Authors: Bing Wang, Ximing Li, Yanjun Wang, Changchun Li, Lin Yuanbo Wu, Buyu Wang, Shengsheng Wang
- Abstract summary: Multimodal Misinformation Detection (MMD) refers to the task of detecting social media posts involving misinformation, where the post often contains text and image modalities. We propose a new MMD method named RETSIMD. Specifically, we suppose that each text can be divided into several segments, and each text segment describes a partial scene that can be presented by an image. We further incorporate two auxiliary objectives concerning text-image and image-label mutual information, and post-train the generator over an auxiliary text-to-image generation benchmark dataset.
- Score: 23.51937497342985
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Misinformation Detection (MMD) refers to the task of detecting social media posts involving misinformation, where the post often contains text and image modalities. However, by observing MMD posts, we find that the text modality tends to be much more informative than the image modality, because the text generally describes the whole event/story of the post while the image often presents only partial scenes. Our preliminary empirical results indicate that the image modality indeed contributes less to MMD. Building on this observation, we propose a new MMD method named RETSIMD. Specifically, we assume that each text can be divided into several segments, each describing a partial scene that can be depicted by an image. Accordingly, we split the text into a sequence of segments and feed these segments into a pre-trained text-to-image generator to produce a corresponding sequence of images. We further incorporate two auxiliary objectives concerning text-image and image-label mutual information, and post-train the generator on an auxiliary text-to-image generation benchmark dataset. Additionally, we propose a graph structure by defining three heuristic relationships between the generated images, and use a graph neural network to produce the fused features. Extensive empirical results validate the effectiveness of RETSIMD.
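To make the pipeline concrete, below is a minimal sketch of a RETSIMD-style flow in Python: sentence-level segmentation, per-segment image generation, a heuristic image graph, and a small GNN for fusion. The segmentation rule, the generator stand-in, the sequential-adjacency edge heuristic, and the `TinyGCN` module are all illustrative assumptions; the paper's three edge relationships and auxiliary mutual-information objectives are only indicated in comments.

```python
# Minimal sketch of a RETSIMD-style pipeline (an assumption-laden reading of
# the abstract, not the authors' code). Components marked "stand-in" replace
# details the abstract does not specify.
import torch
import torch.nn as nn


def split_into_segments(text: str) -> list[str]:
    """Stand-in segmenter: one segment per sentence."""
    return [s.strip() for s in text.split(".") if s.strip()]


def generate_images(segments: list[str]) -> torch.Tensor:
    """Stand-in for a pre-trained text-to-image generator.

    With Hugging Face diffusers one could do, e.g.:
        from diffusers import StableDiffusionPipeline
        pipe = StableDiffusionPipeline.from_pretrained(
            "runwayml/stable-diffusion-v1-5")
        images = [pipe(seg).images[0] for seg in segments]
    Here random CLIP-sized features keep the sketch self-contained.
    """
    return torch.randn(len(segments), 512)


def build_adjacency(n: int) -> torch.Tensor:
    """Heuristic image graph. The paper defines three relationships between
    images; sequential adjacency plus self-loops is used as a placeholder."""
    adj = torch.eye(n)
    for i in range(n - 1):
        adj[i, i + 1] = adj[i + 1, i] = 1.0
    return adj / adj.sum(dim=1, keepdim=True)  # row-normalize


class TinyGCN(nn.Module):
    """One-layer graph convolution that fuses the generated-image features.
    The paper's auxiliary text-image and image-label mutual-information
    objectives would be added to the training loss; they are omitted here."""

    def __init__(self, dim: int = 512, num_classes: int = 2):
        super().__init__()
        self.lin = nn.Linear(dim, dim)
        self.head = nn.Linear(dim, num_classes)  # real vs. misinformation

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        h = torch.relu(adj @ self.lin(x))  # one round of message passing
        return self.head(h.mean(dim=0))    # pooled, post-level logits


post = "A flood hit the city. Rescue teams arrived. Officials denied it."
feats = generate_images(split_into_segments(post))
logits = TinyGCN()(feats, build_adjacency(feats.size(0)))
```

In a real system the node features would come from an image encoder applied to the generated images, with the text features fused alongside them.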
Related papers
- GEA: Generation-Enhanced Alignment for Text-to-Image Person Retrieval [12.483996028288407]
Text-to-Image Person Retrieval (TIPR) aims to retrieve person images based on natural language descriptions. To address the limitations of existing methods, we propose Generation-Enhanced Alignment (GEA) from a generative perspective. We conduct extensive experiments on three public TIPR datasets, CUHK-PEDES, RSTPReid, and ICFG-PEDES, to evaluate the performance of GEA.
arXiv Detail & Related papers (2025-11-13T10:06:41Z)
- Multimodal Medical Image Binding via Shared Text Embeddings [15.504918331492716]
Multimodal Medical Image Binding with Text (M³Bind) is a novel pre-training framework that enables seamless alignment of medical imaging modalities. M³Bind first fine-tunes CLIP-like image-text models to align their modality-specific text embedding spaces. We show that M³Bind achieves state-of-the-art performance in zero-shot and few-shot classification and cross-modal retrieval tasks (a minimal sketch of the alignment idea follows below).
arXiv Detail & Related papers (2025-06-22T15:39:25Z)
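As referenced in the M³Bind entry above, here is a hedged sketch of the core idea as summarized: pull the text embedding spaces of several modality-specific CLIP-like models toward a shared space, so images from different modalities bind through text. The encoders, projection heads, and symmetric InfoNCE-style loss are illustrative assumptions, not the paper's implementation.

```python
# Sketch: aligning two modality-specific text embedding spaces (M3Bind-style).
# The towers are random stand-ins for CLIP-like text encoders; the projection
# heads and InfoNCE-style loss are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextTower(nn.Module):
    """Stand-in for a CLIP-like text encoder of one imaging modality."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(dim, dim)  # trainable alignment head

    def forward(self, tok_feats: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.proj(tok_feats), dim=-1)


def align_loss(za: torch.Tensor, zb: torch.Tensor, tau: float = 0.07):
    """Symmetric InfoNCE: matching reports across modalities are positives."""
    logits = za @ zb.t() / tau
    target = torch.arange(za.size(0))
    return 0.5 * (F.cross_entropy(logits, target)
                  + F.cross_entropy(logits.t(), target))


xray_txt, mri_txt = TextTower(), TextTower()
reports = torch.randn(8, 512)  # stand-in token features for the same reports
loss = align_loss(xray_txt(reports), mri_txt(reports))
loss.backward()  # updates both projection heads toward a shared text space
```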
- Knowing Where to Focus: Attention-Guided Alignment for Text-based Person Search [64.15205542003056]
We introduce the Attention-Guided Alignment (AGA) framework, featuring two innovative components: Attention-Guided Mask (AGM) Modeling and a Text Enrichment Module (TEM). AGA achieves new state-of-the-art results, with Rank-1 accuracy reaching 78.36%, 67.31%, and 67.4% on CUHK-PEDES, ICFG-PEDES, and RSTPReid, respectively.
arXiv Detail & Related papers (2024-12-19T17:51:49Z)
- SimTxtSeg: Weakly-Supervised Medical Image Segmentation with Simple Text Cues [11.856041847833666]
We present a novel framework, SimTxtSeg, that leverages simple text cues to generate high-quality pseudo-labels.
We evaluate our framework on two medical image segmentation tasks: colonic polyp segmentation and MRI brain tumor segmentation.
arXiv Detail & Related papers (2024-06-27T17:46:13Z)
- TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding [91.30065932213758]
Large Multimodal Models (LMMs) have sparked a surge in research aimed at harnessing their remarkable reasoning abilities.
We propose TextCoT, a novel Chain-of-Thought framework for text-rich image understanding.
Our method is free of extra training, offering immediate plug-and-play functionality.
arXiv Detail & Related papers (2024-04-15T13:54:35Z)
- Text-based Person Search without Parallel Image-Text Data [52.63433741872629]
Text-based person search (TBPS) aims to retrieve the images of the target person from a large image gallery based on a given natural language description.
Existing methods are dominated by training models with parallel image-text pairs, which are very costly to collect.
In this paper, we make the first attempt to explore TBPS without parallel image-text data.
arXiv Detail & Related papers (2023-05-22T12:13:08Z)
- HGAN: Hierarchical Graph Alignment Network for Image-Text Retrieval [13.061063817876336]
We propose a novel Hierarchical Graph Alignment Network (HGAN) for image-text retrieval.
First, to capture the comprehensive multimodal features, we construct the feature graphs for the image and text modality respectively.
Then, a multi-granularity shared space is established with the designed Multi-granularity Feature Aggregation and Rearrangement (MFAR) module.
Finally, the ultimate image and text features are further refined through three-level similarity functions to achieve hierarchical alignment (a rough sketch follows below).
arXiv Detail & Related papers (2022-12-16T05:08:52Z)
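As referenced in the HGAN entry above, here is a rough sketch of the "three-level similarity" idea: score an image-text pair at node, sub-graph, and graph granularity and combine the scores. The feature shapes, max-pooled cosine matching, and equal weighting are assumptions; the actual MFAR module is not reproduced.

```python
# Sketch: three-level similarity for hierarchical image-text alignment
# (HGAN-style). Node / sub-graph / graph features are random stand-ins and
# the equal-weight combination is an assumption.
import torch
import torch.nn.functional as F


def level_sim(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Mean of best-matching cosine similarities between two feature sets."""
    sims = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).t()
    return sims.max(dim=1).values.mean()


# Stand-in hierarchical features for one image and one caption.
img_nodes, txt_nodes = torch.randn(36, 256), torch.randn(12, 256)  # regions / words
img_sub, txt_sub = torch.randn(6, 256), torch.randn(4, 256)        # sub-graphs / phrases
img_glob, txt_glob = torch.randn(1, 256), torch.randn(1, 256)      # whole image / sentence

score = (level_sim(img_nodes, txt_nodes)
         + level_sim(img_sub, txt_sub)
         + level_sim(img_glob, txt_glob)) / 3.0  # hierarchical match score
```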
- NewsStories: Illustrating articles with visual summaries [49.924916589209374]
We introduce a large-scale multimodal dataset containing over 31M articles, 22M images and 1M videos.
We show that state-of-the-art image-text alignment methods are not robust to longer narratives with multiple images.
We introduce an intuitive baseline that outperforms these methods on zero-shot image-set retrieval by 10% on the GoodNews dataset.
arXiv Detail & Related papers (2022-07-26T17:34:11Z)
- TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
The StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN.
The visual-linguistic similarity module learns text-image matching by mapping the image and the text into a common embedding space.
Instance-level optimization is used for identity preservation during manipulation.
arXiv Detail & Related papers (2020-12-06T16:20:19Z)
- Dense Relational Image Captioning via Multi-task Triple-Stream Networks [95.0476489266988]
We introduce dense relational captioning, a novel task which aims to generate captions with respect to relational information between objects in a visual scene.
This framework is advantageous in both the diversity and the amount of information, leading to comprehensive image understanding.
arXiv Detail & Related papers (2020-10-08T09:17:55Z)
- VICTR: Visual Information Captured Text Representation for Text-to-Image Multimodal Tasks [5.840117063192334]
We propose a new visual contextual text representation for text-to-image multimodal tasks, VICTR, which captures rich visual semantic information of objects from the text input.
We encode the extracted objects, attributes, and relations of the scene graph, together with the corresponding geometric relation information, using Graph Convolutional Networks (a minimal sketch of this step follows the list below).
The text representation is aggregated with word-level and sentence-level embeddings to generate both visual-contextual word and sentence representations.
arXiv Detail & Related papers (2020-10-07T05:25:30Z)
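As referenced in the VICTR entry above, here is a minimal sketch of encoding scene-graph objects with a graph convolution, loosely in the spirit of VICTR's GCN step. The triple format, feature sizes, and hand-rolled convolution are assumptions for illustration; relations enter only as edges here, a simplification of the paper's richer relation and attribute encoding.

```python
# Sketch: encoding scene-graph triples (subject, relation, object) with a
# graph convolution, VICTR-style. Vocabulary, embedding size, and the
# convolution itself are illustrative assumptions.
import torch
import torch.nn as nn

triples = [("man", "riding", "horse"), ("horse", "on", "beach")]

# Build a node vocabulary from subjects and objects.
nodes = sorted({t[0] for t in triples} | {t[2] for t in triples})
idx = {n: i for i, n in enumerate(nodes)}

# Adjacency with self-loops; one undirected edge per (subject, object) pair.
n = len(nodes)
adj = torch.eye(n)
for s, _, o in triples:
    adj[idx[s], idx[o]] = adj[idx[o], idx[s]] = 1.0
adj = adj / adj.sum(dim=1, keepdim=True)  # row-normalize


class SceneGraphGCN(nn.Module):
    def __init__(self, vocab: int, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)  # node (object) embeddings
        self.lin = nn.Linear(dim, dim)

    def forward(self, adj: torch.Tensor) -> torch.Tensor:
        x = self.embed.weight                  # (n, dim) node features
        return torch.relu(adj @ self.lin(x))   # one message-passing round


gcn = SceneGraphGCN(vocab=n)
node_feats = gcn(adj)  # visual-contextual node representations
# These would then be aggregated with word- and sentence-level text embeddings.
```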
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences arising from its use.