Open-Domain, Content-based, Multi-modal Fact-checking of Out-of-Context
Images via Online Resources
- URL: http://arxiv.org/abs/2112.00061v1
- Date: Tue, 30 Nov 2021 19:36:20 GMT
- Title: Open-Domain, Content-based, Multi-modal Fact-checking of Out-of-Context
Images via Online Resources
- Authors: Sahar Abdelnabi, Rakibul Hasan, Mario Fritz
- Abstract summary: A real image is re-purposed to support other narratives by misrepresenting its context and/or elements.
Our goal is an inspectable method that automates this time-consuming and reasoning-intensive process by fact-checking the image-caption pairing against Web evidence.
Our work offers the first step and benchmark for open-domain, content-based, multi-modal fact-checking.
- Score: 70.68526820807402
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Misinformation is now a major problem because of the high risks it
poses to our core democratic and societal values and orders. Out-of-context
misinformation is one of the easiest and most effective ways for adversaries
to spread viral false stories. In this threat, a real image is re-purposed to
support other narratives by misrepresenting its context and/or elements. The
internet is the go-to resource for verifying information across different
sources and modalities. Our goal is an inspectable method that automates this
time-consuming and reasoning-intensive process by fact-checking the
image-caption pairing using Web evidence. To integrate evidence and cues from
both modalities, we introduce the concept of 'multi-modal cycle-consistency
check'; starting from the image/caption, we gather textual/visual evidence,
which will be compared against the other paired caption/image, respectively.
Moreover, we propose a novel architecture, Consistency-Checking Network (CCN),
that mimics the layered human reasoning across the same and different
modalities: the caption vs. textual evidence, the image vs. visual evidence,
and the image vs. caption. Our work offers the first step and benchmark for
open-domain, content-based, multi-modal fact-checking, and significantly
outperforms previous baselines that did not leverage external evidence.
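To make the cycle-consistency idea concrete, the following is a minimal, self-contained Python sketch of the two steps described in the abstract: gather evidence in both directions, then run the layered comparisons the Consistency-Checking Network (CCN) is described as making (caption vs. textual evidence, image vs. visual evidence, image vs. caption). Everything in the sketch is an illustrative assumption rather than the authors' released implementation: the image is stood in for by a short textual description, the toy encoder and stubbed evidence retrieval replace pretrained encoders and real reverse-image/text search APIs, and the thresholded fusion rule stands in for the learned CCN classifier.

```python
# Minimal sketch of the multi-modal cycle-consistency check (assumed helpers,
# NOT the authors' implementation).
import math
from dataclasses import dataclass
from typing import Dict, List


def toy_embed(text: str, dim: int = 64) -> List[float]:
    """Placeholder encoder: hashes character trigrams into a normalized vector."""
    vec = [0.0] * dim
    for i in range(max(len(text) - 2, 0)):
        vec[hash(text[i:i + 3]) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two already-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))


@dataclass
class Evidence:
    texts: List[str]        # textual evidence found by reverse-searching the image
    image_descs: List[str]  # (descriptions of) visual evidence found by searching the caption


def gather_evidence(image_desc: str, caption: str) -> Evidence:
    """Cycle step 1: query the Web in both directions (stubbed here).
    A real system would call reverse-image-search and text-search APIs."""
    return Evidence(
        texts=[f"news page discussing: {image_desc}"],
        image_descs=[f"image retrieved for query: {caption}"],
    )


def consistency_scores(image_desc: str, caption: str, ev: Evidence) -> Dict[str, float]:
    """Cycle step 2, mirroring the CCN's layered comparisons."""
    img_emb, cap_emb = toy_embed(image_desc), toy_embed(caption)
    return {
        "caption_vs_text_evidence": max(cosine(cap_emb, toy_embed(t)) for t in ev.texts),
        "image_vs_visual_evidence": max(cosine(img_emb, toy_embed(d)) for d in ev.image_descs),
        "image_vs_caption": cosine(img_emb, cap_emb),
    }


def flag_out_of_context(scores: Dict[str, float], threshold: float = 0.5) -> bool:
    """Toy fusion rule; the actual CCN learns this decision end-to-end."""
    return sum(scores.values()) / len(scores) < threshold


if __name__ == "__main__":
    caption = "Flood waters submerge downtown streets after the 2019 storm."
    image_desc = "aerial photo of flooded city streets at night"
    ev = gather_evidence(image_desc, caption)
    scores = consistency_scores(image_desc, caption, ev)
    print(scores, "-> out of context?", flag_out_of_context(scores))
```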
Related papers
- "Image, Tell me your story!" Predicting the original meta-context of visual misinformation [70.52796410062876]
We introduce an automated system that grounds images in their original meta-context using the content of the image and textual evidence retrieved from the open web.
Our experiments show promising results while highlighting several open challenges in retrieval and reasoning.
arXiv Detail & Related papers (2024-08-19T12:21:34Z)
- Similarity over Factuality: Are we making progress on multimodal out-of-context misinformation detection? [15.66049149213069]
Out-of-context (OOC) misinformation poses a significant challenge in multimodal fact-checking.
Recent research in evidence-based OOC detection has seen a trend towards increasingly complex architectures.
We introduce a simple yet robust baseline that assesses similarity between image-text pairs and external image and text evidence (a minimal sketch of this kind of similarity scoring appears after this list).
arXiv Detail & Related papers (2024-07-18T13:08:55Z)
- Contextualized Diffusion Models for Text-Guided Image and Video Generation [67.69171154637172]
Conditional diffusion models have exhibited superior performance in high-fidelity text-guided visual generation and editing.
We propose a novel and general contextualized diffusion model (ContextDiff) by incorporating the cross-modal context encompassing interactions and alignments between text condition and visual sample.
We generalize our model to both DDPMs and DDIMs with theoretical derivations, and demonstrate the effectiveness of our model in evaluations with two challenging tasks: text-to-image generation, and text-to-video editing.
arXiv Detail & Related papers (2024-02-26T15:01:16Z)
- Leveraging Open-Vocabulary Diffusion to Camouflaged Instance Segmentation [59.78520153338878]
Text-to-image diffusion techniques have shown exceptional capability of producing high-quality images from text descriptions.
We propose a method built upon a state-of-the-art diffusion model, empowered by open-vocabulary to learn multi-scale textual-visual features for camouflaged object representations.
arXiv Detail & Related papers (2023-12-29T07:59:07Z)
- Support or Refute: Analyzing the Stance of Evidence to Detect Out-of-Context Mis- and Disinformation [13.134162427636356]
Mis- and disinformation online have become a major societal problem.
One common form of mis- and disinformation is out-of-context (OOC) information.
We propose a stance extraction network (SEN) that can extract the stances of different pieces of multi-modal evidence.
arXiv Detail & Related papers (2023-11-03T08:05:54Z)
- A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues [23.743431157431893]
Conditional inference on joint textual and visual clues is a multi-modal reasoning task.
We propose a Multi-modal Context Reasoning approach, named ModCR.
We conduct extensive experiments on two corresponding datasets, and the results show significantly improved performance.
arXiv Detail & Related papers (2023-05-08T08:05:40Z)
- Interpretable Detection of Out-of-Context Misinformation with Neural-Symbolic-Enhanced Large Multimodal Model [16.348950072491697]
Misinformation creators now increasingly use out-of-context multimedia content to deceive the public and fake news detection systems.
This new type of misinformation makes both detection and clarification more difficult, because each individual modality is close enough to true information.
In this paper we explore how to achieve interpretable cross-modal de-contextualization detection that simultaneously identifies the mismatched pairs and the cross-modal contradictions.
arXiv Detail & Related papers (2023-04-15T21:11:55Z)
- Image-Specific Information Suppression and Implicit Local Alignment for Text-based Person Search [61.24539128142504]
Text-based person search (TBPS) is a challenging task that aims to search pedestrian images with the same identity from an image gallery given a query text.
Most existing methods rely on explicitly generated local parts to model fine-grained correspondence between modalities.
We propose an efficient joint Multi-level Alignment Network (MANet) for TBPS, which can learn aligned image/text feature representations between modalities at multiple levels.
arXiv Detail & Related papers (2022-08-30T16:14:18Z)
- Consensus-Aware Visual-Semantic Embedding for Image-Text Matching [69.34076386926984]
Image-text matching plays a central role in bridging vision and language.
Most existing approaches only rely on the image-text instance pair to learn their representations.
We propose a Consensus-aware Visual-Semantic Embedding model to incorporate the consensus information.
arXiv Detail & Related papers (2020-07-17T10:22:57Z)
- Learning Transformation-Aware Embeddings for Image Forensics [15.484408315588569]
Image Provenance Analysis aims at discovering relationships among different manipulated image versions that share content.
One of the main sub-problems for provenance analysis that has not yet been addressed directly is the edit ordering of images that share full content or are near-duplicates.
This paper introduces a novel deep learning-based approach to provide a plausible ordering to images that have been generated from a single image through transformations.
arXiv Detail & Related papers (2020-01-13T22:01:24Z)
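As referenced in the "Similarity over Factuality" entry above, the sketch below shows what a similarity-only out-of-context (OOC) baseline could look like: the pair is scored purely by embedding similarity between the caption and retrieved textual evidence, the image and retrieved visual evidence, and the image and its own caption. The embeddings are assumed to come from some joint image-text encoder, and the weights and threshold are arbitrary illustrative choices, not values from the paper.

```python
# Hypothetical similarity-only OOC baseline; weights, threshold, and the
# assumption of pre-computed joint image-text embeddings are illustrative.
from typing import List, Sequence


def cos(a: Sequence[float], b: Sequence[float]) -> float:
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den if den else 0.0


def best_similarity(query: Sequence[float], evidence: List[Sequence[float]]) -> float:
    """Highest similarity between the query and any piece of retrieved evidence."""
    return max((cos(query, e) for e in evidence), default=0.0)


def ooc_score(img_emb, cap_emb, text_evidence_embs, image_evidence_embs,
              w_pair: float = 0.4, w_text: float = 0.3, w_image: float = 0.3) -> float:
    """Lower combined similarity => more likely the caption is out of context."""
    pair_sim = cos(img_emb, cap_emb)                           # image vs. its own caption
    text_sim = best_similarity(cap_emb, text_evidence_embs)    # caption vs. retrieved text
    image_sim = best_similarity(img_emb, image_evidence_embs)  # image vs. retrieved images
    return 1.0 - (w_pair * pair_sim + w_text * text_sim + w_image * image_sim)


def is_out_of_context(score: float, threshold: float = 0.5) -> bool:
    return score > threshold
```

A caller would embed the image, the caption, and every piece of retrieved evidence with the same encoder (for example a CLIP-style model), then flag the pair when `ooc_score` exceeds a chosen threshold.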
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.