COVE: COntext and VEracity prediction for out-of-context images
- URL: http://arxiv.org/abs/2502.01194v1
- Date: Mon, 03 Feb 2025 09:33:58 GMT
- Title: COVE: COntext and VEracity prediction for out-of-context images
- Authors: Jonathan Tonglet, Gabriel Thiem, Iryna Gurevych
- Abstract summary: We introduce COVE, a new method that predicts the true COntext of the image and then uses it to predict the VEracity of the caption.
COVE beats the SOTA context prediction model on all context items, often by more than five percentage points.
It is competitive with the best veracity prediction models on synthetic data and outperforms them on real-world data.
- Score: 49.1574468325115
- Abstract: Images taken out of their context are the most prevalent form of multimodal misinformation. Debunking them requires (1) providing the true context of the image and (2) checking the veracity of the image's caption. However, existing automated fact-checking methods fail to tackle both objectives explicitly. In this work, we introduce COVE, a new method that first predicts the true COntext of the image and then uses it to predict the VEracity of the caption. COVE beats the SOTA context prediction model on all context items, often by more than five percentage points. It is competitive with the best veracity prediction models on synthetic data and outperforms them on real-world data, showing that it is beneficial to combine the two tasks sequentially. Finally, we conduct a human study that reveals that the predicted context is a reusable and interpretable artifact to verify new out-of-context captions for the same image. Our code and data are made available.
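A minimal sketch of the two-step idea described in the abstract, in Python. The context schema (date, location, source, people, event) and the caller-supplied `ask_vlm` / `ask_llm` callables are illustrative assumptions, not the paper's actual interface:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical context schema; the actual COVE context items may differ.
@dataclass
class ImageContext:
    date: str
    location: str
    source: str
    people: str
    event: str

def predict_context(image_path: str, ask_vlm: Callable[[str, str], str]) -> ImageContext:
    """Step 1: predict the true context of the image.

    `ask_vlm` is any caller-supplied function taking (image_path, question)
    and returning a string answer, e.g. a wrapper around a vision-language
    model plus reverse image search evidence.
    """
    return ImageContext(
        date=ask_vlm(image_path, "When was this photo taken?"),
        location=ask_vlm(image_path, "Where was this photo taken?"),
        source=ask_vlm(image_path, "Who published this photo first?"),
        people=ask_vlm(image_path, "Who is shown in this photo?"),
        event=ask_vlm(image_path, "What event does this photo show?"),
    )

def predict_veracity(caption: str, context: ImageContext,
                     ask_llm: Callable[[str], str]) -> bool:
    """Step 2: judge the caption against the predicted context."""
    prompt = (
        f"Predicted image context: {context}\n"
        f"Caption under review: {caption}\n"
        "Is the caption consistent with the context? "
        "Answer 'accurate' or 'out-of-context'."
    )
    return ask_llm(prompt).strip().lower().startswith("accurate")
```

Because step 2 only consumes the structured context, the same predicted context can be reused to check new captions for the same image, which matches the reusability finding reported in the abstract.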
Related papers
- Image2Text2Image: A Novel Framework for Label-Free Evaluation of Image-to-Text Generation with Text-to-Image Diffusion Models [16.00576040281808]
We propose a novel framework called Image2Text2Image to evaluate image captioning models.
A high similarity score suggests that the model has produced a faithful textual description, while a low score highlights discrepancies.
Our framework does not rely on human-annotated reference captions, making it a valuable tool for assessing image captioning models.
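The evaluation idea lends itself to a short sketch. The models below (Stable Diffusion for regeneration, CLIP for image-image similarity) are stand-ins chosen for illustration; the paper's exact model choices and metric may differ:

```python
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

def caption_faithfulness(original_image: Image.Image, caption: str) -> float:
    """Regenerate an image from the caption and score its similarity to the
    original; higher scores suggest a more faithful caption (illustrative)."""
    # Text-to-image model used as a stand-in for the paper's diffusion model.
    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
    regenerated = pipe(caption).images[0]

    # Compare the two images in CLIP embedding space.
    clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    inputs = processor(images=[original_image, regenerated], return_tensors="pt")
    with torch.no_grad():
        feats = clip.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return float(feats[0] @ feats[1])  # cosine similarity in [-1, 1]
```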
arXiv Detail & Related papers (2024-11-08T17:07:01Z)
- Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding.
arXiv Detail & Related papers (2023-07-21T13:06:02Z)
- Improving Multimodal Datasets with Image Captioning [65.74736570293622]
We study how generated captions can increase the utility of web-scraped datapoints with nondescript text.
Our experiments with using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text.
arXiv Detail & Related papers (2023-07-19T17:47:12Z)
- IMAGINATOR: Pre-Trained Image+Text Joint Embeddings using Word-Level Grounding of Images [2.9174297412129957]
We introduce a pre-trained joint embedding (JE), named IMAGINATOR, trained at the object level on 21K distinct image objects from 1M image+text pairs.
IMAGINATOR encapsulates three individual representations: (i) object-object co-location, (ii) word-object co-location, and (iii) word-object correlation.
We also evaluate pre-trained IMAGINATOR JEs on three downstream tasks: (i) image captioning, (ii) Image2Tweet, and (iii) text-based image retrieval.
arXiv Detail & Related papers (2023-05-12T05:34:52Z)
- Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning [88.5382122413913]
We study whether language supervision can result in vision models with more transferable representations than traditional image-only methods.
We find that image-only methods do not match CLIP's transfer performance, even when they are trained with more image data.
Motivated by our findings, we devise simple prescriptions to enable CLIP to better leverage the language information present in existing pre-training datasets.
arXiv Detail & Related papers (2022-07-15T17:50:51Z)
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [53.170767750244366]
Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding.
To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models.
arXiv Detail & Related papers (2022-05-23T17:42:53Z)
- OptGAN: Optimizing and Interpreting the Latent Space of the Conditional Text-to-Image GANs [8.26410341981427]
We study how to ensure that generated samples are believable, realistic or natural.
We present a novel algorithm which identifies semantically-understandable directions in the latent space of a conditional text-to-image GAN architecture.
arXiv Detail & Related papers (2022-02-25T20:00:33Z)
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [91.56988987393483]
We present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models.
Our method is model-agnostic, which can be applied to arbitrary dense prediction systems and various pre-trained visual backbones.
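A minimal sketch of the pixel-text matching step, assuming per-pixel visual embeddings and one text embedding per class prompt; shapes and names are illustrative, not DenseCLIP's actual implementation:

```python
import torch
import torch.nn.functional as F

def pixel_text_score_maps(pixel_feats: torch.Tensor,
                          text_embeds: torch.Tensor) -> torch.Tensor:
    """Compute pixel-text score maps from dense visual features.

    pixel_feats: [B, C, H, W] per-pixel visual embeddings (e.g. from CLIP's
                 image encoder before pooling).
    text_embeds: [K, C] one text embedding per class prompt.
    Returns:     [B, K, H, W] cosine-similarity score maps that can guide a
                 dense prediction head (segmentation, detection, etc.).
    """
    pixel_feats = F.normalize(pixel_feats, dim=1)   # normalize channel dim
    text_embeds = F.normalize(text_embeds, dim=-1)
    # Dot product between every pixel embedding and every class embedding.
    return torch.einsum("bchw,kc->bkhw", pixel_feats, text_embeds)

# Example: 2 images, 512-d features on a 14x14 grid, 3 class prompts.
scores = pixel_text_score_maps(torch.randn(2, 512, 14, 14), torch.randn(3, 512))
print(scores.shape)  # torch.Size([2, 3, 14, 14])
```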
arXiv Detail & Related papers (2021-12-02T18:59:32Z)
- PREGAN: Pose Randomization and Estimation for Weakly Paired Image Style Translation [11.623477199795037]
We propose a weakly-paired setting for style translation, where the content in the two images is aligned but with errors in poses.
PREGAN is validated on both simulated and real-world collected data to show its effectiveness.
arXiv Detail & Related papers (2020-10-31T16:11:11Z)