NewsCLIPpings: Automatic Generation of Out-of-Context Multimodal Media
- URL: http://arxiv.org/abs/2104.05893v1
- Date: Tue, 13 Apr 2021 01:53:26 GMT
- Title: NewsCLIPpings: Automatic Generation of Out-of-Context Multimodal Media
- Authors: Grace Luo, Trevor Darrell, Anna Rohrbach
- Abstract summary: We propose a dataset where both image and text are unmanipulated but mismatched.
We introduce several strategies for automatic retrieval of suitable images for the given captions.
Our large-scale automatically generated NewsCLIPpings dataset requires models to jointly analyze both modalities.
- Score: 93.51739200834837
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The threat of online misinformation is hard to overestimate, with adversaries
relying on a range of tools, from cheap fakes to sophisticated deep fakes. We
are motivated by a threat scenario where an image is being used out of context
to support a certain narrative expressed in a caption. While some prior
datasets for detecting image-text inconsistency can be solved with blind models
due to linguistic cues introduced by text manipulation, we propose a dataset
where both image and text are unmanipulated but mismatched. We introduce
several strategies for automatic retrieval of suitable images for the given
captions, capturing cases with related semantics but inconsistent entities as
well as matching entities but inconsistent semantic context. Our large-scale
automatically generated NewsCLIPpings Dataset requires models to jointly
analyze both modalities and to reason about entity mismatch as well as semantic
mismatch between text and images in news media.
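As a concrete illustration of the retrieval idea, here is a minimal sketch of generating one out-of-context pair: given a caption, retrieve the most semantically similar image drawn from a different news story using off-the-shelf CLIP embeddings. The checkpoint and the `corpus` record layout are illustrative assumptions, not the paper's exact pipeline.

```python
# A minimal sketch of retrieval-based out-of-context pair generation in the
# spirit of NewsCLIPpings: for a caption, retrieve the most caption-similar
# image that belongs to a *different* news story.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_caption(caption: str) -> torch.Tensor:
    inputs = processor(text=[caption], return_tensors="pt", truncation=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def embed_images(paths: list[str]) -> torch.Tensor:
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def retrieve_mismatch(caption: str, story_id: str, corpus: list[dict]) -> dict:
    """Pick the caption's nearest image that belongs to a different story."""
    candidates = [c for c in corpus if c["story_id"] != story_id]
    image_feats = embed_images([c["image_path"] for c in candidates])
    sims = (embed_caption(caption) @ image_feats.T).squeeze(0)
    return candidates[int(sims.argmax())]
```

The other mismatch flavors the abstract mentions (matching entities but inconsistent semantic context) would swap in different query embeddings, e.g., caption-to-caption similarity or entity-aware matching.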
Related papers
- Exposing Text-Image Inconsistency Using Diffusion Models [36.820267498751626]
A growing problem is text-image inconsistency, where images are misleadingly paired with texts that carry a different intent or meaning.
This study introduces D-TIIL, which employs text-to-image diffusion models to localize semantic inconsistencies in text and image pairs.
D-TIIL offers a scalable and evidence-based approach to identifying and localizing text-image inconsistency.
arXiv Detail & Related papers (2024-04-28T00:29:24Z)
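D-TIIL itself optimizes latent embeddings inside the diffusion model, which is not reproduced here; the sketch below captures only the coarse intuition of the D-TIIL entry above: regenerate the image under the caption's guidance, then read large pixel differences as candidate inconsistent regions. The pipeline, checkpoint, and `strength` value are illustrative assumptions.

```python
# A loose sketch of the intuition only (not D-TIIL's actual algorithm):
# an image-to-image diffusion pass conditioned on the caption, followed by
# a pixel-difference map against the original image.
import numpy as np
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def inconsistency_map(image_path: str, caption: str, strength: float = 0.5) -> np.ndarray:
    original = Image.open(image_path).convert("RGB").resize((512, 512))
    edited = pipe(prompt=caption, image=original, strength=strength).images[0]
    # Regions the caption "pulls away" from the original score high.
    diff = np.abs(
        np.asarray(edited, dtype=np.float32) - np.asarray(original, dtype=np.float32)
    )
    return diff.mean(axis=-1) / 255.0  # (512, 512) map in [0, 1]
```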
- TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification [59.779532652634295]
We propose an embarrassingly simple approach to better align image and text features, with no need for data formats beyond image-text pairs.
We parse objects and attributes from the description, which are highly likely to exist in the image.
Experiments substantiate an average 5.2% improvement of our framework over existing alternatives.
arXiv Detail & Related papers (2023-12-21T18:59:06Z)
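The tag-supervision idea in the TagAlign entry above can be pictured with a small sketch: pull likely-visible objects (nouns) and attributes (adjectives) out of a caption and turn them into a multi-hot classification target. spaCy and the vocabulary lookup are illustrative stand-ins, not TagAlign's actual parser.

```python
# A minimal sketch of caption-to-tag parsing for multi-tag classification.
import spacy
import torch

nlp = spacy.load("en_core_web_sm")

def parse_tags(caption: str) -> list[str]:
    doc = nlp(caption)
    return sorted({tok.lemma_.lower() for tok in doc if tok.pos_ in ("NOUN", "ADJ")})

def multi_hot(tags: list[str], vocab: dict[str, int]) -> torch.Tensor:
    target = torch.zeros(len(vocab))
    for tag in tags:
        if tag in vocab:
            target[vocab[tag]] = 1.0
    return target  # supervision for a multi-tag classification head

print(parse_tags("A red double-decker bus passes a stone cathedral."))
# e.g. ['bus', 'cathedral', 'red', 'stone'] — exact POS tags may vary
```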
- What You See is What You Read? Improving Text-Image Alignment Evaluation [28.722369586165108]
We study methods for automatic text-image alignment evaluation.
We first introduce SeeTRUE, spanning multiple datasets from both text-to-image and image-to-text generation tasks.
We describe two automatic methods to determine alignment: the first involving a pipeline based on question generation and visual question answering models, and the second employing an end-to-end classification approach by finetuning multimodal pretrained models.
arXiv Detail & Related papers (2023-05-17T17:43:38Z)
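Of the two methods in the entry above, the end-to-end classification route can be approximated with a pretrained image-text matching (ITM) head; the sketch below uses BLIP's public ITM checkpoint as a stand-in, whereas the paper finetunes its own multimodal models.

```python
# A minimal sketch of end-to-end alignment scoring with a pretrained ITM head.
import torch
from PIL import Image
from transformers import BlipForImageTextRetrieval, BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-itm-base-coco")
model = BlipForImageTextRetrieval.from_pretrained("Salesforce/blip-itm-base-coco")

def alignment_score(image_path: str, text: str) -> float:
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, text=text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, use_itm_head=True)
    # itm_score holds two logits; index 1 is conventionally the "match" class.
    return torch.softmax(outputs.itm_score, dim=-1)[0, 1].item()
```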
- Learning to Model Multimodal Semantic Alignment for Story Visualization [58.16484259508973]
Story visualization aims to generate a sequence of images to narrate each sentence in a multi-sentence story.
Current works face the problem of semantic misalignment because of their fixed architectures and the diversity of input modalities.
We explore the semantic alignment between text and image representations by learning to match their semantic levels in the GAN-based generative model.
arXiv Detail & Related papers (2022-11-14T11:41:44Z)
- Image-Specific Information Suppression and Implicit Local Alignment for Text-based Person Search [61.24539128142504]
Text-based person search (TBPS) is a challenging task that aims to retrieve pedestrian images of the same identity from an image gallery given a query text.
Most existing methods rely on explicitly generated local parts to model fine-grained correspondence between modalities.
We propose an efficient joint Multi-level Alignment Network (MANet) for TBPS, which can learn aligned image/text feature representations between modalities at multiple levels.
arXiv Detail & Related papers (2022-08-30T16:14:18Z)
- Revising Image-Text Retrieval via Multi-Modal Entailment [25.988058843564335]
The many-to-many matching phenomenon is quite common in widely used image-text retrieval datasets.
We propose a multi-modal entailment classifier to determine whether a sentence is entailed by an image plus its linked captions.
arXiv Detail & Related papers (2022-08-22T07:58:54Z)
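A minimal sketch of what the multi-modal entailment classifier from the entry above might look like, assuming frozen encoder features (e.g., CLIP) for the image, its linked captions, and the candidate sentence; the fusion MLP and binary label set are assumptions, not the paper's architecture.

```python
# A minimal sketch of a fusion head for multi-modal entailment classification.
import torch
import torch.nn as nn

class EntailmentHead(nn.Module):
    def __init__(self, dim: int = 512, num_labels: int = 2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim * 3, 256),
            nn.ReLU(),
            nn.Linear(256, num_labels),
        )

    def forward(self, img_feat, linked_caption_feat, sentence_feat):
        # Premise = image + its linked captions; hypothesis = candidate sentence.
        fused = torch.cat([img_feat, linked_caption_feat, sentence_feat], dim=-1)
        return self.mlp(fused)  # logits: [not entailed, entailed]
```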
- BOSS: Bottom-up Cross-modal Semantic Composition with Hybrid Counterfactual Training for Robust Content-based Image Retrieval [61.803481264081036]
Content-Based Image Retrieval (CIR) aims to search for a target image by concurrently comprehending the composition of an example image and a complementary text.
We tackle this task with a novel Bottom-up Cross-modal Semantic Composition (BOSS) with Hybrid Counterfactual Training framework.
arXiv Detail & Related papers (2022-07-09T07:14:44Z)
- Catching Out-of-Context Misinformation with Self-supervised Learning [2.435006380732194]
We propose a new method that automatically detects out-of-context image and text pairs.
Our core idea is a self-supervised training strategy where we only need images with matching captions from different sources.
Our method achieves 82% out-of-context detection accuracy.
arXiv Detail & Related papers (2021-01-15T19:00:42Z)
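The self-supervised signal from the out-of-context paper above can be sketched as in-batch matching: captions that genuinely accompany an image (possibly from different sources) are positives, and captions belonging to other images in the batch serve as mismatched negatives. The CLIP-style `encode_image`/`encode_text` interface below is an assumed stand-in, not the paper's architecture.

```python
# A minimal sketch of self-supervised matching with in-batch negatives.
import torch
import torch.nn.functional as F

def ooc_training_step(model, images, captions):
    """images: batched tensors; captions: tokenized matching captions."""
    img_emb = F.normalize(model.encode_image(images), dim=-1)
    txt_emb = F.normalize(model.encode_text(captions), dim=-1)
    logits = img_emb @ txt_emb.T              # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))    # diagonal = true pairs
    # Off-diagonal (mismatched) pairs act as in-batch negatives.
    return F.cross_entropy(logits, targets)
```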
- Expressing Objects just like Words: Recurrent Visual Embedding for Image-Text Matching [102.62343739435289]
Existing image-text matching approaches infer the similarity of an image-text pair by capturing and aggregating the affinities between the text and each independent object of the image.
We propose a Dual Path Recurrent Neural Network (DP-RNN) which processes images and sentences symmetrically with recurrent neural networks (RNNs).
Our model achieves state-of-the-art performance on the Flickr30K dataset and competitive performance on the MS-COCO dataset.
arXiv Detail & Related papers (2020-02-20T00:51:01Z)
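The "objects as words" idea behind DP-RNN can be pictured as running per-object detector features through an RNN, mirroring how word embeddings are fed to a text RNN. The feature dimension, GRU choice, and mean pooling below are illustrative assumptions, not DP-RNN's exact design.

```python
# A minimal sketch of encoding image objects as a word-like sequence.
import torch
import torch.nn as nn

class ObjectSequenceEncoder(nn.Module):
    def __init__(self, obj_dim: int = 2048, hidden: int = 512):
        super().__init__()
        self.rnn = nn.GRU(obj_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, object_feats: torch.Tensor) -> torch.Tensor:
        # object_feats: (batch, num_objects, obj_dim), e.g. detector outputs
        states, _ = self.rnn(object_feats)
        return states.mean(dim=1)  # pooled image representation for matching
```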