Edited Media Understanding Frames: Reasoning About the Intent and Implications of Visual Misinformation
- URL: http://arxiv.org/abs/2012.04726v2
- Date: Wed, 26 Mar 2025 20:17:54 GMT
- Title: Edited Media Understanding Frames: Reasoning About the Intent and Implications of Visual Misinformation
- Authors: Jeff Da, Maxwell Forbes, Rowan Zellers, Anthony Zheng, Jena D. Hwang, Antoine Bosselut, Yejin Choi
- Abstract summary: Multimodal disinformation, from 'deepfakes' to simple edits that deceive, is an important societal problem. The difference between this example, and harmful edits that spread disinformation, is one of intent. Recognizing and describing this intent is a major challenge for today's AI systems.
- Score: 62.68385635551825
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal disinformation, from 'deepfakes' to simple edits that deceive, is an important societal problem. Yet at the same time, the vast majority of media edits are harmless -- such as a filtered vacation photo. The difference between this example, and harmful edits that spread disinformation, is one of intent. Recognizing and describing this intent is a major challenge for today's AI systems. We present the task of Edited Media Understanding, requiring models to answer open-ended questions that capture the intent and implications of an image edit. We introduce a dataset for our task, EMU, with 48k question-answer pairs written in rich natural language. We evaluate a wide variety of vision-and-language models for our task, and introduce a new model PELICAN, which builds upon recent progress in pretrained multimodal representations. Our model obtains promising results on our dataset, with humans rating its answers as accurate 40.35% of the time. At the same time, there is still much work to be done -- humans prefer human-annotated captions 93.56% of the time -- and we provide analysis that highlights areas for further progress.
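To make the task format concrete, below is a minimal, hypothetical sketch of how a single EMU-style question-answer pair and the human-preference metric reported above might be represented in code. The field names, file paths, and example judgments are illustrative assumptions, not the released dataset schema or the paper's evaluation code.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical record layout for one EMU-style question-answer pair.
# Field names are illustrative and do not reflect the released dataset schema.
@dataclass
class EmuExample:
    source_image: str      # path or URL of the unedited image
    edited_image: str      # path or URL of the edited image
    question: str          # open-ended question about the edit's intent/implications
    reference_answer: str  # free-form human-written answer


def preference_rate(judgments: List[bool]) -> float:
    """Fraction of pairwise comparisons in which raters preferred
    the human-annotated answer over the model's answer."""
    return sum(judgments) / len(judgments) if judgments else 0.0


if __name__ == "__main__":
    example = EmuExample(
        source_image="original.jpg",
        edited_image="edited.jpg",
        question="Why might someone have made this edit?",
        reference_answer="To make the subject appear to be at an event they did not attend.",
    )
    print(example.question)
    # Illustrative judgments reproducing the reported 93.56% human preference rate.
    print(f"human preference rate: {preference_rate([True] * 9356 + [False] * 644):.2%}")
```

Keeping each example as a self-contained record like this makes the open-ended, free-text nature of the task explicit, in contrast to classification-style misinformation benchmarks.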
Related papers
- HAUR: Human Annotation Understanding and Recognition Through Text-Heavy Images [4.468589513127865]
Vision Question Answering (VQA) tasks use images to convey critical information to answer text-based questions.
Our dataset and model will be released soon.
arXiv Detail & Related papers (2024-12-24T10:25:41Z) - HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based Image Editing [93.06156989757994]
HumanEdit comprises 5,751 images and requires more than 2,500 hours of human effort across four stages.
The dataset includes six distinct types of editing instructions: Action, Add, Counting, Relation, Remove, and Replace.
HumanEdit offers comprehensive diversity and high-resolution $1024 \times 1024$ content sourced from various domains.
arXiv Detail & Related papers (2024-12-05T16:00:59Z) - OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision [32.33777277141083]
We present OmniEdit, an omnipotent editor that handles seven different image editing tasks with any aspect ratio seamlessly.
OmniEdit is trained by utilizing supervision from seven different specialist models to ensure task coverage.
We provide images with different aspect ratios to ensure that our model can handle any image in the wild.
arXiv Detail & Related papers (2024-11-11T18:21:43Z) - Learning Action and Reasoning-Centric Image Editing from Videos and Simulations [45.637947364341436]
The AURORA dataset is a collection of high-quality training data, human-annotated and curated from videos and simulation engines.
We evaluate an AURORA-finetuned model on a new expert-curated benchmark covering 8 diverse editing tasks.
Our model significantly outperforms previous editing models as judged by human raters.
arXiv Detail & Related papers (2024-07-03T19:36:33Z) - The Change You Want to See (Now in 3D) [65.61789642291636]
The goal of this paper is to detect what has changed, if anything, between two "in the wild" images of the same 3D scene.
We contribute a change detection model that is trained entirely on synthetic data and is class-agnostic.
We release a new evaluation dataset consisting of real-world image pairs with human-annotated differences.
arXiv Detail & Related papers (2023-08-21T01:59:45Z) - MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing [48.204992417461575]
We introduce MagicBrush, the first large-scale, manually annotated dataset for instruction-guided real image editing.
We show that the new model can produce much better images according to human evaluation.
arXiv Detail & Related papers (2023-06-16T17:58:58Z) - Interpretable Detection of Out-of-Context Misinformation with Neural-Symbolic-Enhanced Large Multimodal Model [16.348950072491697]
Misinformation creators increasingly tend to use out-of-context multimedia content to deceive the public and fake news detection systems.
This new type of misinformation increases the difficulty of not only detection but also clarification, because each individual modality, taken alone, is close to true information.
In this paper we explore how to achieve interpretable cross-modal de-contextualization detection that simultaneously identifies the mismatched pairs and the cross-modal contradictions.
arXiv Detail & Related papers (2023-04-15T21:11:55Z) - On Advances in Text Generation from Images Beyond Captioning: A Case
Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z) - MM-Claims: A Dataset for Multimodal Claim Detection in Social Media [7.388174516838141]
We introduce a novel dataset, MM-Claims, which consists of tweets and corresponding images over three topics: COVID-19, Climate Change and broadly Technology.
We describe the dataset in detail, evaluate strong unimodal and multimodal baselines, and analyze the potential and drawbacks of current models.
arXiv Detail & Related papers (2022-05-04T10:43:58Z) - Vision Models Are More Robust And Fair When Pretrained On Uncurated
Images Without Supervision [38.22842778742829]
Discriminative self-supervised learning allows training models on any random group of internet images.
We train models on billions of random images without any data pre-processing or prior assumptions about what we want the model to learn.
We extensively study and validate our model performance on over 50 benchmarks including fairness, robustness to distribution shift, geographical diversity, fine-grained recognition, image copy detection, and many image classification datasets.
arXiv Detail & Related papers (2022-02-16T22:26:47Z) - Learning by Planning: Language-Guided Global Image Editing [53.72807421111136]
We develop a text-to-operation model to map the vague editing language request into a series of editing operations.
The only supervision in the task is the target image, which is insufficient for stable training of sequential decisions.
We propose a novel operation planning algorithm to generate possible editing sequences from the target image as pseudo ground truth.
arXiv Detail & Related papers (2021-06-24T16:30:03Z) - Text as Neural Operator: Image Manipulation by Text Instruction [68.53181621741632]
In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects.
The inputs of the task are multimodal including (1) a reference image and (2) an instruction in natural language that describes desired modifications to the image.
We show that the proposed model performs favorably against recent strong baselines on three public datasets.
arXiv Detail & Related papers (2020-08-11T07:07:10Z) - Dense-Caption Matching and Frame-Selection Gating for Temporal
Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self-/cross-integration for different sources (video and dense captions), and gates that pass more relevant information.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.