Who's Waldo? Linking People Across Text and Images
- URL: http://arxiv.org/abs/2108.07253v2
- Date: Tue, 17 Aug 2021 15:55:12 GMT
- Title: Who's Waldo? Linking People Across Text and Images
- Authors: Claire Yuqing Cui, Apoorv Khandelwal, Yoav Artzi, Noah Snavely, Hadar Averbuch-Elor
- Abstract summary: We present a task and benchmark dataset for person-centric visual grounding.
Our new task masks out the names of people in captions in order to encourage methods trained on such image-caption pairs to focus on contextual cues.
We propose a Transformer-based method that outperforms several strong baselines on this task.
- Score: 56.40556801773923
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a task and benchmark dataset for person-centric visual grounding,
the problem of linking between people named in a caption and people pictured in
an image. In contrast to prior work in visual grounding, which is predominantly
object-based, our new task masks out the names of people in captions in order
to encourage methods trained on such image-caption pairs to focus on contextual
cues (such as rich interactions between multiple people), rather than learning
associations between names and appearances. To facilitate this task, we
introduce a new dataset, Who's Waldo, mined automatically from image-caption
data on Wikimedia Commons. We propose a Transformer-based method that
outperforms several strong baselines on this task, and are releasing our data
to the research community to spur work on contextual models that consider both
vision and language.
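The core preprocessing idea is easy to illustrate: names in a caption are replaced with placeholder tokens, so a model must link each placeholder to a pictured person using context alone. Below is a minimal sketch of that idea; the placeholder token, helper name, and example caption are illustrative and not the dataset's actual conventions.

```python
import re

# Minimal sketch (illustrative, not the dataset's actual tooling): mask person
# names in a caption so a model must rely on contextual cues such as roles and
# interactions, rather than memorized name-appearance associations.
def mask_names(caption, names, placeholder="[NAME]"):
    """Replace each listed name span in the caption with a placeholder token."""
    masked = caption
    for name in names:
        masked = re.sub(re.escape(name), placeholder, masked)
    return masked

caption = "Alice Smith hands the trophy to Bob Jones after the final match."
print(mask_names(caption, ["Alice Smith", "Bob Jones"]))
# -> "[NAME] hands the trophy to [NAME] after the final match."
```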
Related papers
- Stellar: Systematic Evaluation of Human-Centric Personalized Text-to-Image Methods [52.806258774051216]
We focus on text-to-image systems that take as input a single image of an individual and ground the generation process on it, along with text describing the desired visual context.
We introduce a standardized dataset (Stellar) of personalized prompts coupled with images of individuals; it is an order of magnitude larger than existing relevant datasets and provides rich semantic ground-truth annotations.
We derive a simple yet efficient personalized text-to-image baseline that does not require test-time fine-tuning for each subject and sets a new state of the art, both quantitatively and in human trials.
arXiv Detail & Related papers (2023-12-11T04:47:39Z)
- Visually-Aware Context Modeling for News Image Captioning [54.31708859631821]
News Image Captioning aims to create captions from news articles and images.
We propose a face-naming module for learning better name embeddings.
We use CLIP to retrieve sentences that are semantically close to the image.
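As a rough sketch of the retrieval step mentioned above (my own illustration using the off-the-shelf Hugging Face CLIP API, not the authors' pipeline; the image path and sentences are placeholders), one can score each candidate sentence against the image and keep the highest-scoring ones:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Sketch: rank article sentences by CLIP similarity to a news image.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("news_photo.jpg")  # placeholder path
sentences = [
    "The prime minister addressed the press outside parliament.",
    "Stock markets closed higher on Friday.",
    "Protesters gathered in the city square holding banners.",
]

inputs = processor(text=sentences, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    sims = model(**inputs).logits_per_image.squeeze(0)  # one score per sentence

top = sims.argsort(descending=True)
print([sentences[int(i)] for i in top[:2]])  # sentences most relevant to the image
```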
arXiv Detail & Related papers (2023-08-16T12:39:39Z)
- Learning Human-Human Interactions in Images from Weak Textual Supervision [15.71242350678679]
We propose a new paradigm of learning human-human interactions as free text from a single still image.
We show that the pseudo-labels produced by this procedure can be used to train a captioning model.
We will release our code and pseudo-labels along with Waldo and Wenda, a manually-curated test set for still image human-human interaction understanding.
arXiv Detail & Related papers (2023-04-27T11:32:48Z)
- Borrowing Human Senses: Comment-Aware Self-Training for Social Media Multimodal Classification [5.960550152906609]
We capture hinting features from user comments, which are retrieved by jointly leveraging visual and linguistic similarity.
The classification tasks are explored via self-training in a teacher-student framework, motivated by the typically limited amount of labeled data.
The results show that our method further advances the performance of previous state-of-the-art models.
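The teacher-student self-training recipe mentioned above can be sketched in a few lines of PyTorch. This is a generic illustration under stated assumptions (random toy features, simple MLP classifiers, and a fixed 0.8 confidence threshold), none of which come from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Generic teacher-student self-training sketch on toy data (not the paper's code).
torch.manual_seed(0)
num_classes, dim = 3, 16
labeled_x, labeled_y = torch.randn(32, dim), torch.randint(0, num_classes, (32,))
unlabeled_x = torch.randn(128, dim)  # stand-in for unlabeled multimodal posts

def make_model():
    return nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, num_classes))

def train(model, x, y, epochs=50):
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(epochs):
        opt.zero_grad()
        F.cross_entropy(model(x), y).backward()
        opt.step()

# 1) Train a teacher on the small labeled set.
teacher = make_model()
train(teacher, labeled_x, labeled_y)

# 2) Pseudo-label unlabeled data, keeping only confident predictions
#    (the 0.8 threshold is an arbitrary choice for this sketch).
with torch.no_grad():
    probs = F.softmax(teacher(unlabeled_x), dim=-1)
conf, pseudo_y = probs.max(dim=-1)
mask = conf > 0.8

# 3) Train a student on labeled plus confidently pseudo-labeled samples.
student = make_model()
train(student,
      torch.cat([labeled_x, unlabeled_x[mask]]),
      torch.cat([labeled_y, pseudo_y[mask]]))
```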
arXiv Detail & Related papers (2023-03-27T08:59:55Z)
- NewsStories: Illustrating articles with visual summaries [49.924916589209374]
We introduce a large-scale multimodal dataset containing over 31M articles, 22M images and 1M videos.
We show that state-of-the-art image-text alignment methods are not robust to longer narratives with multiple images.
We introduce an intuitive baseline that outperforms these methods on zero-shot image-set retrieval by 10% on the GoodNews dataset.
arXiv Detail & Related papers (2022-07-26T17:34:11Z)
- To Find Waldo You Need Contextual Cues: Debiasing Who's Waldo [53.370023611101175]
We present a debiased dataset for the Person-centric Visual Grounding (PCVG) task first proposed by Cui et al.
Given an image and a caption, PCVG requires pairing up a person's name mentioned in a caption with a bounding box that points to the person in the image.
We find that the original Who's Waldo dataset contains a large number of biased samples that are solvable simply by heuristic methods.
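To illustrate the kind of bias being described (this is my own toy example; the specific heuristics analyzed in the paper may differ), an image-agnostic rule such as "link the first mentioned name to the largest person box" can already resolve many biased samples:

```python
# Toy illustration of an image-agnostic heuristic that biased samples can reward
# (my example; the heuristics examined in the paper may differ): link the first
# masked name to the largest detected person box, ignoring all other context.
def largest_box_heuristic(name_slots, boxes):
    """name_slots: masked name tokens in caption order; boxes: (x1, y1, x2, y2)."""
    if not name_slots or not boxes:
        return {}
    areas = [(x2 - x1) * (y2 - y1) for (x1, y1, x2, y2) in boxes]
    largest = max(range(len(boxes)), key=lambda i: areas[i])
    return {name_slots[0]: largest}

print(largest_box_heuristic(["[NAME_1]", "[NAME_2]"], [(0, 0, 50, 80), (60, 10, 200, 220)]))
# -> {'[NAME_1]': 1}
```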
arXiv Detail & Related papers (2022-03-30T21:35:53Z)
- Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks [207.52609682812147]
We propose a new learning method, Oscar (Object-Semantics Aligned Pre-training).
It uses object tags detected in images as anchor points to significantly ease the learning of alignments.
We pre-train an Oscar model on the public corpus of 6.5 million text-image pairs, and fine-tune it on downstream tasks.
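A rough sketch of how an Oscar-style input triple of caption words, object tags, and region features might be assembled (my illustration using a standard BERT tokenizer and random placeholder region features, not the authors' released code):

```python
import torch
from transformers import BertTokenizer

# Sketch: Oscar consumes a triple of (caption words, object tags, region
# features); detected tags act as anchor points linking text and image.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

caption = "a dog catches a frisbee in the park"
object_tags = ["dog", "frisbee", "grass"]               # hypothetical detector output
region_features = torch.randn(len(object_tags), 2054)   # placeholder region features

# Caption and tags share one text sequence, separated by [SEP].
text_input = tokenizer(caption, " ".join(object_tags), return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(text_input["input_ids"][0].tolist()))
# A full model would embed this sequence, project the region features, and
# feed the concatenation to a multi-layer Transformer for pre-training.
```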
arXiv Detail & Related papers (2020-04-13T19:18:10Z)
- Captioning Images Taken by People Who Are Blind [25.263950448575923]
VizWiz-Captions consists of over 39,000 images taken by people who are blind, each paired with five captions.
We analyze this dataset to (1) characterize the typical captions, (2) characterize the diversity of content found in the images, and (3) compare its content to that found in eight popular vision datasets.
arXiv Detail & Related papers (2020-02-20T04:36:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.