The Abduction of Sherlock Holmes: A Dataset for Visual Abductive
Reasoning
- URL: http://arxiv.org/abs/2202.04800v1
- Date: Thu, 10 Feb 2022 02:26:45 GMT
- Title: The Abduction of Sherlock Holmes: A Dataset for Visual Abductive
Reasoning
- Authors: Jack Hessel and Jena D. Hwang and Jae Sung Park and Rowan Zellers and
Chandra Bhagavatula and Anna Rohrbach and Kate Saenko and Yejin Choi
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans have a remarkable capacity to reason abductively and
hypothesize about what lies beyond the literal content of an image. By identifying concrete
visual clues scattered throughout a scene, we almost can't help but draw
probable inferences beyond the literal scene based on our everyday experience
and knowledge about the world. For example, if we see a "20 mph" sign alongside
a road, we might assume the street sits in a residential area (rather than on a
highway), even if no houses are pictured. Can machines perform similar visual
reasoning?
We present Sherlock, an annotated corpus of 103K images for testing machine
capacity for abductive reasoning beyond literal image contents. We adopt a
free-viewing paradigm: participants first observe and identify salient clues
within images (e.g., objects, actions) and then provide a plausible inference
about the scene, given the clue. In total, we collect 363K (clue, inference)
pairs, which form a first-of-its-kind abductive visual reasoning dataset. Using
our corpus, we test three complementary axes of abductive reasoning. We
evaluate the capacity of models to: i) retrieve relevant inferences from a
large candidate corpus; ii) localize evidence for inferences via bounding
boxes; and iii) compare plausible inferences to match human judgments on a
newly collected diagnostic corpus of 19K Likert-scale judgments. While we find
that fine-tuning CLIP-RN50x64 with a multitask objective outperforms strong
baselines, significant headroom exists between model performance and human
agreement. We provide analysis that points towards future work.
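The first evaluation axis, retrieval, can be sketched as ranking a corpus of candidate inference texts against a clue by similarity in a shared embedding space. The snippet below is a minimal illustration, not the authors' code: the toy 2-d vectors stand in for CLIP-style embeddings, and the function names (`rank_inferences`, `recall_at_k`) are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_inferences(clue_emb, candidate_embs):
    """Return candidate indices sorted by descending similarity to the clue."""
    return sorted(range(len(candidate_embs)),
                  key=lambda i: cosine(clue_emb, candidate_embs[i]),
                  reverse=True)

def recall_at_k(rankings, gold_indices, k):
    """Fraction of queries whose gold inference appears in the top k."""
    hits = sum(1 for ranks, gold in zip(rankings, gold_indices)
               if gold in ranks[:k])
    return hits / len(gold_indices)

# Toy setup: 2 clue embeddings, 3 candidate inference embeddings.
candidates = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
clues = [[0.9, 0.1], [0.1, 0.9]]
rankings = [rank_inferences(c, candidates) for c in clues]
print(recall_at_k(rankings, gold_indices=[0, 1], k=1))  # 1.0
```

In the actual benchmark the embeddings would come from an image encoder applied to the clue's bounding-box region and a text encoder applied to each candidate inference; only the ranking and recall bookkeeping are shown here.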
Related papers
- When Does Perceptual Alignment Benefit Vision Representations? [76.32336818860965]
We investigate how aligning vision model representations to human perceptual judgments impacts their usability.
We find that aligning models to perceptual judgments yields representations that improve upon the original backbones across many downstream tasks.
Our results suggest that injecting an inductive bias about human perceptual knowledge into vision models can contribute to better representations.
arXiv Detail & Related papers (2024-10-14T17:59:58Z)
- Find Someone Who: Visual Commonsense Understanding in Human-Centric Grounding [87.39245901710079]
We present a new commonsense task, Human-centric Commonsense Grounding.
It tests models' ability to ground individuals given context descriptions of what happened before.
We set up a context-object-aware method as a strong baseline that outperforms previous pre-trained and non-pretrained models.
arXiv Detail & Related papers (2022-12-14T01:37:16Z)
- Prediction of Scene Plausibility [11.641785968519114]
Plausibility can be defined both in terms of physical properties and in terms of functional and typical arrangements.
We build a dataset of synthetic images containing both plausible and implausible scenes.
We test the success of various vision models in the task of recognizing and understanding plausibility.
arXiv Detail & Related papers (2022-12-02T22:22:16Z)
- Visual Abductive Reasoning [85.17040703205608]
Abductive reasoning seeks the likeliest possible explanation for partial observations.
We propose a new task and dataset, Visual Abductive Reasoning (VAR), for examining the abductive reasoning ability of machine intelligence in everyday visual situations.
arXiv Detail & Related papers (2022-03-26T10:17:03Z)
- What does it mean to represent? Mental representations as falsifiable memory patterns [8.430851504111585]
We argue that causal and teleological approaches fail to provide a satisfactory account of representation.
We sketch an alternative according to which representations correspond to inferred latent structures in the world.
These structures are assumed to have certain properties objectively, which allows for planning, prediction, and detection of unexpected events.
arXiv Detail & Related papers (2022-03-06T12:52:42Z)
- PTR: A Benchmark for Part-based Conceptual, Relational, and Physical Reasoning [135.2892665079159]
We introduce a new large-scale diagnostic visual reasoning dataset named PTR.
PTR contains around 70k RGBD synthetic images with ground-truth object- and part-level annotations.
We examine several state-of-the-art visual reasoning models on this dataset and observe that they still make many surprising mistakes.
arXiv Detail & Related papers (2021-12-09T18:59:34Z)
- Abstract Spatial-Temporal Reasoning via Probabilistic Abduction and Execution [97.50813120600026]
Spatial-temporal reasoning is a challenging task in Artificial Intelligence (AI).
Recent works have focused on an abstract reasoning task of this kind: Raven's Progressive Matrices (RPM).
We propose a neuro-symbolic Probabilistic Abduction and Execution (PrAE) learner.
arXiv Detail & Related papers (2021-03-26T02:42:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.