WinoViz: Probing Visual Properties of Objects Under Different States
- URL: http://arxiv.org/abs/2402.13584v1
- Date: Wed, 21 Feb 2024 07:31:47 GMT
- Title: WinoViz: Probing Visual Properties of Objects Under Different States
- Authors: Woojeong Jin, Tejas Srinivasan, Jesse Thomason, Xiang Ren
- Abstract summary: We present a text-only evaluation dataset consisting of 1,380 examples that probe the reasoning abilities of language models regarding variant visual properties of objects under different contexts or states.
Our task is challenging since it requires pragmatic reasoning (finding intended meanings) and visual knowledge reasoning.
We also present multi-hop data, a more challenging version of our data, which requires multi-step reasoning chains to solve our task.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans perceive and comprehend different visual properties of an object based
on specific contexts. For instance, we know that a banana turns brown "when it
becomes rotten," whereas it appears green "when it is unripe." Previous
studies on probing visual commonsense knowledge have primarily focused on
examining language models' understanding of typical properties (e.g., colors
and shapes) of objects. We present WinoViz, a text-only evaluation dataset,
consisting of 1,380 examples that probe the reasoning abilities of language
models regarding variant visual properties of objects under different contexts
or states. Our task is challenging since it requires pragmatic reasoning
(finding intended meanings) and visual knowledge reasoning. We also present
multi-hop data, a more challenging version of our data, which requires
multi-step reasoning chains to solve our task. In our experimental analysis,
our findings are: a) Large language models such as GPT-4 perform well overall,
but their performance degrades significantly on multi-hop data. b) Large models
handle pragmatic reasoning well, but visual knowledge reasoning is the
bottleneck in our task. c) Vision-language models outperform their
language-model counterparts. d) A model supplied with machine-generated images
performs poorly on our task, owing to the poor quality of the generated images.
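To make the probing setup concrete, the sketch below frames a hypothetical WinoViz-style example (mirroring the banana case from the abstract) as a two-option prompt for a text-only language model. The WinoVizExample schema, field names, and prompt wording are illustrative assumptions, not the dataset's actual format or evaluation code.

```python
# Minimal sketch of a WinoViz-style probe. The schema and prompt template
# below are illustrative assumptions and do not reproduce the actual
# WinoViz data format or evaluation pipeline.
from dataclasses import dataclass

@dataclass
class WinoVizExample:
    sentence: str   # context describing the object in a particular state
    options: tuple  # two candidate visual properties, e.g. ("brown", "green")
    label: int      # index of the correct option

def build_prompt(ex: WinoVizExample) -> str:
    """Render the example as a two-choice prompt for a text-only LM."""
    return (
        f"{ex.sentence}\n"
        "Which visual property of the object fits this context?\n"
        f"A) {ex.options[0]}\n"
        f"B) {ex.options[1]}\n"
        "Answer:"
    )

def is_correct(ex: WinoVizExample, model_answer: str) -> bool:
    """Map the model's 'A'/'B' answer back to an option index and compare."""
    return {"A": 0, "B": 1}.get(model_answer.strip().upper()[:1]) == ex.label

# Hypothetical example based on the banana case described in the abstract.
example = WinoVizExample(
    sentence="The banana sat forgotten on the counter until it became rotten.",
    options=("brown", "green"),
    label=0,
)

print(build_prompt(example))
# The prompt would be sent to whichever language model is being probed;
# is_correct(example, "A") then checks the returned choice.
```

A single-hop example like this only requires linking the state (rotten) to the property (brown); the multi-hop variant described in the abstract would presumably chain additional inference steps before the same final choice.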
Related papers
- Visually Grounded Language Learning: a review of language games,
datasets, tasks, and models [60.2604624857992]
Many Vision+Language (V+L) tasks have been defined with the aim of creating models that can ground symbols in the visual modality.
In this work, we provide a systematic literature review of several tasks and models proposed in the V+L field.
arXiv Detail & Related papers (2023-12-05T02:17:29Z) - DesCo: Learning Object Recognition with Rich Language Descriptions [93.8177229428617]
Recent developments in vision-language approaches have instigated a paradigm shift in learning visual recognition models from language supervision.
We propose a new description-conditioned (DesCo) paradigm of learning object recognition models with rich language descriptions.
arXiv Detail & Related papers (2023-06-24T21:05:02Z) - What does BERT learn about prosody? [1.1548853370822343]
We study whether prosody is part of the structural information of the language that models learn.
Our results show that information about prosodic prominence spans many layers but is mostly concentrated in the middle layers, suggesting that BERT relies mostly on syntactic and semantic information.
arXiv Detail & Related papers (2023-04-25T10:34:56Z) - Paparazzi: A Deep Dive into the Capabilities of Language and Vision
Models for Grounding Viewpoint Descriptions [4.026600887656479]
We investigate whether a state-of-the-art language and vision model, CLIP, is able to ground perspective descriptions of a 3D object.
We present an evaluation framework that uses a circling camera around a 3D object to generate images from different viewpoints.
We find that a pre-trained CLIP model performs poorly on most canonical views.
arXiv Detail & Related papers (2023-02-13T15:18:27Z) - Localization vs. Semantics: Visual Representations in Unimodal and
Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks than vision-only models.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z) - An Empirical Investigation of Commonsense Self-Supervision with
Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z) - A Closer Look at Linguistic Knowledge in Masked Language Models: The
Case of Relative Clauses in American English [17.993417004424078]
Transformer-based language models achieve high performance on various tasks, but we still lack understanding of the kind of linguistic knowledge they learn and rely on.
We evaluate three models (BERT, RoBERTa, and ALBERT), testing their grammatical and semantic knowledge through sentence-level probing, diagnostic cases, and masked prediction tasks.
arXiv Detail & Related papers (2020-11-02T13:25:39Z) - Probing Contextual Language Models for Common Ground with Visual
Representations [76.05769268286038]
We design a probing model that evaluates how effective text-only representations are in distinguishing between matching and non-matching visual representations.
Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories.
Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly underperform humans.
arXiv Detail & Related papers (2020-05-01T21:28:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.