The Scenario Refiner: Grounding subjects in images at the morphological
level
- URL: http://arxiv.org/abs/2309.11252v1
- Date: Wed, 20 Sep 2023 12:23:06 GMT
- Title: The Scenario Refiner: Grounding subjects in images at the morphological
level
- Authors: Claudia Tagliaferri, Sofia Axioti, Albert Gatt and Denis Paperno
- Abstract summary: We ask whether Vision and Language (V&L) models capture such distinctions at the morphological level.
We compare the results from V&L models to human judgements and find that models' predictions differ from those of human participants.
- Score: 2.401993998791928
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Derivationally related words, such as "runner" and "running", exhibit
semantic differences which also elicit different visual scenarios. In this
paper, we ask whether Vision and Language (V&L) models capture such
distinctions at the morphological level, using a new methodology and dataset.
We compare the results from V&L models to human judgements and find that
models' predictions differ from those of human participants, in particular
displaying a grammatical bias. We further investigate whether the human-model
misalignment is related to model architecture. Our methodology, developed on
one specific morphological contrast, can be further extended for testing models
on capturing other nuanced language features.
Related papers
- Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models [65.82564074712836]
We introduce DIFfusionHOI, a new HOI detector that leverages text-to-image diffusion models.
We first devise an inversion-based strategy to learn the expression of relation patterns between humans and objects in embedding space.
These learned relation embeddings then serve as textual prompts to steer diffusion models to generate images that depict specific interactions.
arXiv Detail & Related papers (2024-10-26T12:00:33Z)
- Evaluating Vision-Language Models on Bistable Images [34.492117496933915]
This study is the most extensive examination of vision-language models using bistable images to date.
We manually gathered a dataset of 29 bistable images, along with their associated labels, and subjected them to 116 different manipulations in brightness, tint, and rotation.
Our findings reveal that, with the exception of models from the Idefics family and LLaVA1.5-13b, there is a pronounced preference for one interpretation over another.
arXiv Detail & Related papers (2024-05-29T18:04:59Z)
- Pairing Orthographically Variant Literary Words to Standard Equivalents Using Neural Edit Distance Models [0.0]
We present a novel corpus consisting of orthographically variant words found in works of 19th century U.S. literature annotated with their corresponding "standard" word pair.
We train a set of neural edit distance models to pair these variants with their standard forms, and compare the performance of these models to the performance of a set of neural edit distance models trained on a corpus of orthographic errors made by L2 English learners.
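The pairing task described above builds on the notion of edit distance between a variant spelling and its standard form. As a minimal illustration of the underlying idea (not the paper's neural models, which learn the costs of edit operations from data; function and variable names here are illustrative), a classical fixed-cost Levenshtein distance can already pair a variant with its nearest candidate:

```python
# Classical Levenshtein edit distance via dynamic programming.
# The neural edit distance models in the paper learn weighted versions
# of these operations; this fixed-cost variant is the baseline idea.
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))            # distances for empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]                            # cost of deleting i chars of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1       # substitution cost
            curr.append(min(prev[j] + 1,      # deletion
                            curr[j - 1] + 1,  # insertion
                            prev[j - 1] + cost))  # match / substitute
        prev = curr
    return prev[-1]

# Pair a variant spelling with the closest form in a standard lexicon
# (hypothetical helper; ties broken by list order).
def closest_standard(variant: str, lexicon: list[str]) -> str:
    return min(lexicon, key=lambda w: edit_distance(variant, w))
```

For example, `closest_standard("colour", ["color", "collar"])` picks `"color"`, since one deletion suffices while `"collar"` needs two substitutions.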
arXiv Detail & Related papers (2024-01-26T18:49:34Z)
- Sequential Modeling Enables Scalable Learning for Large Vision Models [120.91839619284431]
We introduce a novel sequential modeling approach which enables learning a Large Vision Model (LVM) without making use of any linguistic data.
We define a common format, "visual sentences", in which we can represent raw images and videos as well as annotated data sources.
arXiv Detail & Related papers (2023-12-01T18:59:57Z)
- Longer Fixations, More Computation: Gaze-Guided Recurrent Neural Networks [12.57650361978445]
Humans read texts at a varying pace, while machine learning models treat each token in the same way.
In this paper, we convert this intuition into a set of novel models with fixation-guided parallel RNNs or layers.
We find that, interestingly, the fixation durations predicted by neural networks bear some resemblance to those of humans.
arXiv Detail & Related papers (2023-10-31T21:32:11Z)
- Foundational Models Defining a New Era in Vision: A Survey and Outlook [151.49434496615427]
Vision systems that see and reason about the compositional nature of visual scenes are fundamental to understanding our world.
The models learned to bridge the gap between such modalities coupled with large-scale training data facilitate contextual reasoning, generalization, and prompt capabilities at test time.
The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, having interactive dialogues by asking questions about an image or video scene, or manipulating the robot's behavior through language instructions.
arXiv Detail & Related papers (2023-07-25T17:59:18Z)
- Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z)
- Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645]
We show how vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery.
We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
arXiv Detail & Related papers (2022-10-18T17:01:35Z)
- The Grammar-Learning Trajectories of Neural Language Models [42.32479280480742]
We show that neural language models acquire linguistic phenomena in a similar order, despite having different end performances over the data.
Results suggest that NLMs exhibit consistent "developmental" stages.
arXiv Detail & Related papers (2021-09-13T16:17:23Z)
- Interpretable Multi-dataset Evaluation for Named Entity Recognition [110.64368106131062]
We present a general methodology for interpretable evaluation for the named entity recognition (NER) task.
The proposed evaluation method enables us to interpret the differences in models and datasets, as well as the interplay between them.
By making our analysis tool available, we make it easy for future researchers to run similar analyses and drive progress in this area.
arXiv Detail & Related papers (2020-11-13T10:53:27Z)
- Similarity Analysis of Contextual Word Representation Models [39.12749165544309]
We use existing and novel similarity measures to gauge the level of localization of information in the deep models.
The analysis reveals that models within the same family are more similar to one another, as may be expected.
Surprisingly, different architectures have rather similar representations, but different individual neurons.
arXiv Detail & Related papers (2020-05-03T19:48:15Z)
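The similarity measures mentioned in the entry above can be illustrated with linear CKA (centered kernel alignment), one common choice for comparing representations across models. This is a generic sketch of that measure, not necessarily the paper's exact formulation:

```python
import numpy as np

# Linear CKA: similarity between two representation matrices
# (n examples x d features). Invariant to orthogonal rotation
# and isotropic scaling of the feature spaces.
def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    X = X - X.mean(axis=0)                         # center each feature
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(X.T @ Y, "fro") ** 2     # cross-covariance strength
    norm_x = np.linalg.norm(X.T @ X, "fro")        # self-similarity of X
    norm_y = np.linalg.norm(Y.T @ Y, "fro")        # self-similarity of Y
    return hsic / (norm_x * norm_y)
```

By construction, a representation compared with itself (or with any rotated copy of itself) scores 1.0, which is what makes the measure useful for comparing models whose feature axes are not aligned.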
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.