A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models
- URL: http://arxiv.org/abs/2309.02691v3
- Date: Thu, 30 May 2024 21:16:29 GMT
- Title: A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models
- Authors: Noriyuki Kojima, Hadar Averbuch-Elor, Yoav Artzi
- Abstract summary: Key to tasks that require reasoning about natural language in visual contexts is grounding words and phrases to image regions.
We propose a framework to jointly study task performance and phrase grounding.
We show how this can be addressed through brute-force training on phrase grounding annotations.
- Score: 28.746370086515977
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Key to tasks that require reasoning about natural language in visual contexts is grounding words and phrases to image regions. However, observing this grounding in contemporary models is complex, even if it is generally expected to take place if the task is addressed in a way that is conducive to generalization. We propose a framework to jointly study task performance and phrase grounding, and propose three benchmarks to study the relation between the two. Our results show that contemporary models demonstrate inconsistency between their ability to ground phrases and solve tasks. We show how this can be addressed through brute-force training on phrase grounding annotations, and analyze the dynamics it creates. Code and data are available at https://github.com/lil-lab/phrase_grounding.
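The abstract's recipe of supervising the task jointly with phrase grounding annotations can be illustrated with a minimal sketch. The function below is a hypothetical illustration, not the authors' released code (see the linked repository for that); the names, tensor shapes, and the specific alignment formulation are assumptions.

```python
# Hypothetical sketch of a joint objective: task loss + phrase-grounding loss.
# Shapes and the alignment formulation are assumptions for illustration only;
# see https://github.com/lil-lab/phrase_grounding for the actual implementation.
import torch
import torch.nn.functional as F

def joint_loss(task_logits, task_labels, phrase_embeds, region_embeds,
               gold_region_ids, grounding_weight=1.0):
    """Combine downstream task supervision with phrase-to-region alignment.

    task_logits:     (batch, num_classes) task predictions
    task_labels:     (batch,) gold task labels
    phrase_embeds:   (num_phrases, dim) pooled phrase representations
    region_embeds:   (num_phrases, num_regions, dim) candidate region features
    gold_region_ids: (num_phrases,) index of the annotated region per phrase
    """
    # Standard task supervision (e.g., classification over answers).
    task_loss = F.cross_entropy(task_logits, task_labels)

    # Phrase grounding as region selection: score each candidate region against
    # its phrase, then apply cross-entropy over the annotated gold region.
    scores = torch.einsum("pd,prd->pr", phrase_embeds, region_embeds)
    grounding_loss = F.cross_entropy(scores, gold_region_ids)

    return task_loss + grounding_weight * grounding_loss
```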
Related papers
- Foundational Models Defining a New Era in Vision: A Survey and Outlook [151.49434496615427]
Vision systems that see and reason about the compositional nature of visual scenes are fundamental to understanding our world.
The models learned to bridge the gap between such modalities, coupled with large-scale training data, facilitate contextual reasoning, generalization, and prompt capabilities at test time.
The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, having interactive dialogues by asking questions about an image or video scene or manipulating the robot's behavior through language instructions.
arXiv Detail & Related papers (2023-07-25T17:59:18Z) - Learning Zero-Shot Multifaceted Visually Grounded Word Embeddings via
Multi-Task Training [8.271859911016719]
Language grounding aims at linking the symbolic representation of language (e.g., words) into the rich perceptual knowledge of the outside world.
We argue that this approach sacrifices the abstract knowledge obtained from linguistic co-occurrence statistics in the process of acquiring perceptual information.
arXiv Detail & Related papers (2021-04-15T14:49:11Z) - Probing Task-Oriented Dialogue Representation from Language Models [106.02947285212132]
This paper investigates pre-trained language models to find out which model intrinsically carries the most informative representation for task-oriented dialogue tasks.
We fine-tune a feed-forward layer as a classifier probe on top of a fixed pre-trained language model, supervised with annotated labels (a minimal sketch of this kind of probe appears after this list).
arXiv Detail & Related papers (2020-10-26T21:34:39Z) - Image Captioning with Visual Object Representations Grounded in the
Textual Modality [14.797241131469486]
We explore the possibilities of a shared embedding space between the textual and visual modalities.
We propose an approach opposite to the current trend: grounding the representations in the word embedding space of the captioning system.
arXiv Detail & Related papers (2020-10-19T12:21:38Z) - Visual Relation Grounding in Videos [86.06874453626347]
We explore a novel task named visual Relation Grounding in Videos (RGV).
This task aims at providing supportive visual facts for other video-language tasks (e.g., video grounding and video question answering).
We tackle these challenges by collaboratively optimizing two sequences of regions over a constructed hierarchical spatio-temporal region graph.
Experimental results demonstrate that our model not only outperforms baseline approaches significantly, but also produces visually meaningful facts.
arXiv Detail & Related papers (2020-07-17T08:20:39Z) - Words aren't enough, their order matters: On the Robustness of Grounding
Visual Referring Expressions [87.33156149634392]
We critically examine RefCOCOg, a standard benchmark for visual referring expression recognition.
We show that 83.7% of test instances do not require reasoning on linguistic structure.
We propose two methods, one based on contrastive learning and the other based on multi-task learning, to increase the robustness of ViLBERT.
arXiv Detail & Related papers (2020-05-04T17:09:15Z) - Probing Contextual Language Models for Common Ground with Visual
Representations [76.05769268286038]
We design a probing model that evaluates how effective text-only representations are at distinguishing between matching and non-matching visual representations.
Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories.
Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly under-perform humans.
arXiv Detail & Related papers (2020-05-01T21:28:28Z) - How Far are We from Effective Context Modeling? An Exploratory Study on
Semantic Parsing in Context [59.13515950353125]
We present a grammar-based decoding semantic parser and adapt typical context modeling methods on top of it.
We evaluate 13 context modeling methods on two large cross-domain datasets, and our best model achieves state-of-the-art performances.
arXiv Detail & Related papers (2020-02-03T11:28:10Z)
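Several of the entries above probe a frozen pre-trained language model with a lightweight supervised classifier, most directly the task-oriented dialogue probing paper. The sketch below shows what such a probe could look like under stated assumptions (a HuggingFace backbone, [CLS] pooling, and a placeholder label set); it is illustrative and not any of the papers' actual code.

```python
# Rough sketch of a classifier probe on a frozen pre-trained LM.
# Backbone name, label count, and pooling choice are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"              # placeholder backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoder = AutoModel.from_pretrained(model_name)
encoder.eval()
for p in encoder.parameters():                # keep the LM fixed; only the probe trains
    p.requires_grad = False

probe = torch.nn.Linear(encoder.config.hidden_size, 4)   # e.g., 4 placeholder labels
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

def probe_step(texts, labels):
    """One supervised update of the probe on frozen LM representations."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():                     # representations stay frozen
        hidden = encoder(**batch).last_hidden_state[:, 0]   # [CLS] pooling
    logits = probe(hidden)
    loss = torch.nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```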