3VL: using Trees to teach Vision & Language models compositional
concepts
- URL: http://arxiv.org/abs/2312.17345v1
- Date: Thu, 28 Dec 2023 20:26:03 GMT
- Title: 3VL: using Trees to teach Vision & Language models compositional
concepts
- Authors: Nir Yellinek, Leonid Karlinsky and Raja Giryes
- Abstract summary: We introduce the Tree-augmented Vision-Language (3VL) model architecture and training technique.
We show how Anchor, a simple technique for text unification, can be employed to filter nuisance factors.
We also exhibit how DiRe, which performs a differential comparison between VLM relevancy maps, enables us to generate compelling visualizations of the reasons for a model's success or failure.
- Score: 45.718319397947056
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Vision-Language models (VLMs) have proved effective at aligning image and
text representations, producing superior zero-shot results when transferred to
many downstream tasks. However, these representations suffer from key
shortcomings in Compositional Language Concepts (CLC) understanding, such as
recognizing objects' attributes, states, and relations between different
objects. Moreover, VLMs typically have poor interpretability, making it
challenging to debug and mitigate compositional-understanding failures. In this
work, we introduce the Tree-augmented Vision-Language (3VL) model architecture
and training technique accompanied by our proposed Anchor inference method and
Differential Relevance (DiRe) interpretability tool. By expanding the text of
an arbitrary image-text pair into a hierarchical tree structure using language
analysis tools, 3VL allows inducing this structure into the visual
representation learned by the model, enhancing its interpretability and
compositional reasoning. Additionally, we show how Anchor, a simple technique
for text unification, can be employed to filter nuisance factors while
increasing CLC understanding performance, e.g., on the fundamental VL-Checklist
benchmark. We also exhibit how DiRe, which performs a differential comparison
between VLM relevancy maps, enables us to generate compelling visualizations of
the reasons for a model's success or failure.
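The abstract describes two mechanisms: expanding a caption into a hierarchical tree of increasingly specific captions, and DiRe's differential comparison of relevancy maps. The sketch below is illustrative only and is not the authors' implementation: the paper uses language-analysis tools (e.g., a parser) to build the tree, whereas here the `expand_caption_tree` function, its hand-supplied (attribute, noun) pairs, and the `dire` normalization rule are all simplifying assumptions.

```python
# Illustrative sketch (not the paper's code) of the two ideas in the abstract:
# (1) expanding a caption into a coarse-to-fine tree of captions, and
# (2) a DiRe-style differential comparison of two relevancy maps.
import numpy as np

def expand_caption_tree(noun_phrases):
    """Toy stand-in for the language-analysis step: given (attribute, noun)
    pairs already extracted from a caption, build tree levels from coarse
    (attributes dropped) to fine (all attributes restored)."""
    levels = []
    for k in range(len(noun_phrases) + 1):
        caption = " and ".join(
            f"{attr} {noun}" if i < k else noun
            for i, (attr, noun) in enumerate(noun_phrases)
        )
        levels.append("a photo of " + caption)
    return levels

def dire(map_pos, map_neg, eps=1e-8):
    """DiRe-style differential relevance: normalize two relevancy maps and
    keep the regions where the positive caption attends more strongly than
    the negative one (clipping away the rest)."""
    p = map_pos / (map_pos.sum() + eps)
    n = map_neg / (map_neg.sum() + eps)
    return np.clip(p - n, 0.0, None)

tree = expand_caption_tree([("red", "car"), ("blue", "bus")])
# tree[0]: "a photo of car and bus" (coarsest)
# tree[2]: "a photo of red car and blue bus" (full caption)
```

Training against each level of such a tree is what lets the model be supervised on the coarse caption before the attribute-laden one; `dire` then highlights only the image regions that support the correct caption over an incorrect alternative.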
Related papers
- Causal Graphical Models for Vision-Language Compositional Understanding [36.24185263818946]
We show that our method outperforms all state-of-the-art compositional approaches by a large margin, and it also improves over methods trained on much larger datasets.
arXiv Detail & Related papers (2024-12-12T15:22:03Z)
- Barking Up The Syntactic Tree: Enhancing VLM Training with Syntactic Losses [31.85977999591524]
Vision-Language Models (VLMs) have achieved strong performance on a variety of tasks (e.g., image-text retrieval, visual question answering).
We propose HIerarchically STructured Learning (HIST) that enhances VLM training without any additional supervision.
arXiv Detail & Related papers (2024-12-11T05:36:18Z)
- Language Model as Visual Explainer [72.88137795439407]
We present a systematic approach for interpreting vision models using a tree-structured linguistic explanation.
Our method provides human-understandable explanations in the form of attribute-laden trees.
To assess the effectiveness of our approach, we introduce new benchmarks and conduct rigorous evaluations.
arXiv Detail & Related papers (2024-12-08T20:46:23Z)
- Unified Lexical Representation for Interpretable Visual-Language Alignment [52.059812317944434]
We introduce LexVLA, a framework for learning a unified lexical representation for both modalities without complex design.
We use DINOv2 as our visual model for its locally-oriented features, and Llama 2, a generative language model, to leverage its in-context lexical prediction ability.
We demonstrate that these two pre-trained uni-modal models can be well-aligned by fine-tuning on a modest multi-modal dataset.
arXiv Detail & Related papers (2024-07-25T07:35:27Z)
- Leveraging VLM-Based Pipelines to Annotate 3D Objects [68.51034848207355]
We propose an alternative algorithm to marginalize over factors such as the viewpoint that affect the VLM's response.
Instead of merging text-only responses, we utilize the VLM's joint image-text likelihoods.
We show how a VLM-based pipeline can be leveraged to produce reliable annotations for 764K objects.
arXiv Detail & Related papers (2023-11-29T17:54:22Z)
- Text encoders bottleneck compositionality in contrastive vision-language models [76.2406963762722]
We train text-only recovery probes that aim to reconstruct captions from single-vector text representations.
We find that CLIP's text encoder falls short on more compositional inputs.
Results suggest text-only recoverability is a necessary (but not sufficient) condition for modeling compositional factors.
arXiv Detail & Related papers (2023-05-24T08:48:44Z)
- Teaching Structured Vision&Language Concepts to Vision&Language Models [46.344585368641006]
We introduce the collective notion of Structured Vision&Language Concepts (SVLC), which includes object attributes, relations, and states that are present in the text and visible in the image.
We propose a more elegant data-driven approach for enhancing VL models' understanding of SVLCs.
arXiv Detail & Related papers (2022-11-21T18:54:10Z)
- Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene Graphs with Language Structures via Dependency Relationships [17.930724926012264]
We introduce a new task that targets inducing a joint vision-language structure in an unsupervised manner.
Our goal is to bridge the visual scene graphs and linguistic dependency trees seamlessly.
We propose an automatic alignment procedure that produces coarse structures, followed by human refinement to obtain high-quality ones.
arXiv Detail & Related papers (2022-03-27T09:51:34Z)
- Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.