Teaching Structured Vision&Language Concepts to Vision&Language Models
- URL: http://arxiv.org/abs/2211.11733v2
- Date: Tue, 30 May 2023 17:08:43 GMT
- Title: Teaching Structured Vision&Language Concepts to Vision&Language Models
- Authors: Sivan Doveh, Assaf Arbelle, Sivan Harary, Rameswar Panda, Roei Herzig,
Eli Schwartz, Donghyun Kim, Raja Giryes, Rogerio Feris, Shimon Ullman, Leonid
Karlinsky
- Abstract summary: We introduce the collective notion of Structured Vision&Language Concepts (SVLC).
SVLC includes object attributes, relations, and states that are present in the text and visible in the image.
We propose a more elegant data-driven approach for enhancing VL models' understanding of SVLCs.
- Score: 46.344585368641006
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision and Language (VL) models have demonstrated remarkable zero-shot
performance in a variety of tasks. However, some aspects of complex language
understanding still remain a challenge. We introduce the collective notion of
Structured Vision&Language Concepts (SVLC), which includes object attributes,
relations, and states that are present in the text and visible in the image.
Recent studies have shown that even the best VL models struggle with SVLC. A
possible way of fixing this issue is by collecting dedicated datasets for
teaching each SVLC type, yet this might be expensive and time-consuming.
Instead, we propose a more elegant data-driven approach for enhancing VL
models' understanding of SVLCs that makes more effective use of existing VL
pre-training datasets and does not require any additional data. While automatic
understanding of image structure still remains largely unsolved, language
structure is much better modeled and understood, allowing for its effective
utilization in teaching VL models. In this paper, we propose various techniques
based on language structure understanding that can be used to manipulate the
textual part of off-the-shelf paired VL datasets. VL models trained with the
updated data exhibit a significant improvement of up to 15% in their SVLC
understanding, with only a mild degradation in their zero-shot capabilities,
whether training from scratch or fine-tuning a pre-trained model.
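The core idea of the abstract, manipulating only the textual side of existing image-text pairs so the model must attend to attributes, relations, and states, can be illustrated with a small sketch. This is not the authors' implementation: it assumes a toy rule-based color-attribute swap to build a hard negative caption and a simple CLIP-style cross-entropy term that pushes each image toward its original caption; the function names and the attribute vocabulary are hypothetical.

```python
# Minimal sketch (assumed, not the paper's code) of rule-based text manipulation
# for SVLC training: swap one attribute word in a caption to create a hard
# negative, then score the image against {original, manipulated} captions.
import random
import torch
import torch.nn.functional as F

COLORS = ["red", "blue", "green", "black", "white", "yellow"]  # toy attribute vocabulary

def make_negative_caption(caption: str) -> str:
    """Swap one attribute word (here: a color) for a different one to form a hard negative."""
    words = caption.split()
    hits = [i for i, w in enumerate(words) if w.lower() in COLORS]
    if not hits:
        return caption  # no attribute found; a real pipeline would skip this sample
    i = random.choice(hits)
    words[i] = random.choice([c for c in COLORS if c != words[i].lower()])
    return " ".join(words)

def svlc_negative_loss(image_emb, pos_text_emb, neg_text_emb, tau=0.07):
    """Cross-entropy over {original caption, manipulated caption} for each image.

    All embeddings are assumed L2-normalized with shape (batch, dim).
    """
    pos = (image_emb * pos_text_emb).sum(dim=-1) / tau   # similarity to the true caption
    neg = (image_emb * neg_text_emb).sum(dim=-1) / tau   # similarity to the swapped caption
    logits = torch.stack([pos, neg], dim=1)              # (batch, 2)
    targets = torch.zeros(image_emb.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)
```

In the paper, such manipulations cover attributes, relations, and states and are added on top of the standard image-text contrastive objective; the single-negative loss and color list above are illustrative assumptions only.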
Related papers
- In-Context Learning Improves Compositional Understanding of Vision-Language Models [2.762909189433944]
Compositional image understanding remains a difficult task due to the object bias present in training data.
We compare contrastive models with generative ones and analyze their differences in architecture, pre-training data, and training tasks and losses.
Our proposed approach outperforms baseline models across multiple compositional understanding datasets.
arXiv Detail & Related papers (2024-07-22T09:03:29Z)
- Harnessing Vision-Language Pretrained Models with Temporal-Aware Adaptation for Referring Video Object Segmentation [34.37450315995176]
Current Referring Video Object Segmentation (RVOS) methods typically use vision and language models pretrained independently as backbones.
We propose a temporal-aware prompt-tuning method, which adapts pretrained representations for pixel-level prediction.
Our method performs favorably against state-of-the-art algorithms and exhibits strong generalization abilities.
arXiv Detail & Related papers (2024-05-17T08:14:22Z)
- Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning [59.13366859237086]
Current solutions for efficiently constructing large vision-language (VL) models follow a two-step paradigm.
We consider visual prompts as additional knowledge that facilitates language models in addressing tasks associated with visual information.
We introduce a novel approach wherein visual prompts are concatenated with the weights of the FFN for visual knowledge injection.
arXiv Detail & Related papers (2024-05-09T08:23:20Z)
- DeepSeek-VL: Towards Real-World Vision-Language Understanding [24.57011093316788]
We present DeepSeek-VL, an open-source Vision-Language (VL) Model for real-world vision and language understanding applications.
Our approach is structured around three key dimensions: We strive to ensure our data is diverse, scalable, and extensively covers real-world scenarios.
We create a use case taxonomy from real user scenarios and construct an instruction tuning dataset.
arXiv Detail & Related papers (2024-03-08T18:46:00Z)
- MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models [74.89629463600978]
In the vision-language domain, most large-scale pre-trained vision-language models do not possess the ability to conduct in-context learning.
In this paper, we study an interesting hypothesis: can we transfer the in-context learning ability from the language domain to the vision domain?
arXiv Detail & Related papers (2023-06-02T07:21:03Z)
- Going Beyond Nouns With Vision & Language Models Using Synthetic Data [43.87754926411406]
Large-scale pre-trained Vision & Language (VL) models have shown remarkable performance in many applications.
Recent works have uncovered a fundamental weakness of these models.
We investigate to what extent purely synthetic data could be leveraged to teach these models to overcome such shortcomings.
arXiv Detail & Related papers (2023-03-30T17:57:43Z)
- ConStruct-VL: Data-Free Continual Structured VL Concepts Learning [57.86651057895222]
We introduce the first Continual Data-Free Structured VL Concepts Learning (ConStruct-VL) benchmark.
We propose a data-free method built around Adversarial Pseudo-Replay (APR), which generates adversarial reminders of past tasks from past task models.
We show this approach outperforms all data-free methods by as much as 7% while even matching some levels of experience-replay.
arXiv Detail & Related papers (2022-11-17T18:57:03Z)
- PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models [127.17675443137064]
We introduce PEVL, which enhances the pre-training and prompt tuning of vision-language models with explicit object position modeling.
PEVL reformulates discretized object positions and language in a unified language modeling framework.
We show that PEVL enables state-of-the-art performance on position-sensitive tasks such as referring expression comprehension and phrase grounding.
arXiv Detail & Related papers (2022-05-23T10:17:53Z)
- Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation [79.72299298976525]
We propose to augment a vision-language pre-training model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD).
Experiments show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning.
The original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.
arXiv Detail & Related papers (2022-03-12T09:33:37Z)