Going Beyond Nouns With Vision & Language Models Using Synthetic Data
- URL: http://arxiv.org/abs/2303.17590v2
- Date: Wed, 30 Aug 2023 17:46:17 GMT
- Title: Going Beyond Nouns With Vision & Language Models Using Synthetic Data
- Authors: Paola Cascante-Bonilla, Khaled Shehada, James Seale Smith, Sivan
Doveh, Donghyun Kim, Rameswar Panda, Gül Varol, Aude Oliva, Vicente
Ordonez, Rogerio Feris, Leonid Karlinsky
- Abstract summary: Large-scale pre-trained Vision & Language (VL) models have shown remarkable performance in many applications.
Recent works have uncovered a fundamental weakness of these models.
We investigate to what extent purely synthetic data can be leveraged to teach these models to overcome such shortcomings.
- Score: 43.87754926411406
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale pre-trained Vision & Language (VL) models have shown remarkable
performance in many applications, enabling the replacement of a fixed set of
supported classes with zero-shot open-vocabulary reasoning over (almost
arbitrary) natural language prompts. However, recent works have uncovered a
fundamental weakness of these models: for example, their difficulty in
understanding Visual Language Concepts (VLC) that go 'beyond nouns', such as
the meaning of non-object words (e.g., attributes, actions, relations, states,
etc.), or their difficulty in performing compositional reasoning, such as
understanding the significance of the order of words in a sentence. In this
work, we investigate to what extent purely synthetic data can be leveraged to
teach these models to overcome such shortcomings without compromising their
zero-shot capabilities. We contribute Synthetic Visual Concepts (SyViC), a
million-scale synthetic dataset and data generation codebase that allows
generating additional suitable data to improve the VLC understanding and
compositional reasoning of VL
models. Additionally, we propose a general VL finetuning strategy for
effectively leveraging SyViC towards achieving these improvements. Our
extensive experiments and ablations on VL-Checklist, Winoground, and ARO
benchmarks demonstrate that it is possible to adapt strong pre-trained VL
models with synthetic data, significantly enhancing their VLC understanding
(e.g., by 9.9% on ARO and 4.3% on VL-Checklist) with an under-1% drop in their
zero-shot accuracy.
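As a rough illustration of the kind of adaptation the abstract describes, the sketch below contrastively finetunes a CLIP-style model on synthetic image-caption pairs and then interpolates the finetuned weights with the original checkpoint to limit the drop in zero-shot accuracy. The Hugging Face CLIP checkpoint, the placeholder synthetic pairs, and the interpolation coefficient are assumptions for illustration only; the paper's actual SyViC data and finetuning recipe may differ.
```python
# Minimal sketch (not the paper's exact recipe): finetune a CLIP-style model on
# synthetic image-caption pairs with its contrastive loss, then interpolate the
# finetuned weights with the zero-shot checkpoint to retain zero-shot accuracy.
import copy
import torch
from PIL import Image
from torch.utils.data import DataLoader
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
zero_shot_state = copy.deepcopy(model.state_dict())  # keep the original weights

# Placeholder synthetic data; in practice these would be SyViC renders paired
# with captions describing attributes, relations, actions, and states.
synthetic_pairs = [
    (Image.new("RGB", (224, 224), color=(20 * i % 255, 80, 120)),
     f"a red ball to the left of a blue cube, variant {i}")
    for i in range(8)
]

def collate(batch):
    images, captions = zip(*batch)
    return processor(text=list(captions), images=list(images),
                     return_tensors="pt", padding=True)

loader = DataLoader(synthetic_pairs, batch_size=4, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

model.train()
for batch in loader:
    batch = {k: v.to(device) for k, v in batch.items()}
    out = model(**batch, return_loss=True)  # CLIP image-text contrastive loss
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Weight-space interpolation toward the zero-shot checkpoint, a common way to
# trade part of the finetuning gain for retained zero-shot robustness.
alpha = 0.5  # illustrative mixing coefficient
finetuned_state = model.state_dict()
mixed = {
    k: alpha * v + (1 - alpha) * zero_shot_state[k] if v.is_floating_point() else v
    for k, v in finetuned_state.items()
}
model.load_state_dict(mixed)
```
The adapted model could then be checked on ARO- or VL-Checklist-style tests, for example by verifying that it scores an image's correct caption higher than a word-order-perturbed or attribute-swapped variant.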
Related papers
- Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models [59.05769810380928]
Rephrase, Augment and Reason (RepARe) is a gradient-free framework that extracts salient details about the image using the underlying vision-language model.
We show that RepARe can result in a 3.85% (absolute) increase in zero-shot accuracy on VQAv2, and increases of 6.41 and 7.94 percentage points on A-OKVQA and VizWiz, respectively.
arXiv Detail & Related papers (2023-10-09T16:57:57Z) - Improving Commonsense in Vision-Language Models via Knowledge Graph
Riddles [83.41551911845157]
This paper focuses on analyzing and improving the commonsense ability of recent popular vision-language (VL) models.
We propose a more scalable strategy, i.e., "Data Augmentation with kNowledge graph linearization for CommonsensE capability" (DANCE).
For better commonsense evaluation, we propose the first retrieval-based commonsense diagnostic benchmark.
arXiv Detail & Related papers (2022-11-29T18:59:59Z) - Teaching Structured Vision&Language Concepts to Vision&Language Models [46.344585368641006]
We introduce the collective notion of Structured Vision&Language Concepts (SVLC).
SVLC includes object attributes, relations, and states which are present in the text and visible in the image.
We propose a more elegant data-driven approach for enhancing VL models' understanding of SVLCs.
arXiv Detail & Related papers (2022-11-21T18:54:10Z) - ConStruct-VL: Data-Free Continual Structured VL Concepts Learning [57.86651057895222]
We introduce the first Continual Data-Free Structured VL Concepts Learning (ConStruct-VL) benchmark.
We propose a data-free method built on Adversarial Pseudo-Replay (APR), which generates adversarial reminders of past tasks from past task models.
We show this approach outperforms all data-free methods by as much as 7% while even matching some levels of experience-replay.
arXiv Detail & Related papers (2022-11-17T18:57:03Z) - TraVLR: Now You See It, Now You Don't! A Bimodal Dataset for Evaluating
Visio-Linguistic Reasoning [25.520406167426135]
We present TraVLR, a synthetic dataset comprising four visio-linguistic (V+L) reasoning tasks.
Each example in TraVLR redundantly encodes the scene in two modalities, allowing either to be dropped or added during training or testing without losing relevant information.
We compare the performance of four state-of-the-art V+L models, finding that while they perform well on test examples from the same modality, they all fail at cross-modal transfer.
arXiv Detail & Related papers (2021-11-21T07:22:44Z) - e-ViL: A Dataset and Benchmark for Natural Language Explanations in
Vision-Language Tasks [52.918087305406296]
We introduce e-ViL, a benchmark for evaluating explainable vision-language tasks.
We also introduce e-SNLI-VE, the largest existing dataset with natural language explanations (NLEs).
We propose a new model that combines UNITER, which learns joint embeddings of images and text, and GPT-2, a pre-trained language model.
arXiv Detail & Related papers (2021-05-08T18:46:33Z)