Improving Commonsense in Vision-Language Models via Knowledge Graph
Riddles
- URL: http://arxiv.org/abs/2211.16504v1
- Date: Tue, 29 Nov 2022 18:59:59 GMT
- Title: Improving Commonsense in Vision-Language Models via Knowledge Graph
Riddles
- Authors: Shuquan Ye and Yujia Xie and Dongdong Chen and Yichong Xu and Lu Yuan
and Chenguang Zhu and Jing Liao
- Abstract summary: This paper focuses on analyzing and improving the commonsense ability of recent popular vision-language (VL) models.
We propose a more scalable strategy, i.e., "Data Augmentation with kNowledge graph linearization for CommonsensE capability" (DANCE).
For better commonsense evaluation, we propose the first retrieval-based commonsense diagnostic benchmark.
- Score: 83.41551911845157
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper focuses on analyzing and improving the commonsense ability of
recent popular vision-language (VL) models. Despite the great success, we
observe that existing VL-models still lack commonsense knowledge/reasoning
ability (e.g., "Lemons are sour"), which is a vital component towards
artificial general intelligence. Through our analysis, we find one important
reason is that existing large-scale VL datasets do not contain much commonsense
knowledge, which motivates us to improve the commonsense of VL-models from the
data perspective. Rather than collecting a new VL training dataset, we propose
a more scalable strategy, i.e., "Data Augmentation with kNowledge graph
linearization for CommonsensE capability" (DANCE). It can be viewed as one type
of data augmentation technique, which can inject commonsense knowledge into
existing VL datasets on the fly during training. More specifically, we leverage
the commonsense knowledge graph (e.g., ConceptNet) and create variants of text
description in VL datasets via bidirectional sub-graph sequentialization. For
better commonsense evaluation, we further propose the first retrieval-based
commonsense diagnostic benchmark. By conducting extensive experiments on some
representative VL-models, we demonstrate that our DANCE technique is able to
significantly improve the commonsense ability while maintaining the performance
on vanilla retrieval tasks. The code and data are available at
https://github.com/pleaseconnectwifi/DANCE
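
To make the augmentation idea concrete, here is a minimal Python sketch of injecting linearized knowledge-graph triples into captions on the fly. The toy triples, the relation templates, and the `augment_caption` helper are illustrative assumptions for this sketch, not code or naming taken from the paper or its released repository.

```python
import random

# Toy ConceptNet-style sub-graph: (head, relation, tail) triples.
# In a real pipeline these would be retrieved for entities mentioned
# in the image-text pair; here they are hard-coded examples.
COMMONSENSE_TRIPLES = {
    "lemon": [("lemon", "HasProperty", "sour"),
              ("lemon", "IsA", "citrus fruit")],
    "dog":   [("dog", "CapableOf", "bark"),
              ("dog", "IsA", "loyal pet")],
}

# Simple relation templates used to "linearize" a triple into text.
# The reverse templates hide the head entity, loosely mirroring the
# bidirectional sub-graph sequentialization described in the abstract.
FORWARD_TEMPLATES = {
    "HasProperty": "{head} is {tail}.",
    "IsA": "{head} is a kind of {tail}.",
    "CapableOf": "{head} can {tail}.",
}
REVERSE_TEMPLATES = {
    "HasProperty": "The thing in the picture is {tail}.",
    "IsA": "The thing in the picture is a kind of {tail}.",
    "CapableOf": "The thing in the picture can {tail}.",
}


def augment_caption(caption: str) -> str:
    """Append one linearized commonsense fact to a caption, on the fly."""
    mentioned = [e for e in COMMONSENSE_TRIPLES if e in caption.lower()]
    if not mentioned:
        return caption  # no matching entity, leave the caption unchanged
    head, rel, tail = random.choice(COMMONSENSE_TRIPLES[random.choice(mentioned)])
    template = random.choice([FORWARD_TEMPLATES, REVERSE_TEMPLATES])[rel]
    return f"{caption} {template.format(head=head, tail=tail)}"


if __name__ == "__main__":
    print(augment_caption("A lemon on a wooden table."))
    # e.g. "A lemon on a wooden table. The thing in the picture is sour."
```

Because the augmentation only rewrites the text side of existing image-text pairs, it can be applied to any VL dataset during training without collecting new images.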
Related papers
- What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases [87.65903426052155]
We perform a large-scale transfer learning experiment aimed at discovering latent vision-language skills from data.
We show that generation tasks suffer from a length bias, suggesting benchmarks should balance tasks with varying output lengths.
We present a new dataset, OLIVE, which simulates user instructions in the wild and presents challenges dissimilar to all datasets we tested.
arXiv Detail & Related papers (2024-04-03T02:40:35Z) - Divert More Attention to Vision-Language Object Tracking [87.31882921111048]
We argue that the lack of large-scale vision-language annotated videos and ineffective vision-language interaction learning motivate us to design more effective vision-language representation for tracking.
Particularly, in this paper, we first propose a general attribute annotation strategy to decorate videos in six popular tracking benchmarks, which contributes a large-scale vision-language tracking database with more than 23,000 videos.
We then introduce a novel framework that improves tracking by learning a unified-adaptive VL representation, whose core components are the proposed asymmetric architecture search and modality mixer (ModaMixer).
arXiv Detail & Related papers (2023-07-19T15:22:06Z) - Going Beyond Nouns With Vision & Language Models Using Synthetic Data [43.87754926411406]
Large-scale pre-trained Vision & Language (VL) models have shown remarkable performance in many applications.
Recent works have uncovered a fundamental weakness of these models.
We investigate to what extent purely synthetic data can be leveraged to teach these models to overcome such shortcomings.
arXiv Detail & Related papers (2023-03-30T17:57:43Z) - Teaching Structured Vision&Language Concepts to Vision&Language Models [46.344585368641006]
We introduce the collective notion of Structured Vision&Language Concepts (SVLC).
SVLC includes object attributes, relations, and states which are present in the text and visible in the image.
We propose a more elegant data-driven approach for enhancing VL models' understanding of SVLCs.
arXiv Detail & Related papers (2022-11-21T18:54:10Z) - A survey on knowledge-enhanced multimodal learning [1.8591405259852054]
Multimodal learning has been a field of increasing interest, aiming to combine various modalities in a single joint representation.
Especially in the area of visiolinguistic (VL) learning, multiple models and techniques have been developed, targeting a variety of tasks that involve images and text.
VL models have reached unprecedented performances by extending the idea of Transformers, so that both modalities can learn from each other.
arXiv Detail & Related papers (2022-11-19T14:00:50Z) - ConStruct-VL: Data-Free Continual Structured VL Concepts Learning [57.86651057895222]
We introduce the first Continual Data-Free Structured VL Concepts Learning (ConStruct-VL) benchmark.
We propose a data-free method built around a new Adversarial Pseudo-Replay (APR) approach, which generates adversarial reminders of past tasks from past task models.
We show this approach outperforms all data-free methods by as much as 7% while even matching some levels of experience-replay.
arXiv Detail & Related papers (2022-11-17T18:57:03Z) - e-ViL: A Dataset and Benchmark for Natural Language Explanations in
Vision-Language Tasks [52.918087305406296]
We introduce e-ViL, a benchmark for evaluating explainable vision-language tasks.
We also introduce e-SNLI-VE, the largest existing dataset with NLEs.
We propose a new model that combines UNITER, which learns joint embeddings of images and text, and GPT-2, a pre-trained language model.
arXiv Detail & Related papers (2021-05-08T18:46:33Z) - VinVL: Revisiting Visual Representations in Vision-Language Models [96.39332942534368]
We develop an improved object detection model to provide object-centric representations of images.
New visual features significantly improve the performance across all vision language (VL) tasks.
We will release the new object detection model to the public.
arXiv Detail & Related papers (2021-01-02T23:35:27Z)