Improving Commonsense in Vision-Language Models via Knowledge Graph
Riddles
- URL: http://arxiv.org/abs/2211.16504v1
- Date: Tue, 29 Nov 2022 18:59:59 GMT
- Title: Improving Commonsense in Vision-Language Models via Knowledge Graph
Riddles
- Authors: Shuquan Ye and Yujia Xie and Dongdong Chen and Yichong Xu and Lu Yuan
and Chenguang Zhu and Jing Liao
- Abstract summary: This paper focuses on analyzing and improving the commonsense ability of recent popular vision-language (VL) models.
We propose a more scalable strategy, i.e., "Data Augmentation with kNowledge graph linearization for CommonsensE capability" (DANCE).
For better commonsense evaluation, we propose the first retrieval-based commonsense diagnostic benchmark.
- Score: 83.41551911845157
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper focuses on analyzing and improving the commonsense ability of
recent popular vision-language (VL) models. Despite the great success, we
observe that existing VL-models still lack commonsense knowledge/reasoning
ability (e.g., "Lemons are sour"), which is a vital component towards
artificial general intelligence. Through our analysis, we find one important
reason is that existing large-scale VL datasets do not contain much commonsense
knowledge, which motivates us to improve the commonsense of VL-models from the
data perspective. Rather than collecting a new VL training dataset, we propose
a more scalable strategy, i.e., "Data Augmentation with kNowledge graph
linearization for CommonsensE capability" (DANCE). It can be viewed as one type
of data augmentation technique, which can inject commonsense knowledge into
existing VL datasets on the fly during training. More specifically, we leverage
the commonsense knowledge graph (e.g., ConceptNet) and create variants of text
description in VL datasets via bidirectional sub-graph sequentialization. For
better commonsense evaluation, we further propose the first retrieval-based
commonsense diagnostic benchmark. By conducting extensive experiments on some
representative VL-models, we demonstrate that our DANCE technique is able to
significantly improve the commonsense ability while maintaining the performance
on vanilla retrieval tasks. The code and data are available at
https://github.com/pleaseconnectwifi/DANCE
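
To make the augmentation idea concrete, here is a minimal Python sketch of injecting linearized knowledge-graph triples into captions on the fly. The toy triples, the relation templates, and the `augment_caption` helper are illustrative assumptions for this sketch, not code or naming taken from the paper or its released repository.

```python
import random

# Toy ConceptNet-style sub-graph: (head, relation, tail) triples.
# In a real pipeline these would be retrieved for entities mentioned
# in the image-text pair; here they are hard-coded examples.
COMMONSENSE_TRIPLES = {
    "lemon": [("lemon", "HasProperty", "sour"),
              ("lemon", "IsA", "citrus fruit")],
    "dog":   [("dog", "CapableOf", "bark"),
              ("dog", "IsA", "loyal pet")],
}

# Simple relation templates used to "linearize" a triple into text.
# The reverse templates hide the head entity, loosely mirroring the
# bidirectional sub-graph sequentialization described in the abstract.
FORWARD_TEMPLATES = {
    "HasProperty": "{head} is {tail}.",
    "IsA": "{head} is a kind of {tail}.",
    "CapableOf": "{head} can {tail}.",
}
REVERSE_TEMPLATES = {
    "HasProperty": "The thing in the picture is {tail}.",
    "IsA": "The thing in the picture is a kind of {tail}.",
    "CapableOf": "The thing in the picture can {tail}.",
}


def augment_caption(caption: str) -> str:
    """Append one linearized commonsense fact to a caption, on the fly."""
    mentioned = [e for e in COMMONSENSE_TRIPLES if e in caption.lower()]
    if not mentioned:
        return caption  # no matching entity, leave the caption unchanged
    head, rel, tail = random.choice(COMMONSENSE_TRIPLES[random.choice(mentioned)])
    template = random.choice([FORWARD_TEMPLATES, REVERSE_TEMPLATES])[rel]
    return f"{caption} {template.format(head=head, tail=tail)}"


if __name__ == "__main__":
    print(augment_caption("A lemon on a wooden table."))
    # e.g. "A lemon on a wooden table. The thing in the picture is sour."
```

Because the augmentation only rewrites the text side of existing image-text pairs, it can be applied to any VL dataset during training without collecting new images.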
Related papers
- What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases [87.65903426052155]
We perform a large-scale transfer learning experiment aimed at discovering latent vision-language skills from data.
We show that generation tasks suffer from a length bias, suggesting benchmarks should balance tasks with varying output lengths.
We present a new dataset, OLIVE, which simulates user instructions in the wild and presents challenges dissimilar to all datasets we tested.
arXiv Detail & Related papers (2024-04-03T02:40:35Z) - Divert More Attention to Vision-Language Object Tracking [87.31882921111048]
We argue that the lack of large-scale vision-language annotated videos and ineffective vision-language interaction learning motivate us to design more effective vision-language representation for tracking.
Particularly, in this paper, we first propose a general attribute annotation strategy to decorate videos in six popular tracking benchmarks, which contributes a large-scale vision-language tracking database with more than 23,000 videos.
We then introduce a novel framework that improves tracking by learning a unified-adaptive VL representation, whose core components are the proposed asymmetric architecture search and modality mixer (ModaMixer).
arXiv Detail & Related papers (2023-07-19T15:22:06Z) - Going Beyond Nouns With Vision & Language Models Using Synthetic Data [43.87754926411406]
Large-scale pre-trained Vision & Language (VL) models have shown remarkable performance in many applications.
Recent works have uncovered a fundamental weakness of these models.
We investigate to what extent purely synthetic data can be leveraged to teach these models to overcome such shortcomings.
arXiv Detail & Related papers (2023-03-30T17:57:43Z) - Teaching Structured Vision&Language Concepts to Vision&Language Models [46.344585368641006]
We introduce the collective notion of Structured Vision&Language Concepts (SVLC).
SVLC includes object attributes, relations, and states which are present in the text and visible in the image.
We propose a more elegant data-driven approach for enhancing VL models' understanding of SVLCs.
arXiv Detail & Related papers (2022-11-21T18:54:10Z) - A survey on knowledge-enhanced multimodal learning [1.8591405259852054]
Multimodal learning has been a field of increasing interest, aiming to combine various modalities in a single joint representation.
Especially in the area of visiolinguistic (VL) learning, multiple models and techniques have been developed, targeting a variety of tasks that involve images and text.
VL models have reached unprecedented performances by extending the idea of Transformers, so that both modalities can learn from each other.
arXiv Detail & Related papers (2022-11-19T14:00:50Z) - ConStruct-VL: Data-Free Continual Structured VL Concepts Learning [57.86651057895222]
We introduce the first Continual Data-Free Structured VL Concepts Learning (ConStruct-VL) benchmark.
We propose a data-free method built around a new Adversarial Pseudo-Replay (APR) approach, which generates adversarial reminders of past tasks from past task models.
We show this approach outperforms all data-free methods by as much as 7% while even matching some levels of experience-replay.
arXiv Detail & Related papers (2022-11-17T18:57:03Z) - e-ViL: A Dataset and Benchmark for Natural Language Explanations in
Vision-Language Tasks [52.918087305406296]
We introduce e-ViL, a benchmark for evaluating explainable vision-language tasks.
We also introduce e-SNLI-VE, the largest existing dataset with NLEs.
We propose a new model that combines UNITER, which learns joint embeddings of images and text, and GPT-2, a pre-trained language model.
arXiv Detail & Related papers (2021-05-08T18:46:33Z) - VinVL: Revisiting Visual Representations in Vision-Language Models [96.39332942534368]
We develop an improved object detection model to provide object-centric representations of images.
New visual features significantly improve the performance across all vision language (VL) tasks.
We will release the new object detection model to the public.
arXiv Detail & Related papers (2021-01-02T23:35:27Z)