ConStruct-VL: Data-Free Continual Structured VL Concepts Learning
- URL: http://arxiv.org/abs/2211.09790v2
- Date: Thu, 30 Mar 2023 17:59:16 GMT
- Title: ConStruct-VL: Data-Free Continual Structured VL Concepts Learning
- Authors: James Seale Smith, Paola Cascante-Bonilla, Assaf Arbelle, Donghyun
Kim, Rameswar Panda, David Cox, Diyi Yang, Zsolt Kira, Rogerio Feris, Leonid
Karlinsky
- Abstract summary: We introduce the first Continual Data-Free Structured VL Concepts Learning (ConStruct-VL) benchmark.
We propose a data-free method built around a new approach, Adversarial Pseudo-Replay (APR), which generates adversarial reminders of past tasks from past task models.
We show this approach outperforms all data-free methods by as much as 7% while even matching some levels of experience-replay.
- Score: 57.86651057895222
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, large-scale pre-trained Vision-and-Language (VL) foundation models
have demonstrated remarkable capabilities in many zero-shot downstream tasks,
achieving competitive results for recognizing objects defined by as little as a
short text prompt. However, it has also been shown that VL models are still
brittle in Structured VL Concept (SVLC) reasoning, such as the ability to
recognize object attributes, states, and inter-object relations. This leads to
reasoning mistakes, which need to be corrected as they occur by teaching VL
models the missing SVLC skills; often this must be done using private data
where the issue was found, which naturally leads to a data-free continual (no
task-id) VL learning setting. In this work, we introduce the first Continual
Data-Free Structured VL Concepts Learning (ConStruct-VL) benchmark and show it
is challenging for many existing data-free CL strategies. We therefore
propose a data-free method built around a new approach, Adversarial
Pseudo-Replay (APR), which generates adversarial reminders of past tasks from
past task models. To use this method efficiently, we also propose a continual
parameter-efficient Layered-LoRA (LaLo) neural architecture allowing
no-memory-cost access to all past models at train time. We show this approach
outperforms all data-free methods by as much as ~7% while even matching some
levels of experience-replay (prohibitive for applications where data-privacy
must be preserved). Our code is publicly available at
https://github.com/jamessealesmith/ConStruct-VL
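The abstract names two components: Adversarial Pseudo-Replay (APR), which mines adversarial "reminders" of past tasks from frozen past-task models, and the Layered-LoRA (LaLo) architecture, whose stacked low-rank adapters make every past-task model recoverable at train time without storing extra copies. The sketch below is a rough, hypothetical PyTorch rendering of these two ideas; the class and function names (LayeredLoRALinear, apr_reminder_loss, match_score), the embedding-space attack, and all hyper-parameters are illustrative assumptions rather than the authors' released implementation (see the repository above for that).

```python
import torch
import torch.nn.functional as F


class LayeredLoRALinear(torch.nn.Module):
    """Frozen base linear layer plus one low-rank adapter per task.
    Running the forward pass with only the first k adapters recovers the model
    as it stood after task k, so past models cost no extra memory to access."""

    def __init__(self, base: torch.nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base.requires_grad_(False)   # shared, frozen pre-trained weights
        self.rank = rank
        self.adapters = torch.nn.ModuleList()

    def add_task(self):
        down = torch.nn.Linear(self.base.in_features, self.rank, bias=False)
        up = torch.nn.Linear(self.rank, self.base.out_features, bias=False)
        torch.nn.init.zeros_(up.weight)          # new adapter starts as a no-op
        self.adapters.append(torch.nn.Sequential(down, up))

    def forward(self, x, up_to_task=None):
        k = len(self.adapters) if up_to_task is None else up_to_task
        out = self.base(x)
        for adapter in self.adapters[:k]:        # compose adapters of tasks 1..k
            out = out + adapter(x)
        return out


def apr_reminder_loss(current_model, past_model, images, text_emb,
                      steps=3, step_size=5e-3, eps=2e-2):
    """Adversarial Pseudo-Replay, loosely interpreted: perturb current-task text
    embeddings so that the frozen past-task model's image-text matching score
    drops, then distill the past model's responses on these 'reminders'."""
    delta = torch.zeros_like(text_emb, requires_grad=True)
    for _ in range(steps):
        # match_score is a hypothetical similarity head of the VL model
        score = past_model.match_score(images, text_emb + delta)
        grad, = torch.autograd.grad(score.mean(), delta)
        delta = (delta - step_size * grad.sign()).clamp(-eps, eps)
        delta = delta.detach().requires_grad_(True)
    adv_text = (text_emb + delta).detach()
    with torch.no_grad():
        teacher = past_model.match_score(images, adv_text)   # past-task behavior
    student = current_model.match_score(images, adv_text)
    return F.mse_loss(student, teacher)                      # pseudo-replay distillation
```

Under this reading, training task k would optimize only the k-th adapter, combining the current-task objective with apr_reminder_loss evaluated against each earlier up_to_task configuration.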
Related papers
- Vision Language Models are In-Context Value Learners [89.29486557646624]
We present Generative Value Learning (GVL), a universal value function estimator that leverages the world knowledge embedded in vision-language models (VLMs) to predict task progress.
Without any robot- or task-specific training, GVL can predict effective values in-context, zero-shot and few-shot, for more than 300 distinct real-world tasks.
arXiv Detail & Related papers (2024-11-07T09:17:50Z)
- Membership Inference Attacks against Large Vision-Language Models [40.996912464828696]
Large vision-language models (VLLMs) exhibit promising capabilities for processing multi-modal tasks across various application scenarios.
Their emergence also raises significant data security concerns, given the potential inclusion of sensitive information, such as private photos and medical records.
Detecting inappropriately used data in VLLMs remains a critical and unresolved issue.
arXiv Detail & Related papers (2024-11-05T08:35:08Z)
- CTP: Towards Vision-Language Continual Pretraining via Compatible Momentum Contrast and Topology Preservation [128.00940554196976]
Vision-Language Continual Pretraining (VLCP) has shown impressive results on diverse downstream tasks by offline training on large-scale datasets.
To support the study of VLCP, we first contribute a comprehensive and unified benchmark dataset, P9D.
The data from each industry is treated as an independent task, which supports continual learning and follows the real-world long-tail distribution to simulate pretraining on web data.
arXiv Detail & Related papers (2023-08-14T13:53:18Z)
- Retrieval-Enhanced Contrastive Vision-Text Models [61.783728119255365]
We propose to equip vision-text models with the ability to refine their embedding with cross-modal retrieved information from a memory at inference time.
Remarkably, we show that this can be done with a light-weight, single-layer, fusion transformer on top of a frozen CLIP.
Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks.
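As a rough, non-authoritative illustration of that description, refining a frozen-encoder embedding with top-k items retrieved from a memory through a single fusion layer could look like the sketch below; the memory layout, dimensions, and hyper-parameters are assumptions, not the paper's actual design.

```python
import torch


class RetrievalFusion(torch.nn.Module):
    """Hypothetical sketch: refine a frozen-CLIP embedding with cross-modal
    items retrieved from a memory bank, via a single-layer fusion transformer."""

    def __init__(self, dim: int = 512, k: int = 8):
        super().__init__()
        self.k = k
        self.fusion = torch.nn.TransformerEncoderLayer(
            d_model=dim, nhead=8, batch_first=True)

    def forward(self, query_emb, memory_emb):
        # query_emb: (B, D) from a frozen CLIP encoder; memory_emb: (N, D) memory bank
        sims = query_emb @ memory_emb.T                 # (B, N) similarity scores
        idx = sims.topk(self.k, dim=-1).indices         # (B, k) nearest memory entries
        retrieved = memory_emb[idx]                     # (B, k, D) gathered neighbors
        tokens = torch.cat([query_emb.unsqueeze(1), retrieved], dim=1)
        return self.fusion(tokens)[:, 0]                # refined query embedding (B, D)
```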
arXiv Detail & Related papers (2023-06-12T15:52:02Z)
- Going Beyond Nouns With Vision & Language Models Using Synthetic Data [43.87754926411406]
Large-scale pre-trained Vision & Language (VL) models have shown remarkable performance in many applications.
Recent works have uncovered a fundamental weakness of these models.
We investigate to what extent purely synthetic data could be leveraged to teach these models to overcome such shortcomings.
arXiv Detail & Related papers (2023-03-30T17:57:43Z)
- Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles [83.41551911845157]
This paper focuses on analyzing and improving the commonsense ability of recent popular vision-language (VL) models.
We propose a more scalable strategy, i.e., "Data Augmentation with kNowledge graph linearization for CommonsensE capability" (DANCE).
For better commonsense evaluation, we propose the first retrieval-based commonsense diagnostic benchmark.
arXiv Detail & Related papers (2022-11-29T18:59:59Z)
- Teaching Structured Vision&Language Concepts to Vision&Language Models [46.344585368641006]
We introduce the collective notion of Structured Vision&Language Concepts (SVLC).
SVLC includes object attributes, relations, and states which are present in the text and visible in the image.
We propose a more elegant data-driven approach for enhancing VL models' understanding of SVLCs.
arXiv Detail & Related papers (2022-11-21T18:54:10Z)
- Task Residual for Tuning Vision-Language Models [69.22958802711017]
We propose a new, efficient tuning approach for vision-language models (VLMs) named Task Residual Tuning (TaskRes).
TaskRes explicitly decouples the prior knowledge of the pre-trained models and new knowledge regarding a target task.
The proposed TaskRes is simple yet effective, significantly outperforming previous methods on 11 benchmark datasets.
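Read literally, that decoupling suggests keeping the frozen text-encoder class embeddings as the prior classifier and learning only an additive residual on top; the sketch below is a minimal guess at such a classifier, with the scaling factor and normalization chosen purely for illustration.

```python
import torch


class TaskResidualClassifier(torch.nn.Module):
    """Hypothetical sketch: frozen text-based class weights (prior knowledge)
    plus a small learnable residual (new, task-specific knowledge)."""

    def __init__(self, text_weights: torch.Tensor, alpha: float = 0.5):
        super().__init__()
        self.register_buffer("prior", text_weights)                  # (C, D), frozen
        self.residual = torch.nn.Parameter(torch.zeros_like(text_weights))
        self.alpha = alpha

    def forward(self, image_features):                               # (B, D)
        weights = self.prior + self.alpha * self.residual            # decoupled knowledge
        weights = weights / weights.norm(dim=-1, keepdim=True)
        return image_features @ weights.T                            # logits (B, C)
```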
arXiv Detail & Related papers (2022-11-18T15:09:03Z)