Visually Grounded Continual Learning of Compositional Phrases
- URL: http://arxiv.org/abs/2005.00785v5
- Date: Tue, 17 Nov 2020 03:12:11 GMT
- Title: Visually Grounded Continual Learning of Compositional Phrases
- Authors: Xisen Jin, Junyi Du, Arka Sadhu, Ram Nevatia, Xiang Ren
- Abstract summary: VisCOLL simulates the continual acquisition of compositional phrases from streaming visual scenes.
Models are trained on a paired image-caption stream with a shifting object distribution.
They are constantly evaluated on a visually-grounded masked language prediction task over held-out test sets.
- Score: 45.60521849859337
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans acquire language continually, with access to far fewer data
samples at a time than contemporary NLP systems. To study this
human-like language acquisition ability, we present VisCOLL, a visually
grounded language learning task, which simulates the continual acquisition of
compositional phrases from streaming visual scenes. In the task, models are
trained on a paired image-caption stream with a shifting object
distribution, while being continually evaluated on a visually-grounded masked
language prediction task on held-out test sets. VisCOLL compounds the
challenges of continual learning (i.e., learning from continuously shifting
data distribution) and compositional generalization (i.e., generalizing to
novel compositions). To facilitate research on VisCOLL, we construct two
datasets, COCO-shift and Flickr-shift, and benchmark them using different
continual learning methods. Results reveal that SoTA continual learning
approaches provide little to no improvement on VisCOLL, since storing examples
of all possible compositions is infeasible. We conduct further ablations and
analysis to guide future work.
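To make the task setup concrete, here is a minimal sketch of a VisCOLL-style training loop: image-caption pairs arrive from a non-stationary stream, a phrase in each caption is masked, and the model is trained to predict the masked tokens conditioned on the image. The toy vocabulary, random "image" features, tiny fusion model, and all dimensions are assumptions for illustration only; they are not the datasets or architectures used in the paper.

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy vocabulary standing in for a real tokenizer (assumption for illustration).
VOCAB = ["<mask>", "a", "red", "green", "apple", "car"]
STOI = {w: i for i, w in enumerate(VOCAB)}
MASK = STOI["<mask>"]

class GroundedMaskedLM(nn.Module):
    """Predicts masked caption tokens conditioned on an image feature vector."""
    def __init__(self, dim=32, img_dim=64):
        super().__init__()
        self.embed = nn.Embedding(len(VOCAB), dim)
        self.img_proj = nn.Linear(img_dim, dim)   # 64-d vector stands in for CNN features
        self.out = nn.Linear(dim, len(VOCAB))

    def forward(self, tokens, img_feat):
        h = self.embed(tokens) + self.img_proj(img_feat).unsqueeze(1)
        return self.out(h)                        # (batch, seq_len, vocab) logits

def stream(steps=200):
    """Non-stationary image-caption stream: the dominant object shifts over time."""
    for t in range(steps):
        obj = "apple" if t < steps // 2 else "car"   # object distribution shift
        caption = ["a", random.choice(["red", "green"]), obj]
        yield torch.tensor([[STOI[w] for w in caption]]), torch.randn(1, 64)

model = GroundedMaskedLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for tokens, img in stream():
    span = (1, 3)                                 # mask the "adjective + noun" phrase
    masked = tokens.clone()
    masked[:, span[0]:span[1]] = MASK
    logits = model(masked, img)
    loss = F.cross_entropy(
        logits[:, span[0]:span[1]].reshape(-1, len(VOCAB)),
        tokens[:, span[0]:span[1]].reshape(-1),
    )
    opt.zero_grad()
    loss.backward()
    opt.step()                                    # single pass; earlier data is never revisited
```

The single pass and the mid-stream switch of the dominant object mimic, in miniature, the shifting object distribution of COCO-shift and Flickr-shift; evaluating such a model on held-out adjective-noun compositions is what probes compositional generalization.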
Related papers
- In-context Prompt Learning for Test-time Vision Recognition with Frozen Vision-language Model [13.983810804606264]
We propose In-Context Prompt Learning (InCPL) for test-time visual recognition tasks.
InCPL associates a new test sample with very few labeled examples as context information.
We introduce a context-aware unsupervised loss to optimize visual prompts tailored to test samples.
arXiv Detail & Related papers (2024-03-10T08:15:51Z)
- SignVTCL: Multi-Modal Continuous Sign Language Recognition Enhanced by Visual-Textual Contrastive Learning [51.800031281177105]
SignVTCL is a continuous sign language recognition framework enhanced by visual-textual contrastive learning.
It integrates multi-modal data (video, keypoints, and optical flow) simultaneously to train a unified visual backbone.
It achieves state-of-the-art results compared with previous methods.
arXiv Detail & Related papers (2024-01-22T11:04:55Z)
- Continual Contrastive Spoken Language Understanding [33.09005399967931]
COCONUT is a class-incremental learning (CIL) method that relies on the combination of experience replay and contrastive learning.
We show that COCONUT can be combined with methods that operate on the decoder side of the model, resulting in further metrics improvements.
arXiv Detail & Related papers (2023-10-04T10:09:12Z)
- SINC: Self-Supervised In-Context Learning for Vision-Language Tasks [64.44336003123102]
We propose a framework to enable in-context learning in large language models.
A meta-model can learn on self-supervised prompts consisting of tailored demonstrations.
Experiments show that SINC outperforms gradient-based methods in various vision-language tasks.
arXiv Detail & Related papers (2023-07-15T08:33:08Z)
- OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network [17.980765138522322]
This work introduces OmDet, a novel language-aware object detection architecture.
Leveraging natural language as a universal knowledge representation, OmDet accumulates a "visual vocabulary" from diverse datasets.
We demonstrate superior performance of OmDet over strong baselines in object detection in the wild, open-vocabulary detection, and phrase grounding.
arXiv Detail & Related papers (2022-09-10T14:25:14Z)
- IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z)
- Towards Lifelong Learning of Multilingual Text-To-Speech Synthesis [87.75833205560406]
This work presents a lifelong learning approach to train a multilingual Text-To-Speech (TTS) system.
It does not require pooling data from all languages, which alleviates the storage and computation burden.
arXiv Detail & Related papers (2021-10-09T07:00:38Z)
- VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer [76.3906723777229]
We present VidLanKD, a video-language knowledge distillation method for improving language understanding.
We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset.
In our experiments, VidLanKD achieves consistent improvements over text-only language models and vokenization models.
arXiv Detail & Related papers (2021-07-06T15:41:32Z)
- Meta-Learning with Sparse Experience Replay for Lifelong Language Learning [26.296412053816233]
We propose a novel approach to lifelong learning of language tasks based on meta-learning with sparse experience replay (a minimal replay-buffer sketch follows this list).
We show that under the realistic setting of performing a single pass on a stream of tasks, our method obtains state-of-the-art results on lifelong text classification and relation extraction.
arXiv Detail & Related papers (2020-09-10T14:36:38Z)
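Several entries above (COCONUT and the sparse-experience-replay approach) rely on experience replay, and the VisCOLL results hinge on the fact that no buffer can store every composition. Below is a minimal reservoir-sampling replay buffer, one common way such methods keep a bounded, roughly uniform sample of a stream; the class name, capacity, and usage loop are illustrative assumptions rather than any specific paper's implementation.

```python
import random

class ReplayBuffer:
    """Fixed-capacity buffer; reservoir sampling keeps a roughly uniform sample of the stream."""
    def __init__(self, capacity=1000, seed=0):
        self.capacity = capacity
        self.buffer = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(example)
        else:
            j = self.rng.randrange(self.seen)     # keep with probability capacity / seen
            if j < self.capacity:
                self.buffer[j] = example

    def sample(self, k):
        return self.rng.sample(self.buffer, min(k, len(self.buffer)))

# Illustrative usage: mix replayed examples into each step of the stream.
buf = ReplayBuffer(capacity=4)
for example in ["a red apple", "a green car", "a dog on a table"] * 3:
    buf.add(example)
    batch = [example] + buf.sample(2)             # current example + replayed ones
```

Because the buffer is bounded, rare or novel compositions are unlikely to remain in it, which is in line with the VisCOLL finding that replay-style continual learning methods yield little improvement when examples of all possible compositions cannot be stored.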
This list is automatically generated from the titles and abstracts of the papers in this site.