GRILL: Grounded Vision-language Pre-training via Aligning Text and Image Regions
- URL: http://arxiv.org/abs/2305.14676v1
- Date: Wed, 24 May 2023 03:33:21 GMT
- Title: GRILL: Grounded Vision-language Pre-training via Aligning Text and Image Regions
- Authors: Woojeong Jin, Subhabrata Mukherjee, Yu Cheng, Yelong Shen, Weizhu
Chen, Ahmed Hassan Awadallah, Damien Jose, Xiang Ren
- Abstract summary: Generalization to unseen tasks is an important ability for few-shot learners to achieve better zero-/few-shot performance on diverse tasks.
We introduce GRILL, a novel VL model that can be generalized to diverse tasks including visual question answering, captioning, and grounding tasks with no or very few training instances.
- Score: 92.96783800362886
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generalization to unseen tasks is an important ability for few-shot learners
to achieve better zero-/few-shot performance on diverse tasks. However, such
generalization to vision-language tasks including grounding and generation
tasks has been under-explored; existing few-shot VL models struggle to handle
tasks that involve object grounding and multiple images such as visual
commonsense reasoning or NLVR2. In this paper, we introduce GRILL, GRounded
vIsion Language aLigning, a novel VL model that can be generalized to diverse
tasks including visual question answering, captioning, and grounding tasks with
no or very few training instances. Specifically, GRILL learns object grounding
and localization by exploiting object-text alignments, which enables it to
transfer to grounding tasks in a zero-/few-shot fashion. We evaluate our model
on various zero-/few-shot VL tasks and show that it consistently surpasses the
state-of-the-art few-shot methods.
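The abstract says GRILL learns grounding and localization by exploiting object-text alignments, but gives no implementation detail here. Purely as an illustration, the following is a minimal sketch of one common way region-text alignment is scored: a symmetric contrastive loss between pooled region features and phrase embeddings. The function name, tensor shapes, and temperature are assumptions made for this sketch, not GRILL's actual code.

    # Hypothetical sketch of region-text alignment (NOT GRILL's released code).
    # Assumes region features from a detector and phrase embeddings from a text
    # encoder; matched pairs are pulled together with an InfoNCE-style loss.
    import torch
    import torch.nn.functional as F

    def region_text_alignment_loss(region_feats: torch.Tensor,
                                   phrase_feats: torch.Tensor,
                                   temperature: float = 0.07) -> torch.Tensor:
        """region_feats: (N, d) pooled features for N image regions.
        phrase_feats: (N, d) embeddings for the N phrases paired with those regions.
        Returns a scalar contrastive loss that pulls matched region-phrase pairs
        together and pushes mismatched pairs apart."""
        r = F.normalize(region_feats, dim=-1)
        t = F.normalize(phrase_feats, dim=-1)
        logits = r @ t.t() / temperature          # (N, N) similarity matrix
        targets = torch.arange(r.size(0), device=r.device)
        loss_r2t = F.cross_entropy(logits, targets)      # region -> phrase
        loss_t2r = F.cross_entropy(logits.t(), targets)  # phrase -> region
        return 0.5 * (loss_r2t + loss_t2r)

    # Toy usage with random tensors standing in for detector / text-encoder outputs.
    regions = torch.randn(8, 256)
    phrases = torch.randn(8, 256)
    print(region_text_alignment_loss(regions, phrases).item())

A loss of this shape only illustrates the general idea of tying regions to text; the paper's own objectives and architecture may differ.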
Related papers
- Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization [77.36122979882649]
Computer Vision (CV) has yet to fully achieve the zero-shot task generalization observed in Natural Language Processing (NLP).
In this paper, we explore the idea that CV adopts discrete and terminological task definitions, which may be a key barrier to zero-shot task generalization.
Our hypothesis is that without truly understanding previously-seen tasks--due to these terminological definitions--deep models struggle to generalize to novel tasks.
arXiv Detail & Related papers (2024-12-24T16:08:25Z)
- From Goal-Conditioned to Language-Conditioned Agents via Vision-Language Models [7.704773649029078]
Vision-language models (VLMs) have tremendous potential for grounding language.
This paper introduces a novel decomposition of the problem of building language-conditioned agents (LCAs).
We also explore several enhancements to the speed and quality of VLM-based LCAs.
arXiv Detail & Related papers (2024-09-24T12:24:07Z)
- OLIVE: Object Level In-Context Visual Embeddings [8.168219870640318]
We propose a novel method to prompt large language models with in-context visual object vectors.
This eliminates the necessity of fusing a lengthy array of image patch features and significantly speeds up training.
Our experiments reveal that our method achieves competitive referring object classification and captioning performance.
arXiv Detail & Related papers (2024-06-02T21:36:31Z)
- UniverSLU: Universal Spoken Language Understanding for Diverse Tasks with Natural Language Instructions [64.50935101415776]
We build a single model that jointly performs various spoken language understanding (SLU) tasks.
We demonstrate the efficacy of our single multi-task learning model "UniverSLU" for 12 speech classification and sequence generation task types spanning 17 datasets and 9 languages.
arXiv Detail & Related papers (2023-10-04T17:10:23Z)
- Prompt Tuning with Soft Context Sharing for Vision-Language Models [42.61889428498378]
We propose SoftCPT, a novel method to jointly tune pre-trained vision-language models on multiple target few-shot tasks.
We show that SoftCPT significantly outperforms single-task prompt tuning methods.
arXiv Detail & Related papers (2022-08-29T10:19:10Z)
- GLIPv2: Unifying Localization and Vision-Language Understanding [161.1770269829139]
We present GLIPv2, a grounded VL understanding model, that serves both localization tasks and Vision-Language (VL) understanding tasks.
GLIPv2 unifies localization pre-training and Vision-Language Pre-training with three pre-training tasks.
We show that a single GLIPv2 model achieves near SoTA performance on various localization and understanding tasks.
arXiv Detail & Related papers (2022-06-12T20:31:28Z)
- Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering [43.07139534653485]
We present Answer-Me, a task-aware multi-task framework.
We pre-train a vision-language joint model, which is multi-task as well.
Results show state-of-the-art performance, zero-shot generalization, robustness to forgetting, and competitive single-task results.
arXiv Detail & Related papers (2022-05-02T14:53:13Z)
- Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation [86.26522210882699]
We propose UniVL, a unified multimodal pre-training approach for both vision-language understanding and generation.
The proposed UniVL is capable of handling both understanding tasks and generative tasks.
Our experiments show that there is a trade-off between understanding tasks and generation tasks while using the same model.
arXiv Detail & Related papers (2021-12-10T14:59:06Z)
- Unifying Vision-and-Language Tasks via Text Generation [81.3910771082967]
We propose a unified framework that learns different tasks in a single architecture.
Our models learn to generate labels in text based on the visual and textual inputs.
Our generative approach shows better generalization ability on answering questions that have rare answers.
arXiv Detail & Related papers (2021-02-04T17:59:30Z)
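The last entry above ("Unifying Vision-and-Language Tasks via Text Generation") casts heterogeneous VL tasks as a single text-generation problem, with labels produced as text. As a rough, hypothetical illustration of that formulation (not the cited paper's code), different tasks can share one text-in/text-out interface; the prompts, the region token, and the stub generate function below are invented for this sketch.

    # Hypothetical sketch of "every VL task as text generation" (illustrative only).
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Example:
        image_id: str   # reference to the image (features looked up elsewhere)
        prompt: str     # the task is expressed purely as input text
        target: str     # the label is expressed purely as output text

    def build_examples() -> List[Example]:
        # VQA, captioning, and grounding all share one input/output format.
        return [
            Example("img_001", "vqa: what color is the bus?", "yellow"),
            Example("img_002", "caption:", "a dog chasing a frisbee in a park"),
            Example("img_003", "ground: the man holding an umbrella", "<region_7>"),
        ]

    def generate(example: Example) -> str:
        """Stand-in for a vision-conditioned seq2seq model's decoding step."""
        # A real model would decode tokens conditioned on image features + prompt;
        # here we simply echo the target to show the shared interface.
        return example.target

    for ex in build_examples():
        print(f"{ex.prompt!r} -> {generate(ex)!r}")

The design point is that a single decoder vocabulary (possibly extended with region tokens) lets one model serve answering, captioning, and grounding without task-specific heads.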
This list is automatically generated from the titles and abstracts of the papers on this site.