Related papers: GRILL: Grounded Vision-language Pre-training via Aligning Text and Image Regions

GRILL: Grounded Vision-language Pre-training via Aligning Text and Image Regions

URL: http://arxiv.org/abs/2305.14676v1
Date: Wed, 24 May 2023 03:33:21 GMT
Title: GRILL: Grounded Vision-language Pre-training via Aligning Text and Image Regions
Authors: Woojeong Jin, Subhabrata Mukherjee, Yu Cheng, Yelong Shen, Weizhu Chen, Ahmed Hassan Awadallah, Damien Jose, Xiang Ren
Abstract summary: Generalization to unseen tasks is an important ability for few-shot learners to achieve better zero-/few-shot performance on diverse tasks. We introduce GRILL, a novel VL model that can be generalized to diverse tasks including visual question answering, captioning, and grounding tasks with no or very few training instances.
Score: 92.96783800362886
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Generalization to unseen tasks is an important ability for few-shot learners to achieve better zero-/few-shot performance on diverse tasks. However, such generalization to vision-language tasks including grounding and generation tasks has been under-explored; existing few-shot VL models struggle to handle tasks that involve object grounding and multiple images such as visual commonsense reasoning or NLVR2. In this paper, we introduce GRILL, GRounded vIsion Language aLigning, a novel VL model that can be generalized to diverse tasks including visual question answering, captioning, and grounding tasks with no or very few training instances. Specifically, GRILL learns object grounding and localization by exploiting object-text alignments, which enables it to transfer to grounding tasks in a zero-/few-shot fashion. We evaluate our model on various zero-/few-shot VL tasks and show that it consistently surpasses the state-of-the-art few-shot methods.

Related papers

GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing [33.19843463374473]
Vision-Language Models (VLMs) in remote sensing have demonstrated significant potential in traditional tasks. Current models, which excel in Referring Expression (REC), struggle with tasks involving complex instructions. We introduce the Remote Sensing Vision-Language Task Set (RSVLTS), which includes Open-Vocabulary Tasks (OVT), Referring Expression Tasks (RET), and Described Object Tasks (DOT) We propose a novel unified data representation using a set-of-points approach for RSVLTS, along with a condition and a self-augmentation strategy based on cyclic referring.
arXiv Detail & Related papers (2025-03-16T12:48:17Z)
Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization [77.36122979882649]
Computer Vision (CV) has yet to fully achieve the zero-shot task generalization observed in Natural Language Processing (NLP) In this paper, we explore the idea that CV adopts discrete and terminological task definitions, which may be a key barrier to zero-shot task generalization. Our hypothesis is that without truly understanding previously-seen tasks--due to these terminological definitions--deep models struggle to generalize to novel tasks.
arXiv Detail & Related papers (2024-12-24T16:08:25Z)
Bridging Environments and Language with Rendering Functions and Vision-Language Models [7.704773649029078]
Vision-language models (VLMs) have tremendous potential for grounding language. This paper introduces a novel decomposition of the problem of building language-conditioned agents (LCAs) We also explore several enhancements to the speed and quality of VLM-based LCAs.
arXiv Detail & Related papers (2024-09-24T12:24:07Z)
OLIVE: Object Level In-Context Visual Embeddings [8.168219870640318]
We propose a novel method to prompt large language models with in-context visual object vectors. This eliminates the necessity of fusing a lengthy array of image patch features and significantly speeds up training. Our experiments reveal that our method achieves competitive referring object classification and captioning performance.
arXiv Detail & Related papers (2024-06-02T21:36:31Z)
UniverSLU: Universal Spoken Language Understanding for Diverse Tasks with Natural Language Instructions [64.50935101415776]
We build a single model that jointly performs various spoken language understanding (SLU) tasks. We demonstrate the efficacy of our single multi-task learning model "UniverSLU" for 12 speech classification and sequence generation task types spanning 17 datasets and 9 languages.
arXiv Detail & Related papers (2023-10-04T17:10:23Z)
UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding [88.24517460894634]
We propose a unified framework to take advantage of the fine-grained information for zero-shot vision-language learning. Our framework outperforms former zero-shot methods on VQA and achieves substantial improvement on SNLI-VE and VCR.
arXiv Detail & Related papers (2023-07-03T09:03:12Z)
Musketeer: Joint Training for Multi-task Vision Language Model with Task Explanation Prompts [75.75548749888029]
We present a vision-language model whose parameters are jointly trained on all tasks and fully shared among multiple heterogeneous tasks. With a single model, Musketeer achieves results comparable to or better than strong baselines trained on single tasks, almost uniformly across multiple tasks.
arXiv Detail & Related papers (2023-05-11T17:57:49Z)
Prompt Tuning with Soft Context Sharing for Vision-Language Models [42.61889428498378]
We propose a novel method to tune pre-trained vision-language models on multiple target few-shot tasks jointly. We show that SoftCPT significantly outperforms single-task prompt tuning methods.
arXiv Detail & Related papers (2022-08-29T10:19:10Z)
GLIPv2: Unifying Localization and Vision-Language Understanding [161.1770269829139]
We present GLIPv2, a grounded VL understanding model, that serves both localization tasks and Vision-Language (VL) understanding tasks. GLIPv2 unifies localization pre-training and Vision-Language Pre-training with three pre-training tasks. We show that a single GLIPv2 model achieves near SoTA performance on various localization and understanding tasks.
arXiv Detail & Related papers (2022-06-12T20:31:28Z)
Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering [43.07139534653485]
We present Answer-Me, a task-aware multi-task framework. We pre-train a vision-language joint model, which is multi-task as well. Results show state-of-the-art performance, zero-shot generalization, robustness to forgetting, and competitive single-task results.
arXiv Detail & Related papers (2022-05-02T14:53:13Z)
Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation [86.26522210882699]
We propose Unified multimodal pre-training for both Vision-Language understanding and generation. The proposed UniVL is capable of handling both understanding tasks and generative tasks. Our experiments show that there is a trade-off between understanding tasks and generation tasks while using the same model.
arXiv Detail & Related papers (2021-12-10T14:59:06Z)
Unifying Vision-and-Language Tasks via Text Generation [81.3910771082967]
We propose a unified framework that learns different tasks in a single architecture. Our models learn to generate labels in text based on the visual and textual inputs. Our generative approach shows better generalization ability on answering questions that have rare answers.
arXiv Detail & Related papers (2021-02-04T17:59:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.