LLM-wrapper: Black-Box Semantic-Aware Adaptation of Vision-Language Models for Referring Expression Comprehension
- URL: http://arxiv.org/abs/2409.11919v2
- Date: Tue, 15 Oct 2024 14:52:55 GMT
- Title: LLM-wrapper: Black-Box Semantic-Aware Adaptation of Vision-Language Models for Referring Expression Comprehension
- Authors: Amaia Cardiel, Eloi Zablocki, Elias Ramzi, Oriane Siméoni, Matthieu Cord,
- Abstract summary: We propose LLM-wrapper, a method for 'black-box' adaptation of Vision Language Models (VLMs) to the Referring Expression Comprehension (REC) task using Large Language Models (LLMs).
LLM-wrapper capitalizes on the reasoning abilities of LLMs, improved with a light fine-tuning, to select the most relevant bounding box matching the referring expression.
Our approach offers several advantages: it enables the adaptation of closed-source models without needing access to their internal workings.
- Score: 45.856469849910496
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Vision Language Models (VLMs) have demonstrated remarkable capabilities in various open-vocabulary tasks, yet their zero-shot performance lags behind task-specific finetuned models, particularly in complex tasks like Referring Expression Comprehension (REC). Fine-tuning usually requires 'white-box' access to the model's architecture and weights, which is not always feasible due to proprietary or privacy concerns. In this work, we propose LLM-wrapper, a method for 'black-box' adaptation of VLMs for the REC task using Large Language Models (LLMs). LLM-wrapper capitalizes on the reasoning abilities of LLMs, improved with a light fine-tuning, to select the most relevant bounding box matching the referring expression, from candidates generated by a zero-shot black-box VLM. Our approach offers several advantages: it enables the adaptation of closed-source models without needing access to their internal workings, it is versatile as it works with any VLM, it transfers to new VLMs, and it allows for the adaptation of an ensemble of VLMs. We evaluate LLM-wrapper on multiple datasets using different VLMs and LLMs, demonstrating significant performance improvements and highlighting the versatility of our method. While LLM-wrapper is not meant to directly compete with standard white-box fine-tuning, it offers a practical and effective alternative for black-box VLM adaptation. The code will be open-sourced.
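To make the pipeline described in the abstract concrete, below is a minimal, hypothetical Python sketch of the LLM-wrapper selection step: a black-box VLM proposes candidate boxes, the candidates are serialized into a text prompt, and an LLM (optionally lightly fine-tuned) is asked to return the index of the box that best matches the referring expression. All names here (Candidate, build_prompt, select_box, llm_complete) are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of the LLM-wrapper idea:
# 1) a black-box VLM proposes candidate boxes for an image/expression pair,
# 2) the candidates are serialized into a textual prompt,
# 3) an LLM is asked to pick the index of the best-matching box.

from dataclasses import dataclass
from typing import Callable, List
import re

@dataclass
class Candidate:
    box: tuple    # (x1, y1, x2, y2) in pixels, as returned by the VLM
    label: str    # phrase/category predicted by the VLM
    score: float  # VLM confidence

def build_prompt(expression: str, candidates: List[Candidate]) -> str:
    """Serialize the VLM detections into a text prompt for the LLM."""
    lines = [
        "You are given candidate boxes detected in an image and a referring expression.",
        f'Referring expression: "{expression}"',
        "Candidates:",
    ]
    for i, c in enumerate(candidates):
        x1, y1, x2, y2 = c.box
        lines.append(f"  [{i}] label={c.label}, box=({x1},{y1},{x2},{y2}), score={c.score:.2f}")
    lines.append("Answer with the index of the single best-matching candidate.")
    return "\n".join(lines)

def select_box(expression: str,
               candidates: List[Candidate],
               llm_complete: Callable[[str], str]) -> Candidate:
    """Ask the LLM to choose among VLM candidates; fall back to the top-scoring box."""
    answer = llm_complete(build_prompt(expression, candidates))
    match = re.search(r"\d+", answer)
    if match and int(match.group()) < len(candidates):
        return candidates[int(match.group())]
    return max(candidates, key=lambda c: c.score)

# Toy usage with a stub standing in for a real (possibly fine-tuned) LLM endpoint.
if __name__ == "__main__":
    cands = [
        Candidate((10, 20, 80, 200), "person", 0.91),
        Candidate((150, 30, 260, 210), "person in red shirt", 0.74),
    ]
    stub_llm = lambda prompt: "1"  # pretend the LLM picked candidate [1]
    print(select_box("the person wearing a red shirt", cands, stub_llm).label)
```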
Related papers
- Matryoshka: Learning to Drive Black-Box LLMs with LLMs [31.501244808646]
Matryoshka is a lightweight white-box LLM controller.
It guides a large-scale black-box LLM generator by decomposing complex tasks into a series of intermediate outputs.
arXiv Detail & Related papers (2024-10-28T05:28:51Z) - GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models [44.82179903133343]
Large Language Models (LLMs) act as implicit optimizers for Vision-Language Models (VLMs).
Our GLOV meta-prompts an LLM with the downstream task description, querying it for suitable VLM prompts.
We evaluate our GLOV on 16 diverse datasets using two families of VLMs.
arXiv Detail & Related papers (2024-10-08T15:55:40Z) - SoupLM: Model Integration in Large Language and Multi-Modal Models [51.12227693121004]
Training large language models (LLMs) requires significant computing resources.
Existing publicly available LLMs are typically pre-trained on diverse, privately curated datasets spanning various tasks.
arXiv Detail & Related papers (2024-07-11T05:38:15Z) - Parrot: Efficient Serving of LLM-based Applications with Semantic Variable [11.894203842968745]
Parrot is a service system that focuses on the end-to-end experience of LLM-based applications.
A Semantic Variable annotates an input/output variable in the prompt of a request, and creates the data pipeline when connecting multiple LLM requests.
arXiv Detail & Related papers (2024-05-30T09:46:36Z) - Dense Connector for MLLMs [89.50595155217108]
We introduce the Dense Connector - a plug-and-play vision-language connector that significantly enhances existing MLLMs.
Our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well.
arXiv Detail & Related papers (2024-05-22T16:25:03Z) - CLAMP: Contrastive LAnguage Model Prompt-tuning [89.96914454453791]
We show that large language models can achieve good image classification performance when adapted with contrastive prompt-tuning.
Our approach beats state-of-the-art mLLMs by 13% and slightly outperforms contrastive learning with a custom text model.
arXiv Detail & Related papers (2023-12-04T05:13:59Z) - InfMLLM: A Unified Framework for Visual-Language Tasks [44.29407348046122]
Multimodal large language models (MLLMs) have attracted growing interest.
This work delves into enabling LLMs to tackle more vision-language-related tasks.
InfMLLM achieves either state-of-the-art (SOTA) performance or performance comparable to recent MLLMs.
arXiv Detail & Related papers (2023-11-12T09:58:16Z) - Language Models as Black-Box Optimizers for Vision-Language Models [62.80817942316398]
Vision-language models (VLMs) pre-trained on web-scale datasets have demonstrated remarkable capabilities on downstream tasks when fine-tuned with minimal data.
We aim to develop a black-box approach to optimize VLMs through natural language prompts.
arXiv Detail & Related papers (2023-09-12T04:03:41Z) - Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback [127.75419038610455]
Large language models (LLMs) are able to generate human-like, fluent responses for many downstream tasks.
This paper proposes LLM-Augmenter, a system that augments a black-box LLM with a set of plug-and-play modules.
arXiv Detail & Related papers (2023-02-24T18:48:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.