QueryAdapter: Rapid Adaptation of Vision-Language Models in Response to Natural Language Queries
- URL: http://arxiv.org/abs/2502.18735v1
- Date: Wed, 26 Feb 2025 01:07:28 GMT
- Title: QueryAdapter: Rapid Adaptation of Vision-Language Models in Response to Natural Language Queries
- Authors: Nicolas Harvey Chapman, Feras Dayoub, Will Browne, Christopher Lehnert
- Abstract summary: We present a novel framework for rapidly adapting a pre-trained VLM to respond to a natural language query. We use unlabelled data collected during previous deployments to align VLM features with semantic classes related to the query. We also explore how objects unrelated to the query should be dealt with when using real-world data for adaptation.
- Score: 2.306164598536725
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A domain shift exists between the large-scale, internet data used to train a Vision-Language Model (VLM) and the raw image streams collected by a robot. Existing adaptation strategies require the definition of a closed-set of classes, which is impractical for a robot that must respond to diverse natural language queries. In response, we present QueryAdapter; a novel framework for rapidly adapting a pre-trained VLM in response to a natural language query. QueryAdapter leverages unlabelled data collected during previous deployments to align VLM features with semantic classes related to the query. By optimising learnable prompt tokens and actively selecting objects for training, an adapted model can be produced in a matter of minutes. We also explore how objects unrelated to the query should be dealt with when using real-world data for adaptation. In turn, we propose the use of object captions as negative class labels, helping to produce better calibrated confidence scores during adaptation. Extensive experiments on ScanNet++ demonstrate that QueryAdapter significantly enhances object retrieval performance compared to state-of-the-art unsupervised VLM adapters and 3D scene graph methods. Furthermore, the approach exhibits robust generalization to abstract affordance queries and other datasets, such as Ego4D.
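The abstract sketches a concrete recipe: pseudo-label unlabelled objects from previous deployments against the query, treat captions of query-unrelated objects as negative classes, and optimise a small number of prompt parameters while the VLM itself stays frozen. The PyTorch sketch below illustrates that recipe on precomputed features; it is not the authors' implementation, and the learnable prompt residual, the similarity thresholds, and all function names are simplifying assumptions standing in for the paper's learnable prompt tokens and active selection strategy.

```python
# Minimal sketch (not the authors' code): query-driven adaptation of a frozen
# VLM using pseudo-labelled object features and caption-based negative classes.
# Shapes and thresholds are illustrative assumptions; feature extraction with a
# real VLM (e.g. CLIP) is assumed to happen offline.
import torch
import torch.nn.functional as F

def adapt_to_query(obj_feats, query_text_feat, negative_caption_feats,
                   steps=200, lr=1e-2, tau=0.07, pos_threshold=0.28):
    """obj_feats:              (N, D) L2-normalised features of object crops
                               collected during previous deployments (unlabelled).
       query_text_feat:        (D,)   frozen text embedding of the language query.
       negative_caption_feats: (K, D) frozen text embeddings of captions of
                               objects unrelated to the query (negative classes).
    """
    device = obj_feats.device
    # Learnable prompt residual: a cheap stand-in for learnable prompt tokens.
    # It is added to the query embedding and optimised; the VLM stays frozen.
    prompt = torch.zeros_like(query_text_feat, requires_grad=True)
    opt = torch.optim.Adam([prompt], lr=lr)

    # Active selection: pseudo-label the most query-like crops as positives and
    # the most caption-like crops as negatives, then train only on those.
    with torch.no_grad():
        sim_q = obj_feats @ query_text_feat                      # (N,)
        sim_neg = (obj_feats @ negative_caption_feats.T).max(dim=1).values
        pos_idx = (sim_q > pos_threshold).nonzero(as_tuple=True)[0]
        neg_idx = (sim_neg > sim_q).nonzero(as_tuple=True)[0]

    feats = torch.cat([obj_feats[pos_idx], obj_feats[neg_idx]], dim=0)
    labels = torch.cat([torch.zeros(len(pos_idx), dtype=torch.long, device=device),
                        torch.ones(len(neg_idx), dtype=torch.long, device=device)])

    for _ in range(steps):
        # Class 0 = adapted query embedding, classes 1..K = caption negatives.
        q = F.normalize(query_text_feat + prompt, dim=-1)
        class_embs = torch.cat([q.unsqueeze(0), negative_caption_feats], dim=0)
        logits = feats @ class_embs.T / tau                      # (M, 1+K)
        # Collapse all caption classes into one "not the query" label so the
        # cross-entropy matches the binary pseudo-labels above.
        neg_logit = torch.logsumexp(logits[:, 1:], dim=1, keepdim=True)
        loss = F.cross_entropy(torch.cat([logits[:, :1], neg_logit], dim=1), labels)
        opt.zero_grad(); loss.backward(); opt.step()

    return F.normalize(query_text_feat + prompt.detach(), dim=-1)
```

In this simplified form the only trainable tensor is the prompt residual, which is consistent with the claim that an adapted model can be produced in minutes: the image and text encoders are never updated, and only the actively selected object features participate in the loss.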
Related papers
- QID: Efficient Query-Informed ViTs in Data-Scarce Regimes for OCR-free Visual Document Understanding [53.69841526266547]
Fine-tuning a pre-trained Vision-Language Model with new datasets often falls short in optimizing the vision encoder.
We introduce QID, a novel, streamlined, architecture-preserving approach that integrates query embeddings into the vision encoder.
arXiv Detail & Related papers (2025-04-03T18:47:16Z)
- Semantic Library Adaptation: LoRA Retrieval and Fusion for Open-Vocabulary Semantic Segmentation [72.28364940168092]
Open-vocabulary semantic segmentation models associate vision and text to label pixels from an undefined set of classes using textual queries.
We introduce Semantic Library Adaptation (SemLA), a novel framework for training-free, test-time domain adaptation.
arXiv Detail & Related papers (2025-03-27T17:59:58Z)
- GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing [33.19843463374473]
Vision-Language Models (VLMs) in remote sensing have demonstrated significant potential in traditional tasks.
Current models, which excel in Referring Expression Comprehension (REC), struggle with tasks involving complex instructions.
We introduce the Remote Sensing Vision-Language Task Set (RSVLTS), which includes Open-Vocabulary Tasks (OVT), Referring Expression Tasks (RET), and Described Object Tasks (DOT).
We propose a novel unified data representation using a set-of-points approach for RSVLTS, along with a condition and a self-augmentation strategy based on cyclic referring.
arXiv Detail & Related papers (2025-03-16T12:48:17Z)
- Dynamic Adapter with Semantics Disentangling for Cross-lingual Cross-modal Retrieval [21.982906171156888]
Cross-lingual Cross-modal Retrieval aims to align vision and the low-resource language (the target language) without using any human-labeled target-language data. We propose Dynamic Adapter with Semantics Disentangling (DASD), whose parameters are dynamically generated conditioned on the characteristics of the input captions.
arXiv Detail & Related papers (2024-12-18T05:19:09Z)
- Large Vision-Language Models for Remote Sensing Visual Question Answering [0.0]
Remote Sensing Visual Question Answering (RSVQA) is a challenging task that involves interpreting complex satellite imagery to answer natural language questions.
Traditional approaches often rely on separate visual feature extractors and language processing models, which can be computationally intensive and limited in their ability to handle open-ended questions.
We propose a novel method that leverages a generative Large Vision-Language Model (LVLM) to streamline the RSVQA process.
arXiv Detail & Related papers (2024-11-16T18:32:38Z)
- Aggregate-and-Adapt Natural Language Prompts for Downstream Generalization of CLIP [24.22470408549266]
We dub this prompt embedding the Aggregate-and-Adapted Prompt Embedding (AAPE).
AAPE is shown to generalize to different downstream data distributions and tasks, including vision-language understanding tasks.
We also show AAPE is particularly helpful to handle non-canonical and OOD examples.
arXiv Detail & Related papers (2024-10-31T07:41:13Z)
- SocialGPT: Prompting LLMs for Social Relation Reasoning via Greedy Segment Optimization [70.11167263638562]
Social relation reasoning aims to identify relation categories such as friends, spouses, and colleagues from images.
We first present a simple yet well-crafted framework named SocialGPT, which combines the perception capability of Vision Foundation Models (VFMs) and the reasoning capability of Large Language Models (LLMs) within a modular design.
arXiv Detail & Related papers (2024-10-28T18:10:26Z)
- GQE: Generalized Query Expansion for Enhanced Text-Video Retrieval [56.610806615527885]
This paper introduces a novel data-centric approach, Generalized Query Expansion (GQE), to address the inherent information imbalance between text and video.
By adaptively segmenting videos into short clips and employing zero-shot captioning, GQE enriches the training dataset with comprehensive scene descriptions.
GQE achieves state-of-the-art performance on several benchmarks, including MSR-VTT, MSVD, LSMDC, and VATEX.
arXiv Detail & Related papers (2024-08-14T01:24:09Z)
- Improving Retrieval-augmented Text-to-SQL with AST-based Ranking and Schema Pruning [10.731045939849125]
We focus on Text-to-SQL semantic parsing from the perspective of retrieval-augmented generation.
Motivated by challenges related to the size of commercial database schemata and the deployability of business intelligence solutions, we propose ASTReS, which dynamically retrieves input database information.
arXiv Detail & Related papers (2024-07-03T15:55:14Z)
- UQE: A Query Engine for Unstructured Databases [71.49289088592842]
We investigate the potential of Large Language Models to enable unstructured data analytics.
We propose a new Universal Query Engine (UQE) that directly interrogates and draws insights from unstructured data collections.
arXiv Detail & Related papers (2024-06-23T06:58:55Z)
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL).
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves performance comparable to the state of the art while being nearly 220 times faster in computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- VLT: Vision-Language Transformer and Query Generation for Referring Segmentation [31.051579752237746]
We propose a framework for referring segmentation to facilitate deep interactions among multi-modal information.
We introduce masked contrastive learning to narrow down the features of different expressions for the same target object.
The proposed approach is lightweight and achieves new state-of-the-art referring segmentation results consistently on five datasets.
arXiv Detail & Related papers (2022-10-28T03:36:07Z)
- Query Resolution for Conversational Search with Limited Supervision [63.131221660019776]
We propose QuReTeC (Query Resolution by Term Classification), a neural query resolution model based on bidirectional transformers.
We show that QuReTeC outperforms state-of-the-art models, and furthermore, that our distant supervision method can be used to substantially reduce the amount of human-curated data required to train QuReTeC.
arXiv Detail & Related papers (2020-05-24T11:37:22Z)
This list is automatically generated from the titles and abstracts of the papers listed on this site.