Zero-Shot Vehicle Model Recognition via Text-Based Retrieval-Augmented Generation
- URL: http://arxiv.org/abs/2510.18502v1
- Date: Tue, 21 Oct 2025 10:39:39 GMT
- Title: Zero-Shot Vehicle Model Recognition via Text-Based Retrieval-Augmented Generation
- Authors: Wei-Chia Chang, Yan-Ann Chen,
- Abstract summary: We propose a pipeline that integrates vision language models (VLMs) with Retrieval-Augmented Generation (RAG) to support zero-shot recognition.<n>A VLM converts vehicle images into descriptive attributes, which are compared against a database of textual features.<n>This design avoids large-scale retraining and enables rapid updates by adding textual descriptions of new vehicles.
- Score: 1.2031796234206136
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vehicle make and model recognition (VMMR) is an important task in intelligent transportation systems, but existing approaches struggle to adapt to newly released models. Contrastive Language-Image Pretraining (CLIP) provides strong visual-text alignment, yet its fixed pretrained weights limit performance without costly image-specific finetuning. We propose a pipeline that integrates vision language models (VLMs) with Retrieval-Augmented Generation (RAG) to support zero-shot recognition through text-based reasoning. A VLM converts vehicle images into descriptive attributes, which are compared against a database of textual features. Relevant entries are retrieved and combined with the description to form a prompt, and a language model (LM) infers the make and model. This design avoids large-scale retraining and enables rapid updates by adding textual descriptions of new vehicles. Experiments show that the proposed method improves recognition by nearly 20% over the CLIP baseline, demonstrating the potential of RAG-enhanced LM reasoning for scalable VMMR in smart-city applications.
Related papers
- Co-Training Vision Language Models for Remote Sensing Multi-task Learning [68.15604397741753]
Vision language models (VLMs) have achieved promising results in RS image understanding, grounding, and ultra-high-resolution (UHR) image reasoning.<n>We present RSCoVLM, a simple yet flexible VLM baseline for RS MTL.<n>We propose a unified dynamic-resolution strategy to address the diverse image scales inherent in RS imagery.
arXiv Detail & Related papers (2025-11-26T10:55:07Z) - ViSS-R1: Self-Supervised Reinforcement Video Reasoning [84.1180294023835]
We introduce a novel self-supervised reinforcement learning GRPO algorithm (Pretext-GRPO) within the standard R1 pipeline.<n>We also propose the ViSS-R1 framework, which streamlines and integrates pretext-task-based self-supervised learning directly into the MLLM's R1 post-training paradigm.
arXiv Detail & Related papers (2025-11-17T07:00:42Z) - Multimodal RAG Enhanced Visual Description [3.2771631221674333]
Pre-trained large multimodal models (LMMs) encounter a modality gap, characterised by a misalignment between textual and visual representations.<n>We propose a lightweight training-free approach utilising Retrieval-Augmented Generation (RAG) to extend across the modality.<n> Experimental results on two benchmark multimodal datasets demonstrate significant improvements.
arXiv Detail & Related papers (2025-08-06T19:04:38Z) - Multimodal LLMs as Customized Reward Models for Text-to-Image Generation [60.164968941945645]
We introduce LLaVA-Reward, an efficient reward model designed to automatically evaluate text-to-image (T2I) generations across multiple perspectives.<n>LLaVA-Reward directly utilizes the hidden states of multimodal large language models (MLLMs)<n>We train LLaVA-Reward on four evaluation perspectives: text-image alignment, fidelity/artifact, safety, and overall ranking.
arXiv Detail & Related papers (2025-07-28T23:52:53Z) - Integrating Frequency-Domain Representations with Low-Rank Adaptation in Vision-Language Models [0.6715525121432597]
This research presents a novel vision language model (VLM) framework to enhance feature extraction, scalability, and efficiency.<n>We evaluate the proposed model on caption generation and Visual Question Answering (VQA) tasks using benchmark datasets with varying levels of Gaussian noise.<n>Our model provides more detailed and contextually relevant responses, particularly for real-world images captured by a RealSense camera mounted on an Unmanned Ground Vehicle (UGV)
arXiv Detail & Related papers (2025-03-08T01:22:10Z) - Enhancing Large Vision Language Models with Self-Training on Image Comprehension [131.14381425260706]
We introduce Self-Training on Image (STIC), which emphasizes a self-training approach specifically for image comprehension.
First, the model self-constructs a preference for image descriptions using unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z) - Calibrated Self-Rewarding Vision Language Models [27.686545023186852]
Large Vision-Language Models (LVLMs) have made substantial progress by integrating pre-trained large language models (LLMs) and vision models through instruction tuning.
LVLMs often exhibit the hallucination phenomenon, where generated text responses appear linguistically plausible but contradict the input image.
We propose the Calibrated Self-Rewarding (CSR) approach, which enables the model to self-improve by iteratively generating candidate responses, evaluating the reward for each response, and curating preference data for fine-tuning.
arXiv Detail & Related papers (2024-05-23T14:30:33Z) - Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual tokens, which maps the visual features to probability distributions over Large Multi-modal Models' vocabulary.
We further explore the distribution of visual features in the semantic space within LMM and the possibility of using text embeddings to represent visual information.
arXiv Detail & Related papers (2024-03-12T14:58:52Z) - Vision-by-Language for Training-Free Compositional Image Retrieval [78.60509831598745]
Compositional Image Retrieval (CIR) aims to retrieve the relevant target image in a database.
Recent research sidesteps this need by using large-scale vision-language models (VLMs)
We propose to tackle CIR in a training-free manner via Vision-by-Language (CIReVL)
arXiv Detail & Related papers (2023-10-13T17:59:38Z) - CVT-SLR: Contrastive Visual-Textual Transformation for Sign Language
Recognition with Variational Alignment [42.10603331311837]
Sign language recognition ( SLR) is a weakly supervised task that annotates sign videos as textual glosses.
Recent studies show that insufficient training caused by the lack of large-scale available sign datasets becomes the main bottleneck for SLR.
We propose a novel contrastive visual transformation for SLR,- SLR, to fully explore the pretrained knowledge of both the visual and language modalities.
arXiv Detail & Related papers (2023-03-10T06:12:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.