Intelligent Image Search Algorithms Fusing Visual Large Models
- URL: http://arxiv.org/abs/2511.19920v1
- Date: Tue, 25 Nov 2025 04:54:49 GMT
- Title: Intelligent Image Search Algorithms Fusing Visual Large Models
- Authors: Kehan Wang, Tingqiong Cui, Yang Zhang, Yu Chen, Shifeng Wu, Zhenzhang Li,
- Abstract summary: This paper proposes DetVLM, a novel intelligent image search framework that fuses object detection with Visual Large Models.<n>Experiments on a vehicle component dataset show DetVLM achieves a state-of-the-art overall retrieval accuracy of 94.82%.<n>It also attains 94.95% accuracy in zero-shot search for driver mask-wearing and over 90% average accuracy in state search tasks.
- Score: 6.560867914001199
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-grained image retrieval, which aims to find images containing specific object components and assess their detailed states, is critical in fields like security and industrial inspection. However, conventional methods face significant limitations: manual features (e.g., SIFT) lack robustness; deep learning-based detectors (e.g., YOLO) can identify component presence but cannot perform state-specific retrieval or zero-shot search; Visual Large Models (VLMs) offer semantic and zero-shot capabilities but suffer from poor spatial grounding and high computational cost, making them inefficient for direct retrieval. To bridge these gaps, this paper proposes DetVLM, a novel intelligent image search framework that synergistically fuses object detection with VLMs. The framework pioneers a search-enhancement paradigm via a two-stage pipeline: a YOLO detector first conducts efficient, high-recall component-level screening to determine component presence; then, a VLM acts as a recall-enhancement unit, performing secondary verification for components missed by the detector. This architecture directly enables two advanced capabilities: 1) State Search: Guided by task-specific prompts, the VLM refines results by verifying component existence and executing sophisticated state judgments (e.g., "sun visor lowered"), allowing retrieval based on component state. 2) Zero-shot Search: The framework leverages the VLM's inherent zero-shot capability to recognize and retrieve images containing unseen components or attributes (e.g., "driver wearing a mask") without any task-specific training. Experiments on a vehicle component dataset show DetVLM achieves a state-of-the-art overall retrieval accuracy of 94.82\%, significantly outperforming detection-only baselines. It also attains 94.95\% accuracy in zero-shot search for driver mask-wearing and over 90\% average accuracy in state search tasks.
Related papers
- Zero-shot HOI Detection with MLLM-based Detector-agnostic Interaction Recognition [71.5328300638085]
Zero-shot Human-object interaction (HOI) detection aims to locate humans and objects in images and recognize their interactions.<n>Existing methods, including two-stage methods, tightly couple interaction recognition with a specific detector.<n>We propose a decoupled framework that separates object detection from IR and leverages multi-modal large language models (MLLMs) for zero-shot IR.
arXiv Detail & Related papers (2026-02-16T19:01:31Z) - Pixel-Grounded Retrieval for Knowledgeable Large Multimodal Models [58.46663983451155]
PixSearch is an end-to-end Segmenting Large Multimodal Model (LMM) that unifies region-level perception and retrieval-augmented reasoning.<n>During encoding, PixSearch emits search> tokens to trigger retrieval, selects query modalities (text, image, or region), and generates pixel-level masks that directly serve as visual queries.<n>On egocentric and entity-centric VQA benchmarks, PixSearch substantially improves factual consistency and generalization.
arXiv Detail & Related papers (2026-01-27T00:46:08Z) - WeDetect: Fast Open-Vocabulary Object Detection as Retrieval [74.39703419628829]
Open-vocabulary object detection aims to detect arbitrary classes via text prompts.<n> Methods without cross-modal fusion layers (non-fusion) offer faster inference by treating recognition as a retrieval problem.<n>We develop a model family named WeDetect to achieve state-of-the-art performance across 15 benchmarks with high inference efficiency.
arXiv Detail & Related papers (2025-12-13T12:40:28Z) - X2-DFD: A framework for eXplainable and eXtendable Deepfake Detection [55.77552681618732]
X2-DFD is an eXplainable and eXtendable framework based on multimodal large-language models (MLLMs) for deepfake detection.<n>The first stage, Model Feature Assessment, systematically evaluates the detectability of forgery-related features for the MLLM.<n>The second stage, Explainable dataset Construction, consists of two key modules: Strong Feature Strengthening and Weak Feature Supplementing.<n>The third stage, Fine-tuning and Inference, involves fine-tuning the MLLM on the constructed dataset and deploying it for final detection and explanation.
arXiv Detail & Related papers (2024-10-08T15:28:33Z) - Few-shot Oriented Object Detection with Memorable Contrastive Learning in Remote Sensing Images [11.217630579076237]
Few-shot object detection (FSOD) has garnered significant research attention in the field of remote sensing.
We propose a novel FSOD method for remote sensing images called Few-shot Oriented object detection with Memorable Contrastive learning (FOMC)
Specifically, we employ oriented bounding boxes instead of traditional horizontal bounding boxes to learn a better feature representation for arbitrary-oriented aerial objects.
arXiv Detail & Related papers (2024-03-20T08:15:18Z) - Exploring Robust Features for Few-Shot Object Detection in Satellite
Imagery [17.156864650143678]
We develop a few-shot object detector based on a traditional two-stage architecture.
A large-scale pre-trained model is used to build class-reference embeddings or prototypes.
We perform evaluations on two remote sensing datasets containing challenging and rare objects.
arXiv Detail & Related papers (2024-03-08T15:20:27Z) - Single-Shot and Multi-Shot Feature Learning for Multi-Object Tracking [55.13878429987136]
We propose a simple yet effective two-stage feature learning paradigm to jointly learn single-shot and multi-shot features for different targets.
Our method has achieved significant improvements on MOT17 and MOT20 datasets while reaching state-of-the-art performance on DanceTrack dataset.
arXiv Detail & Related papers (2023-11-17T08:17:49Z) - Query-based Hard-Image Retrieval for Object Detection at Test Time [10.63460618121976]
We reformulate the problem of finding "hard" images as a query-based hard image retrieval task.
Our method is entirely post-hoc, does not require ground-truth annotations, and relies on an efficient Monte Carlo estimation.
We provide results on ranking and classification tasks using the widely used RetinaNet, Faster-RCNN, Mask-RCNN, and Cascade Mask-RCNN object detectors.
arXiv Detail & Related papers (2022-09-23T12:33:31Z) - Activation to Saliency: Forming High-Quality Labels for Unsupervised
Salient Object Detection [54.92703325989853]
We propose a two-stage Activation-to-Saliency (A2S) framework that effectively generates high-quality saliency cues.
No human annotations are involved in our framework during the whole training process.
Our framework reports significant performance compared with existing USOD methods.
arXiv Detail & Related papers (2021-12-07T11:54:06Z) - One-Shot Object Affordance Detection in the Wild [76.46484684007706]
Affordance detection refers to identifying the potential action possibilities of objects in an image.
We devise a One-Shot Affordance Detection Network (OSAD-Net) that estimates the human action purpose and then transfers it to help detect the common affordance from all candidate images.
With complex scenes and rich annotations, our PADv2 dataset can be used as a test bed to benchmark affordance detection methods.
arXiv Detail & Related papers (2021-08-08T14:53:10Z) - Should I Look at the Head or the Tail? Dual-awareness Attention for
Few-Shot Object Detection [20.439719842851744]
We propose a novel Dual-Awareness-Attention (DAnA), which captures the pairwise spatial relationship cross the support and query images.
Our DAnA component is adaptable to various existing object detection networks and boosts FSOD performance by paying attention to specific semantics.
Experimental results demonstrate that DAnA significantly boosts (48% and 125% relatively) object detection performance on the COCO benchmark.
arXiv Detail & Related papers (2021-02-24T09:17:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.