Related papers: Vision-Language In-Context Learning Driven Few-Shot Visual Inspection Model

URL: http://arxiv.org/abs/2502.09057v1
Date: Thu, 13 Feb 2025 08:11:10 GMT
Title: Vision-Language In-Context Learning Driven Few-Shot Visual Inspection Model
Authors: Shiryu Ueno, Yoshikazu Hayashi, Shunsuke Nakatsuka, Yusei Yamada, Hiroaki Aizawa, Kunihito Kato
Abstract summary: We propose a general visual inspection model using a Vision-Language Model (VLM) with few-shot images of non-defective or defective products. For new products, our method employs In-Context Learning, which allows the model to perform inspections with an example of a non-defective or defective image.
Score: 0.5497663232622965
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We propose a general visual inspection model using a Vision-Language Model (VLM) with few-shot images of non-defective or defective products, along with explanatory texts that serve as inspection criteria. Although existing VLMs exhibit high performance across various tasks, they are not trained on specific tasks such as visual inspection. Thus, we construct a dataset consisting of diverse images of non-defective and defective products collected from the web, along with output text in a unified format, and fine-tune a VLM. For new products, our method employs In-Context Learning, which allows the model to perform inspections with an example of a non-defective or defective image and the corresponding explanatory texts with visual prompts. This approach eliminates the need to collect a large number of training samples and re-train the model for each product. The experimental results show that our method achieves high performance, with an MCC of 0.804 and an F1-score of 0.950 on MVTec AD in a one-shot manner. Our code is available at https://github.com/ia-gu/Vision-Language-In-Context-Learning-Driven-Few-Shot-Visual-Inspection-Model.
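Note on the reported metrics: the abstract gives MCC (Matthews correlation coefficient) and F1-score for a binary non-defective/defective decision. The sketch below shows how both are conventionally computed from confusion-matrix counts. It is not taken from the authors' code; the function names and the example counts are hypothetical, and which class counts as "positive" is an assumption the abstract does not state.

```python
import math


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Returns 0.0 when any marginal total is zero (the usual convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts for illustration only (NOT results from the paper).
tp, tn, fp, fn = 90, 40, 10, 5
print(f"F1  = {f1_score(tp, fp, fn):.3f}")
print(f"MCC = {mcc(tp, tn, fp, fn):.3f}")
```

Unlike F1, MCC also uses the true-negative count, so it is less easily inflated when one class dominates the test set, which is a plausible reason for reporting both.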

Related papers

Demystifying the Visual Quality Paradox in Multimodal Large Language Models [49.154146792279946]
Recent Multimodal Large Language Models (MLLMs) excel on benchmark vision-language tasks, yet little is known about how input visual quality shapes their responses. We conduct the first systematic study spanning leading MLLMs and a suite of vision-language benchmarks. We uncover a visual-quality paradox: model, task, and even individual-instance performance can improve when images deviate from human-perceived fidelity.
arXiv Detail & Related papers (2025-06-18T17:14:07Z)
ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs [98.27348724529257]
We introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions. Models trained with the ViCrit Task exhibit substantial gains across a variety of vision-language model benchmarks.
arXiv Detail & Related papers (2025-06-11T19:16:54Z)
ABC: Achieving Better Control of Multimodal Embeddings using VLMs [61.396457715710774]
Visual embedding models excel at zero-shot tasks like visual retrieval and classification. Existing CLIP-based approaches embed images and text independently, and fuse the result. We introduce ABC, an open-source multimodal embedding model that uses a vision-language model backbone.
arXiv Detail & Related papers (2025-03-01T03:29:02Z)
Evaluation and Comparison of Visual Language Models for Transportation Engineering Problems [16.49637074299509]
We have explored state-of-the-art vision language models (VLMs) for vision-based transportation engineering tasks. The image classification tasks involve congestion detection and crack identification, whereas the object detection task involves identifying helmet violations. We have applied open-source models such as CLIP, BLIP, OWL-ViT, Llava-Next, and closed-source GPT-4o to evaluate the performance of these VLMs.
arXiv Detail & Related papers (2024-09-03T20:24:37Z)
VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC). This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions. In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z)
Enhancing Large Vision Language Models with Self-Training on Image Comprehension [131.14381425260706]
We introduce Self-Training on Image (STIC), which emphasizes a self-training approach specifically for image comprehension. First, the model self-constructs a preference for image descriptions using unlabeled images. To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z)
FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction [66.98008357232428]
We propose FineMatch, a new aspect-based fine-grained text and image matching benchmark. FineMatch focuses on text and image mismatch detection and correction. We show that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches.
arXiv Detail & Related papers (2024-04-23T03:42:14Z)
In-context Prompt Learning for Test-time Vision Recognition with Frozen Vision-language Model [13.983810804606264]
We propose In-Context Prompt Learning (InCPL) for test-time visual recognition tasks. InCPL associates a new test sample with very few labeled examples as context information. We introduce a context-aware unsupervised loss to optimize visual prompts tailored to test samples.
arXiv Detail & Related papers (2024-03-10T08:15:51Z)
IMProv: Inpainting-based Multimodal Prompting for Computer Vision Tasks [124.90137528319273]
In this paper, we present IMProv, a generative model that is able to in-context learn visual tasks from multimodal prompts. We train a masked generative transformer on a new dataset of figures from computer vision papers and their associated captions. During inference time, we prompt the model with text and/or image task example(s) and have the model inpaint the corresponding output.
arXiv Detail & Related papers (2023-12-04T09:48:29Z)
Evaluating how interactive visualizations can assist in finding samples where and how computer vision models make mistakes [1.76602679361245]
We present two interactive visualizations in the context of Sprite, a system for creating Computer Vision (CV) models. We study how these visualizations help Sprite's users identify (evaluate) and select (plan) images where a model is struggling and can lead to improved performance.
arXiv Detail & Related papers (2023-05-19T14:43:00Z)
Multi-Modal Few-Shot Object Detection with Meta-Learning-Based Cross-Modal Prompting [77.69172089359606]
We study multi-modal few-shot object detection (FSOD) in this paper, using both few-shot visual examples and class semantic information for detection. Our approach is motivated by the high-level conceptual similarity of (metric-based) meta-learning and prompt-based learning. We comprehensively evaluate the proposed multi-modal FSOD models on multiple few-shot object detection benchmarks, achieving promising results.
arXiv Detail & Related papers (2022-04-16T16:45:06Z)
Detection and Captioning with Unseen Object Classes [12.894104422808242]
Test images may contain visual objects with no corresponding visual or textual training examples. We propose a detection-driven approach based on a generalized zero-shot detection model and a template-based sentence generation model. Our experiments show that the proposed zero-shot detection model obtains state-of-the-art performance on the MS-COCO dataset.
arXiv Detail & Related papers (2021-08-13T10:43:20Z)
Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework. We report performance gains over the state-of-the-art on several downstream tasks including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z)

This list is automatically generated from the titles and abstracts of the papers on this site.