Trust the Model: Compact VLMs as In-Context Judges for Image-Text Data Quality
- URL: http://arxiv.org/abs/2507.20156v1
- Date: Sun, 27 Jul 2025 07:20:25 GMT
- Title: Trust the Model: Compact VLMs as In-Context Judges for Image-Text Data Quality
- Authors: Daulet Toibazar, Kesen Wang, Sherif Mohamed, Abdulaziz Al-Badawi, Abdulrahman Alfulayt, Pedro J. Moreno
- Abstract summary: Vision-language models (VLMs) extend conventional large language models by integrating visual data, enabling richer multimodal reasoning. We introduce a streamlined data filtration framework that employs a compact VLM, fine-tuned on a high-quality annotated image-caption dataset. This model evaluates and filters candidate training samples based on caption quality, image quality, and image-text alignment.
- Score: 5.750869893508341
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language models (VLMs) extend conventional large language models by integrating visual data, enabling richer multimodal reasoning and significantly broadening the practical applications of AI. However, including visual inputs also brings new challenges in maintaining data quality. Empirical evidence consistently shows that carefully curated and representative training examples often yield superior results compared to simply increasing the quantity of data. Inspired by this observation, we introduce a streamlined data filtration framework that employs a compact VLM, fine-tuned on a high-quality annotated image-caption dataset. This model effectively evaluates and filters potential training samples based on caption and image quality and alignment. Unlike previous approaches, which typically add auxiliary filtration modules on top of existing full-scale VLMs, our method exclusively utilizes the inherent evaluative capability of a purpose-built small VLM. This strategy eliminates the need for extra modules and reduces training overhead. Our lightweight model efficiently filters out inaccurate, noisy web data, improving image-text alignment and caption linguistic fluency. Experimental results show that datasets filtered with high precision by our compact VLM perform on par with, or even surpass, larger and noisier datasets gathered through high-volume web crawling. Thus, our method provides a lightweight yet robust solution for building high-quality vision-language training corpora. Availability and implementation: Our compact VLM filtration model, training data, utility scripts, and Supplementary data (Appendices) are freely available at https://github.com/daulettoibazar/Compact_VLM_Filter.
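For a concrete picture of the filtration loop described in the abstract, the following Python sketch illustrates the idea under stated assumptions: the `ImageTextPair` container, the `toy_judge` placeholder, and the 3.5 acceptance threshold are illustrative stand-ins, not the authors' released interface; the actual model and utility scripts live in the linked repository.

```python
# Minimal sketch of VLM-as-judge data filtration (illustrative only).
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class ImageTextPair:
    image_path: str
    caption: str


def filter_pairs(
    pairs: Iterable[ImageTextPair],
    judge: Callable[[ImageTextPair], float],
    threshold: float = 3.5,
) -> List[ImageTextPair]:
    """Keep pairs whose judge score (e.g. caption fluency plus
    image-text alignment on a 0-5 scale) meets the threshold."""
    return [p for p in pairs if judge(p) >= threshold]


def toy_judge(pair: ImageTextPair) -> float:
    """Placeholder: a real judge would prompt the fine-tuned compact VLM
    with the image and caption and parse its numeric rating."""
    return 5.0 if len(pair.caption.split()) >= 5 else 1.0


if __name__ == "__main__":
    corpus = [
        ImageTextPair("cat.jpg", "A tabby cat sleeping on a sunny windowsill."),
        ImageTextPair("dog.jpg", "img_0423"),  # noisy web caption
    ]
    print(filter_pairs(corpus, toy_judge))  # keeps only the first pair
```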
Related papers
- HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models [15.877790469608662]
We introduce an LVLM-driven data refinement pipeline to enhance the quality of image-text pair data. We propose a training paradigm that extends conventional contrastive learning by incorporating negative descriptions and short tags. Our approach achieves state-of-the-art performance in zero-shot classification, cross-modal retrieval, and fine-grained visual understanding tasks.
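The "negative descriptions" idea can be made concrete with a small PyTorch sketch: a standard CLIP-style symmetric contrastive loss whose image-to-text logits are augmented with each image's own hard-negative captions. The tensor shapes, the way negatives enter the loss, and the temperature are assumptions for illustration, not HQ-CLIP's exact recipe.

```python
# Illustrative contrastive loss with extra hard-negative descriptions.
import torch
import torch.nn.functional as F


def contrastive_loss_with_negatives(
    img_emb: torch.Tensor,      # (B, D) L2-normalized image embeddings
    txt_emb: torch.Tensor,      # (B, D) L2-normalized positive caption embeddings
    neg_txt_emb: torch.Tensor,  # (B, K, D) L2-normalized negative descriptions
    temperature: float = 0.07,
) -> torch.Tensor:
    B = img_emb.size(0)
    # Standard in-batch logits: each image against every caption in the batch.
    logits_i2t = img_emb @ txt_emb.t() / temperature                               # (B, B)
    # Extra logits against each image's own hard-negative descriptions.
    neg_logits = torch.einsum("bd,bkd->bk", img_emb, neg_txt_emb) / temperature    # (B, K)
    logits_i2t = torch.cat([logits_i2t, neg_logits], dim=1)                        # (B, B + K)
    logits_t2i = txt_emb @ img_emb.t() / temperature                               # (B, B)

    targets = torch.arange(B, device=img_emb.device)
    loss_i2t = F.cross_entropy(logits_i2t, targets)
    loss_t2i = F.cross_entropy(logits_t2i, targets)
    return 0.5 * (loss_i2t + loss_t2i)
```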
arXiv Detail & Related papers (2025-07-30T07:21:36Z)
- Improving Large Vision-Language Models' Understanding for Field Data [62.917026891829025]
We introduce FieldLVLM, a framework designed to improve large vision-language models' understanding of field data. FieldLVLM consists of two main components: a field-aware language generation strategy and data-compressed multimodal model tuning. Experimental results on newly proposed benchmark datasets demonstrate that FieldLVLM significantly outperforms existing methods in tasks involving scientific field data.
arXiv Detail & Related papers (2025-07-24T11:28:53Z)
- Quality-Driven Curation of Remote Sensing Vision-Language Data via Learned Scoring Models [9.238739743596236]
We propose a novel score model trained on large-scale RS vision-language preference data for automated quality assessment. Our empirical results demonstrate that fine-tuning CLIP or advanced VLMs with the top 30% of data ranked by our score model achieves superior interpretation accuracy.
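A minimal sketch of this score-then-select step, assuming a generic `score_fn` in place of the paper's preference-trained score model:

```python
# Rank samples with a learned quality scorer and keep the top fraction.
from typing import Callable, List, Sequence


def keep_top_fraction(
    samples: Sequence,
    score_fn: Callable[[object], float],
    fraction: float = 0.30,
) -> List:
    """Sort by descending quality score and keep the top `fraction`."""
    scored = sorted(samples, key=score_fn, reverse=True)
    k = max(1, int(len(scored) * fraction))
    return scored[:k]
```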
arXiv Detail & Related papers (2025-03-02T05:44:56Z)
- Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation [57.34255010956452]
This work revisits scaling with synthetic data and focuses on developing video-LLMs from a data-centric perspective. We propose a data augmentation method called Sparrow, which synthesizes video-like samples from pure text instruction data. Our proposed method achieves performance comparable to or even superior to that of baselines trained with significantly more samples.
arXiv Detail & Related papers (2024-11-29T18:59:54Z)
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z)
- Advancing Multimodal Large Language Models in Chart Question Answering with Visualization-Referenced Instruction Tuning [1.6570772838074355]
Multimodal large language models (MLLMs) exhibit great potential for chart question answering (CQA).
Recent efforts primarily focus on scaling up training datasets through data collection and synthesis.
We propose a visualization-referenced instruction tuning approach to guide the training dataset enhancement and model development.
arXiv Detail & Related papers (2024-07-29T17:04:34Z)
- Auto Cherry-Picker: Learning from High-quality Generative Data Driven by Language [41.40908753726324]
Diffusion models can generate realistic and diverse images, potentially facilitating data availability for data-intensive perception tasks. We present Auto Cherry-Picker (ACP), a novel framework that generates high-quality cross-modality training samples at scale.
arXiv Detail & Related papers (2024-06-28T17:53:18Z)
- Synth$^2$: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings [16.28853186016663]
We create synthetic image-text pairs for efficient and effective Visual-Language Models (VLMs) training.
Our method employs a pretrained text-to-image model to synthesize image embeddings from captions generated by an LLM.
Our VLM, fine-tuned on synthetic data, achieves performance comparable to that of models trained solely on human-annotated data.
arXiv Detail & Related papers (2024-03-12T15:36:42Z)
- Your Vision-Language Model Itself Is a Strong Filter: Towards High-Quality Instruction Tuning with Data Selection [59.11430077029321]
We introduce a novel dataset selection method, Self-Filter, for vision-language models (VLMs).
In the first stage, we devise a scoring network to evaluate the difficulty of training instructions, which is co-trained with the VLM.
In the second stage, we use the trained score net to measure the difficulty of each instruction, select the most challenging samples, and penalize similar samples to encourage diversity.
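A hedged sketch of such a second stage: greedily pick high-difficulty samples while penalizing those too similar to what is already selected. The difficulty scores, sample embeddings, and penalty weight are assumed inputs for illustration, not Self-Filter's exact formulation.

```python
# Greedy hard-and-diverse selection (illustrative, not the paper's recipe).
from typing import List

import numpy as np


def select_hard_and_diverse(
    difficulty: np.ndarray,   # (N,) difficulty scores from a score network
    embeddings: np.ndarray,   # (N, D) L2-normalized sample embeddings
    budget: int,              # number of samples to select (<= N)
    penalty: float = 1.0,     # weight of the similarity penalty
) -> List[int]:
    selected: List[int] = []
    max_sim = np.zeros(len(difficulty))            # max similarity to selected set
    for _ in range(budget):
        utility = difficulty - penalty * max_sim   # hard but not redundant
        utility[selected] = -np.inf                # never pick a sample twice
        best = int(np.argmax(utility))
        selected.append(best)
        # Update each sample's closest similarity to the selected set.
        max_sim = np.maximum(max_sim, embeddings @ embeddings[best])
    return selected
```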
arXiv Detail & Related papers (2024-02-19T20:08:48Z)
- Data Filtering Networks [67.827994353269]
We study the problem of learning a data filtering network (DFN) for the second step of dataset construction: filtering a large uncurated dataset.
Our key finding is that the quality of a network for filtering is distinct from its performance on downstream tasks.
Based on our insights, we construct new data filtering networks that induce state-of-the-art image-text datasets.
arXiv Detail & Related papers (2023-09-29T17:37:29Z)
- Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models [50.07056960586183]
We propose Position-enhanced Visual Instruction Tuning (PVIT) to extend the functionality of Multimodal Large Language Models (MLLMs).
This integration promotes a more detailed comprehension of images for the MLLM.
We present both quantitative experiments and qualitative analysis that demonstrate the superiority of the proposed model.
arXiv Detail & Related papers (2023-08-25T15:33:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.