Are vision-language models ready to zero-shot replace supervised classification models in agriculture?
- URL: http://arxiv.org/abs/2512.15977v1
- Date: Wed, 17 Dec 2025 21:22:44 GMT
- Title: Are vision-language models ready to zero-shot replace supervised classification models in agriculture?
- Authors: Earl Ranario, Mason J. Earles
- Abstract summary: Vision-language models (VLMs) are proposed as general-purpose solutions for visual recognition tasks. We benchmark a diverse set of open-source and closed-source VLMs on 27 agricultural classification datasets from the AgML collection.
- Score: 0.8594140167290097
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language models (VLMs) are increasingly proposed as general-purpose solutions for visual recognition tasks, yet their reliability for agricultural decision support remains poorly understood. We benchmark a diverse set of open-source and closed-source VLMs on 27 agricultural classification datasets from the AgML collection, spanning 162 classes across plant disease, pest and damage, and plant and weed species identification. Across all tasks, zero-shot VLMs substantially underperform a supervised task-specific baseline (YOLO11), which consistently achieves markedly higher accuracy than any foundation model. Under multiple-choice prompting, the best-performing VLM (Gemini-3 Pro) reaches approximately 62% average accuracy, while open-ended prompting yields much lower performance, with raw accuracies typically below 25%. Applying LLM-based semantic judging increases open-ended accuracy (for example, from 21% to 30% for top models) and alters model rankings, demonstrating that evaluation methodology meaningfully affects reported conclusions. Among open-source models, Qwen-VL-72B performs best, approaching closed-source performance under constrained prompting but still trailing top proprietary systems. Task-level analysis shows that plant and weed species classification is consistently easier than pest and damage identification, which remains the most challenging category across models. Overall, these results indicate that current off-the-shelf VLMs are not yet suitable as standalone agricultural diagnostic systems, but can function as assistive components when paired with constrained interfaces, explicit label ontologies, and domain-aware evaluation strategies.
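The gap between raw and judged open-ended accuracy described in the abstract can be illustrated with a toy scorer. This is a minimal sketch, not the authors' evaluation code: the labels, the predictions, the `SYNONYMS` table, and the `lenient_match` function are all hypothetical, and the alias lookup is only a stand-in for an LLM-based semantic judge.

```python
# Toy comparison of strict vs. lenient scoring for open-ended answers.
# All data below is hypothetical; the alias table stands in for an LLM judge.

def normalize(text: str) -> str:
    """Lowercase, treat hyphens/underscores as spaces, collapse whitespace."""
    return " ".join(text.lower().replace("-", " ").replace("_", " ").split())

# Hypothetical alias map: ground-truth label -> other answers accepted as correct.
SYNONYMS = {
    "tomato late blight": {"late blight", "phytophthora infestans"},
    "fall armyworm": {"spodoptera frugiperda"},
}

def exact_match(pred: str, label: str) -> bool:
    """Raw scoring: the normalized strings must be identical."""
    return normalize(pred) == normalize(label)

def lenient_match(pred: str, label: str) -> bool:
    """Judge-like scoring: also accept a known alias of the label."""
    p, l = normalize(pred), normalize(label)
    return p == l or p in SYNONYMS.get(l, set())

def accuracy(preds, labels, match) -> float:
    """Fraction of predictions accepted by the given matching rule."""
    hits = sum(match(p, l) for p, l in zip(preds, labels))
    return hits / len(labels)

labels = ["tomato late blight", "fall armyworm", "corn rust", "palmer amaranth"]
preds = ["Late blight", "Spodoptera frugiperda", "corn-rust", "pigweed"]

raw = accuracy(preds, labels, exact_match)
judged = accuracy(preds, labels, lenient_match)
print(f"raw accuracy:    {raw:.2f}")
print(f"judged accuracy: {judged:.2f}")
```

The predictions are identical in both runs; only the scoring rule changes. That is the sense in which the abstract says evaluation methodology can shift reported accuracy and model rankings.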
Related papers
- LeafNet: A Large-Scale Dataset and Comprehensive Benchmark for Foundational Vision-Language Understanding of Plant Diseases [0.0]
LeafBench is a visual question-answering benchmark developed to evaluate the capabilities of Vision-Language Models (VLMs) in understanding plant diseases. The dataset comprises 186,000 digital leaf images spanning 97 disease classes, paired with metadata, generating 13,950 question-answer pairs. Benchmarking 12 state-of-the-art VLMs on our LeafBench dataset, we reveal substantial disparity in their disease understanding capabilities.
arXiv Detail & Related papers (2026-02-14T08:10:27Z)
- Learning Consistent Taxonomic Classification through Hierarchical Reasoning [61.372270953201955]
We propose a two-stage, hierarchy-based reasoning framework designed to improve leaf-level accuracy and hierarchical consistency in taxonomic classification. Our framework, implemented on the Qwen2.5-VL-7B model, outperforms its original 72B counterpart by over 10% in both leaf-level and hierarchical consistency accuracy.
arXiv Detail & Related papers (2026-01-21T03:00:00Z)
- Agri-R1: Empowering Generalizable Agricultural Reasoning in Vision-Language Models with Reinforcement Learning [22.34625628938106]
We propose Agri-R1, a reasoning-enhanced large model for agriculture. Our framework generates high-quality reasoning data via vision-language synthesis and LLM-based filtering. We show a +23.2% relative gain in disease recognition accuracy, +33.3% in agricultural knowledge QA, and a +26.10-point improvement in cross-domain generalization.
arXiv Detail & Related papers (2026-01-08T07:34:37Z)
- Weed Detection in Challenging Field Conditions: A Semi-Supervised Framework for Overcoming Shadow Bias and Data Scarcity [7.019137213828947]
This study tackles both issues through a diagnostic-driven, semi-supervised framework. We use a unique dataset of approximately 975 labeled and 10,000 unlabeled images of Guinea Grass in sugarcane. Our work provides a clear and field-tested framework for developing, diagnosing, and improving robust computer vision systems.
arXiv Detail & Related papers (2025-08-27T01:55:47Z)
- Adapting Vision-Language Models Without Labels: A Comprehensive Survey [74.17944178027015]
Vision-Language Models (VLMs) have demonstrated remarkable generalization capabilities across a wide range of tasks. Recent research has increasingly focused on unsupervised adaptation methods that do not rely on labeled data. We propose a taxonomy based on the availability and nature of unlabeled visual data, categorizing existing approaches into four key paradigms.
arXiv Detail & Related papers (2025-08-07T16:27:37Z)
- Self-Consistency in Vision-Language Models for Precision Agriculture: Multi-Response Consensus for Crop Disease Management [0.0]
This work presents a domain-aware framework for agricultural image processing that combines prompt-based expert evaluation with self-consistency mechanisms. We introduce two key innovations: (1) a prompt-based evaluation protocol that configures a language model as an expert plant pathologist for scalable assessment of image analysis outputs, and (2) a cosine-consistency self-voting mechanism that generates multiple candidate responses from agricultural images. Our approach improves diagnostic accuracy from 82.2% to 87.8%, symptom analysis from 38.9% to 52.2%, and treatment recommendation from 27.8% to 43.3%.
arXiv Detail & Related papers (2025-07-08T18:32:21Z)
- Plant Disease Detection through Multimodal Large Language Models and Convolutional Neural Networks [0.5009853409756729]
This study investigates the effectiveness of combining multimodal Large Language Models (LLMs) with Convolutional Neural Networks (CNNs) for automated plant disease classification using leaf imagery. We evaluate model performance across zero-shot, few-shot, and progressive fine-tuning scenarios.
arXiv Detail & Related papers (2025-04-29T04:31:58Z)
- Adaptive Pruning for Large Language Models with Structural Importance Awareness [66.2690963378878]
Large language models (LLMs) have significantly improved language understanding and generation capabilities. LLMs are difficult to deploy on resource-constrained edge devices due to their high computational and storage resource demands. We propose structurally-aware adaptive pruning (SAAP) to significantly reduce the computational and memory costs while maintaining model performance.
arXiv Detail & Related papers (2024-12-19T18:08:04Z)
- Ranked from Within: Ranking Large Multimodal Models Without Labels [73.96543593298426]
We show that uncertainty scores derived from softmax distributions provide a robust basis for ranking models across various tasks. This facilitates the ranking of LMMs on unlabeled data, providing a practical approach for selecting models for diverse target domains without requiring manual annotation.
arXiv Detail & Related papers (2024-12-09T13:05:43Z)
- Leveraging Vision Language Models for Specialized Agricultural Tasks [19.7240633020344]
We present AgEval, a benchmark for assessing Vision Language Models' capabilities in plant stress phenotyping. Our study explores how general-purpose VLMs can be leveraged for domain-specific tasks with only a few annotated examples. Our results demonstrate VLMs' rapid adaptability to specialized tasks, with the best-performing model showing an increase in F1 scores from 46.24% to 73.37% in 8-shot identification.
arXiv Detail & Related papers (2024-07-29T00:39:51Z)
- Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios.
We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
- Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations [58.442103936918805]
We show that Attention Mask Consistency (AMC) produces better visual grounding results than previous methods.
AMC is effective, easy to implement, and is general as it can be adopted by any vision-language model.
arXiv Detail & Related papers (2022-06-30T17:55:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.