Rethinking Plant Disease Diagnosis: Bridging the Academic-Practical Gap with Vision Transformers and Zero-Shot Learning
- URL: http://arxiv.org/abs/2511.18989v1
- Date: Mon, 24 Nov 2025 11:08:01 GMT
- Title: Rethinking Plant Disease Diagnosis: Bridging the Academic-Practical Gap with Vision Transformers and Zero-Shot Learning
- Authors: Wassim Benabbas, Mohammed Brahimi, Samir Akhrouf, Bilal Fortas
- Abstract summary: We investigate whether attention-based architectures and zero-shot learning approaches can bridge the gap between curated academic datasets and real-world agricultural conditions. We evaluate three model categories: Convolutional Neural Networks (CNNs), Vision Transformers, and Contrastive Language-Image Pre-training (CLIP)-based zero-shot models.
- Score: 2.3536628395905974
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in deep learning have enabled significant progress in plant disease classification using leaf images. Much of the existing research in this field has relied on the PlantVillage dataset, which consists of well-centered plant images captured against uniform, uncluttered backgrounds. Although models trained on this dataset achieve high accuracy, they often fail to generalize to real-world field images, such as those submitted by farmers to plant diagnostic systems. This has created a significant gap between published studies and practical application requirements, highlighting the necessity of investigating and addressing this issue. In this study, we investigate whether attention-based architectures and zero-shot learning approaches can bridge the gap between curated academic datasets and real-world agricultural conditions in plant disease classification. We evaluate three model categories: Convolutional Neural Networks (CNNs), Vision Transformers, and Contrastive Language-Image Pre-training (CLIP)-based zero-shot models. While CNNs exhibit limited robustness under domain shift, Vision Transformers demonstrate stronger generalization by capturing global contextual features. Most notably, CLIP models classify diseases directly from natural language descriptions without any task-specific training, offering strong adaptability and interpretability. These findings highlight the potential of zero-shot learning as a practical and scalable domain adaptation strategy for plant health diagnosis in diverse field environments.
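To make the zero-shot step in the abstract concrete, here is a minimal sketch, not the authors' implementation. It assumes the image and the text prompts have already been embedded by a CLIP-style encoder into a shared space. The toy three-dimensional vectors, the prompt wording, the function names, and the temperature value of 100 are all assumptions chosen for illustration only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, prompt_embeddings, temperature=100.0):
    """Score one image embedding against a text embedding per class prompt.

    prompt_embeddings maps a natural-language class description to its
    embedding. No classifier is trained: the class set is defined entirely
    by the prompts. The temperature is an assumed scaling factor.
    """
    image = l2_normalize(image_embedding)
    labels = list(prompt_embeddings)
    similarities = [
        sum(a * b for a, b in zip(image, l2_normalize(prompt_embeddings[label])))
        for label in labels
    ]
    probabilities = softmax([temperature * s for s in similarities])
    return dict(zip(labels, probabilities))


# Hypothetical embeddings; a real pipeline would obtain these from an encoder.
prompts = {
    "a photo of a healthy tomato leaf": [0.9, 0.1, 0.0],
    "a photo of a tomato leaf with late blight": [0.1, 0.9, 0.2],
    "a photo of a tomato leaf with leaf mold": [0.0, 0.2, 0.9],
}
image = [0.2, 0.8, 0.3]

probs = zero_shot_classify(image, prompts)
best = max(probs, key=probs.get)
print(best)
```

Because the classes are defined only by the prompt strings, adding a new disease in this sketch means adding a new prompt and its embedding rather than retraining a model.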
Related papers
- A Semantically Enhanced Generative Foundation Model Improves Pathological Image Synthesis [82.01597026329158]
We introduce a Correlation-Regulated Alignment Framework for Tissue Synthesis (CRAFTS) for pathology-specific text-to-image synthesis. CRAFTS incorporates a novel alignment mechanism that suppresses semantic drift to ensure biological accuracy. This model generates diverse pathological images spanning 30 cancer types, with quality rigorously validated by objective metrics and pathologist evaluations.
arXiv Detail & Related papers (2025-12-15T10:22:43Z)
- Toward Accessible Dermatology: Skin Lesion Classification Using Deep Learning Models on Mobile-Acquired Images [0.0]
In this work, we curate a large dataset of over 50 skin disease categories captured with mobile devices. We evaluate multiple convolutional neural networks and Transformer-based architectures. Our results underscore the potential of Transformer-based approaches for mobile-acquired skin lesion classification.
arXiv Detail & Related papers (2025-09-05T04:31:16Z)
- Automated Multi-Class Crop Pathology Classification via Convolutional Neural Networks: A Deep Learning Approach for Real-Time Precision Agriculture [0.0]
This research introduces a Convolutional Neural Network (CNN)-based image classification system designed to automate the detection and classification of eight common crop diseases. The solution is deployed on an open-source, mobile-compatible platform, enabling real-time image-based diagnostics for farmers in remote areas.
arXiv Detail & Related papers (2025-07-12T18:45:50Z)
- PixCell: A generative foundation model for digital histopathology images [49.00921097924924]
We introduce PixCell, the first diffusion-based generative foundation model for histopathology. We train PixCell on PanCan-30M, a vast, diverse dataset derived from 69,184 H&E-stained whole slide images covering various cancer types.
arXiv Detail & Related papers (2025-06-05T15:14:32Z)
- A Vision-Language Foundation Model for Leaf Disease Identification [0.0]
Leaf disease identification plays a pivotal role in smart agriculture. Many existing studies still struggle to integrate image and textual modalities to compensate for each other's limitations. We propose SCOLD, a context-aware vision-language foundation model to address these challenges.
arXiv Detail & Related papers (2025-05-11T15:30:06Z)
- Leveraging Vision-Language Embeddings for Zero-Shot Learning in Histopathology Images [7.048241543461529]
We propose a novel framework called Multi-Resolution Prompt-guided Hybrid Embedding (MR-PHE) to address these challenges in zero-shot histopathology image classification. We introduce a hybrid embedding strategy that integrates global image embeddings with weighted patch embeddings. A similarity-based patch weighting mechanism assigns attention-like weights to patches based on their relevance to class embeddings.
arXiv Detail & Related papers (2025-03-13T12:18:37Z)
- Affine-Consistent Transformer for Multi-Class Cell Nuclei Detection [76.11864242047074]
We propose a novel Affine-Consistent Transformer (AC-Former), which directly yields a sequence of nucleus positions.
We introduce an Adaptive Affine Transformer (AAT) module, which can automatically learn the key spatial transformations to warp original images for local network training.
Experimental results demonstrate that the proposed method significantly outperforms existing state-of-the-art algorithms on various benchmarks.
arXiv Detail & Related papers (2023-10-22T02:27:02Z)
- Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z)
- Semantic Image Segmentation with Deep Learning for Vine Leaf Phenotyping [59.0626764544669]
In this study, we use Deep Learning methods to semantically segment grapevine leaves images in order to develop an automated object detection system for leaf phenotyping.
Our work contributes to plant lifecycle monitoring through which dynamic traits such as growth and development can be captured and quantified.
arXiv Detail & Related papers (2022-10-24T14:37:09Z)
- Intelligent Masking: Deep Q-Learning for Context Encoding in Medical Image Analysis [48.02011627390706]
We develop a novel self-supervised approach that occludes targeted regions to improve the pre-training procedure.
We show that training the agent against the prediction model can significantly improve the semantic features extracted for downstream classification tasks.
arXiv Detail & Related papers (2022-03-25T19:05:06Z)
- Self-Supervised Vision Transformers Learn Visual Concepts in Histopathology [5.164102666113966]
We conduct a search for good representations in pathology by training a variety of self-supervised models with validation on a variety of weakly-supervised and patch-level tasks.
Our key finding is in discovering that Vision Transformers using DINO-based knowledge distillation are able to learn data-efficient and interpretable features in histology images.
arXiv Detail & Related papers (2022-03-01T16:14:41Z)
- Deep Low-Shot Learning for Biological Image Classification and Visualization from Limited Training Samples [52.549928980694695]
In situ hybridization (ISH) gene expression pattern images from the same developmental stage are compared.
Labeling training data with precise stages is very time-consuming even for biologists.
We propose a deep two-step low-shot learning framework to accurately classify ISH images using limited training images.
arXiv Detail & Related papers (2020-10-20T06:06:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.