AgriGPT-VL: Agricultural Vision-Language Understanding Suite
- URL: http://arxiv.org/abs/2510.04002v2
- Date: Tue, 07 Oct 2025 13:17:08 GMT
- Title: AgriGPT-VL: Agricultural Vision-Language Understanding Suite
- Authors: Bo Yang, Yunkui Chen, Lanfei Feng, Yu Zhang, Xiao Xu, Jianyu Zhang, Nueraili Aierken, Runhe Huang, Hongjian Lin, Yibin Ying, Shijian Li
- Abstract summary: AgriGPT-VL Suite is a unified multimodal framework for agriculture. First, we introduce Agri-3M-VL, the largest vision-language corpus for agriculture to our knowledge. Second, we develop AgriGPT-VL, an agriculture-specialized vision-language model trained via a progressive curriculum of textual grounding. Third, we establish AgriBench-VL-4K, a compact yet challenging evaluation suite with open-ended and image-grounded questions.
- Score: 12.521000582108888
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite rapid advances in multimodal large language models, agricultural applications remain constrained by the scarcity of domain-tailored models, curated vision-language corpora, and rigorous evaluation. To address these challenges, we present the AgriGPT-VL Suite, a unified multimodal framework for agriculture. Our contributions are threefold. First, we introduce Agri-3M-VL, the largest vision-language corpus for agriculture to our knowledge, curated by a scalable multi-agent data generator; it comprises 1M image-caption pairs, 2M image-grounded VQA pairs, 50K expert-level VQA instances, and 15K GRPO reinforcement learning samples. Second, we develop AgriGPT-VL, an agriculture-specialized vision-language model trained via a progressive curriculum of textual grounding, multimodal shallow/deep alignment, and GRPO refinement. This method achieves strong multimodal reasoning while preserving text-only capability. Third, we establish AgriBench-VL-4K, a compact yet challenging evaluation suite with open-ended and image-grounded questions, paired with multi-metric evaluation and an LLM-as-a-judge framework. Experiments show that AgriGPT-VL outperforms leading general-purpose VLMs on AgriBench-VL-4K, achieving higher pairwise win rates in the LLM-as-a-judge evaluation. Meanwhile, it remains competitive on the text-only AgriBench-13K with no noticeable degradation of language ability. Ablation studies further confirm consistent gains from our alignment and GRPO refinement stages. We will open-source all of these resources to support reproducible research and deployment in low-resource agricultural settings.
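The abstract's headline result rests on an LLM-as-a-judge protocol that compares AgriGPT-VL's answers pairwise against a baseline and reports win rates. The sketch below shows how such a protocol is commonly implemented; the judge prompt, the `client.complete` call, and the record field names are assumptions for illustration, not the authors' released evaluation code.

```python
# Minimal sketch of pairwise LLM-as-a-judge win-rate evaluation.
# The judge client, prompt wording, and record fields are assumptions.
from collections import Counter

JUDGE_PROMPT = """You are an expert agricultural reviewer. Given a question,
decide which candidate answer is better.
Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}
Reply with exactly one token: A, B, or TIE."""

def judge(client, question: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge LLM for a pairwise preference (hypothetical client API)."""
    reply = client.complete(JUDGE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b))
    verdict = reply.strip().upper()
    return verdict if verdict in {"A", "B", "TIE"} else "TIE"

def pairwise_win_rate(client, records) -> dict:
    """Win rate of model A over model B across benchmark records.

    Each record holds a question plus both models' answers, e.g. one item
    of AgriBench-VL-4K (field names here are illustrative).
    """
    tally = Counter(judge(client, r["question"], r["answer_a"], r["answer_b"])
                    for r in records)
    decided = tally["A"] + tally["B"]
    return {"win_rate_a": tally["A"] / decided if decided else 0.0,
            "ties": tally["TIE"], "total": sum(tally.values())}
```

Because judges are sensitive to answer order, such protocols typically score each pair twice with the positions of A and B swapped and reconcile the two verdicts.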
Related papers
- AgriGPT-Omni: A Unified Speech-Vision-Text Framework for Multilingual Agricultural Intelligence [13.233457989297358]
AgriGPT-Omni is an omni-framework that unifies speech, vision, and text for agriculture. We construct a scalable data synthesis and collection pipeline that converts agricultural texts and images into training data. We train the first agricultural omni-model via a three-stage paradigm: textual knowledge injection, progressive multimodal alignment, and GRPO-based reinforcement learning.
arXiv Detail & Related papers (2025-12-11T13:24:40Z) - AgriDoctor: A Multimodal Intelligent Assistant for Agriculture [45.77373971125537]
AgriDoctor is a modular, multimodal framework designed for intelligent crop disease diagnosis and agricultural knowledge interaction. To facilitate effective training and evaluation, we construct AgriMM, a benchmark comprising 400,000 annotated disease images, 831 expert-curated knowledge entries, and 300,000 bilingual prompts for intent-driven tool selection. Experiments demonstrate that AgriDoctor, trained on AgriMM, significantly outperforms state-of-the-art LVLMs on fine-grained agricultural tasks.
arXiv Detail & Related papers (2025-09-21T11:51:57Z) - AgriGPT: a Large Language Model Ecosystem for Agriculture [16.497060004913806]
AgriGPT is a domain-specialized Large Language Model ecosystem for agriculture. At its core, we design a scalable data engine that compiles credible data sources into Agri-342K, a high-quality, standardized question-answer dataset. We employ Tri-RAG, a three-channel Retrieval-Augmented Generation framework combining dense retrieval, sparse retrieval, and multi-hop knowledge graph reasoning (a minimal fusion sketch appears after this list).
arXiv Detail & Related papers (2025-08-12T04:51:08Z) - AgriEval: A Comprehensive Chinese Agricultural Benchmark for Large Language Models [19.265932725554833]
We propose AgriEval, the first comprehensive Chinese agricultural benchmark, with three main characteristics. AgriEval covers six major agriculture categories and 29 subcategories, addressing four core cognitive scenarios. It comprises 14,697 multiple-choice questions and 2,167 open-ended question-and-answer items, making it the most extensive agricultural benchmark available to date.
arXiv Detail & Related papers (2025-07-29T12:58:27Z) - AgroBench: Vision-Language Model Benchmark in Agriculture [25.52955831089068]
We introduce AgroBench, a benchmark for evaluating vision-language models (VLMs) across seven agricultural topics. AgroBench covers a state-of-the-art range of categories, including 203 crop categories and 682 disease categories, to thoroughly evaluate VLM capabilities.
arXiv Detail & Related papers (2025-07-28T04:58:29Z) - Leveraging Synthetic Data for Question Answering with Multilingual LLMs in the Agricultural Domain [1.0144032120138065]
This study generates multilingual (English, Hindi, Punjabi) synthetic datasets from agriculture-specific documents from India. Evaluation on human-created datasets demonstrates significant improvements in factuality, relevance, and agricultural consensus.
arXiv Detail & Related papers (2025-07-22T19:25:10Z) - InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models [139.19991097260115]
We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state of the art among open-source MLLMs. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.
arXiv Detail & Related papers (2025-04-14T17:59:25Z) - Agri-LLaVA: Knowledge-Infused Large Multimodal Assistant on Agricultural Pests and Diseases [49.782064512495495]
We construct the first multimodal instruction-following dataset in the agricultural domain. This dataset covers over 221 types of pests and diseases with approximately 400,000 data entries. We propose a knowledge-infused training method to develop Agri-LLaVA, an agricultural multimodal conversation system.
arXiv Detail & Related papers (2024-12-03T04:34:23Z) - VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models [66.56298924208319]
Vision-language generative reward models (VL-GenRMs) play a crucial role in aligning and evaluating multimodal AI systems. Current assessment methods primarily rely on AI-annotated preference labels from traditional tasks. We introduce VL-RewardBench, a comprehensive benchmark spanning general multimodal queries, visual hallucination detection, and complex reasoning tasks.
arXiv Detail & Related papers (2024-11-26T14:08:34Z) - Large Language Models as Automated Aligners for Benchmarking Vision-Language Models [48.4367174400306]
Vision-Language Models (VLMs) have reached a new level of sophistication, showing notable competence in executing intricate cognition and reasoning tasks.
Existing evaluation benchmarks, primarily relying on rigid, hand-crafted datasets, face significant limitations in assessing the alignment of these increasingly anthropomorphic models with human intelligence.
In this work, we address these limitations via Auto-Bench, which explores LLMs as proficient curators, measuring the alignment between VLMs and human intelligence and values through automatic data curation and assessment.
arXiv Detail & Related papers (2023-11-24T16:12:05Z) - LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge [58.82222646803248]
Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals.
Most existing MLLMs adopt vision encoders pretrained on coarsely aligned image-text pairs, leading to insufficient extraction and reasoning of visual knowledge.
We propose a dual-Level vIsual knOwledge eNhanced Multimodal Large Language Model (LION), which empowers the MLLM by injecting visual knowledge at two levels.
arXiv Detail & Related papers (2023-11-20T15:56:44Z)
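The Tri-RAG framework mentioned in the AgriGPT entry above combines dense retrieval, sparse retrieval, and multi-hop knowledge-graph reasoning. Below is a minimal late-fusion sketch of such a three-channel retriever; the score normalization, fusion weights, and data types are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of three-channel retrieval fusion in the spirit of Tri-RAG.
# Channel scoring, normalization, and weights are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class Hit:
    doc_id: str
    score: float  # channel-native relevance score

def normalize(hits: list[Hit]) -> dict[str, float]:
    """Min-max normalize one channel so scales are comparable across channels."""
    if not hits:
        return {}
    lo = min(h.score for h in hits)
    span = (max(h.score for h in hits) - lo) or 1.0
    return {h.doc_id: (h.score - lo) / span for h in hits}

def tri_rag_fuse(dense: list[Hit], sparse: list[Hit], kg: list[Hit],
                 weights=(0.4, 0.3, 0.3), top_k=5) -> list[str]:
    """Weighted late fusion of dense, sparse (e.g. BM25), and knowledge-graph
    retrieval results; returns the ids of the top_k fused documents."""
    fused: dict[str, float] = {}
    for w, channel in zip(weights, (dense, sparse, kg)):
        for doc_id, s in normalize(channel).items():
            fused[doc_id] = fused.get(doc_id, 0.0) + w * s
    return sorted(fused, key=fused.get, reverse=True)[:top_k]
```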