Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind
- URL: http://arxiv.org/abs/2505.12207v1
- Date: Sun, 18 May 2025 02:45:19 GMT
- Title: Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind
- Authors: Qingmei Li, Yang Zhang, Zurong Mai, Yuhang Chen, Shuohong Lou, Henglian Huang, Jiarui Zhang, Zhiwei Zhang, Yibin Wen, Weijia Li, Haohuan Fu, Jianxi Huang, Juepeng Zheng
- Abstract summary: We introduce AgroMind, a benchmark for agricultural remote sensing (RS).
AgroMind covers four task dimensions: spatial perception, object understanding, scene understanding, and scene reasoning.
We evaluate 18 open-source LMMs and 3 closed-source models on AgroMind.
- Score: 16.96145027280737
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Multimodal Models (LMMs) have demonstrated capabilities across various domains, but comprehensive benchmarks for agricultural remote sensing (RS) remain scarce. Existing benchmarks designed for agricultural RS scenarios exhibit notable limitations, primarily insufficient scene diversity in the dataset and oversimplified task design. To bridge this gap, we introduce AgroMind, a comprehensive agricultural remote sensing benchmark covering four task dimensions: spatial perception, object understanding, scene understanding, and scene reasoning, with a total of 13 task types ranging from crop identification and health monitoring to environmental analysis. We curate a high-quality evaluation set by integrating eight public datasets and one private farmland plot dataset, containing 25,026 QA pairs and 15,556 images. The pipeline begins with multi-source data preprocessing, including collection, format standardization, and annotation refinement. We then generate a diverse set of agriculturally relevant questions through the systematic definition of tasks. Finally, we employ LMMs for inference, generate responses, and perform detailed examinations. We evaluate 18 open-source LMMs and 3 closed-source models on AgroMind. Experiments reveal significant performance gaps, particularly in spatial reasoning and fine-grained recognition; notably, human performance lags behind several leading LMMs. By establishing a standardized evaluation framework for agricultural RS, AgroMind reveals the limitations of LMMs in domain knowledge and highlights critical challenges for future work. Data and code can be accessed at https://rssysu.github.io/AgroMind/.
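The abstract describes a three-stage pipeline (data preprocessing, question generation, then LMM inference and scoring). As a rough illustration of the final scoring stage only, here is a minimal Python sketch for AgroMind-style multiple-choice QA pairs; the JSONL field names (`image`, `question`, `options`, `answer`, `task`) and the `query_lmm` stub are assumptions for illustration, not the authors' actual harness or data format.

```python
import json
import re
from collections import defaultdict

def query_lmm(image_path: str, prompt: str) -> str:
    """Stub for a model call; replace with a real LMM client.

    Returns the model's raw text reply."""
    raise NotImplementedError

def score_benchmark(qa_file: str) -> dict:
    """Score multiple-choice QA pairs, reporting per-task accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    with open(qa_file) as f:
        for line in f:
            item = json.loads(line)  # assumed: one QA pair per JSONL line
            options = "\n".join(f"{k}. {v}" for k, v in item["options"].items())
            prompt = (f"{item['question']}\n{options}\n"
                      "Answer with the option letter only.")
            reply = query_lmm(item["image"], prompt)
            # extract the first standalone option letter from the reply
            match = re.search(r"\b([A-D])\b", reply.upper())
            pred = match.group(1) if match else ""
            task = item.get("task", "overall")
            total[task] += 1
            correct[task] += int(pred == item["answer"])
    return {task: correct[task] / total[task] for task in total}
```

Reporting accuracy per task dimension, rather than a single aggregate, is what makes gaps such as the spatial-reasoning weakness the paper highlights directly visible.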
Related papers
- AgriEval: A Comprehensive Chinese Agricultural Benchmark for Large Language Models [19.265932725554833]
We propose AgriEval, the first comprehensive Chinese agricultural benchmark with three main characteristics.
AgriEval covers six major agriculture categories and 29 subcategories within agriculture, addressing four core cognitive scenarios.
AgriEval comprises 14,697 multiple-choice questions and 2,167 open-ended questions, establishing it as the most extensive agricultural benchmark available to date.
arXiv Detail & Related papers (2025-07-29T12:58:27Z)
- AgroBench: Vision-Language Model Benchmark in Agriculture [25.52955831089068]
We introduce AgroBench, a benchmark for evaluating vision-language models (VLMs) across seven agricultural topics.
Our AgroBench covers a state-of-the-art range of categories, including 203 crop categories and 682 disease categories, to thoroughly evaluate VLM capabilities.
arXiv Detail & Related papers (2025-07-28T04:58:29Z)
- Multimodal Agricultural Agent Architecture (MA3): A New Paradigm for Intelligent Agricultural Decision-Making [32.62816270192696]
Modern agriculture faces dual challenges: optimizing production efficiency and achieving sustainable development.
To address these challenges, this study proposes an innovative Multimodal Agricultural Agent Architecture (MA3).
This study constructs a multimodal agricultural agent dataset encompassing five major tasks: classification, detection, Visual Question Answering (VQA), tool selection, and agent evaluation.
arXiv Detail & Related papers (2025-04-07T07:32:41Z)
- Composed Multi-modal Retrieval: A Survey of Approaches and Applications [81.54640206021757]
Composed Multi-modal Retrieval (CMR) emerges as a pivotal next-generation technology.
CMR enables users to query images or videos by integrating a reference visual input with textual modifications.
This paper provides a comprehensive survey of CMR, covering its fundamental challenges, technical advancements, and applications.
arXiv Detail & Related papers (2025-03-03T09:18:43Z)
- OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain [62.89809156574998]
We introduce an omnidirectional and automatic RAG benchmark, OmniEval, in the financial domain.
Our benchmark is characterized by its multi-dimensional evaluation framework.
Our experiments demonstrate the comprehensiveness of OmniEval, which includes extensive test datasets.
arXiv Detail & Related papers (2024-12-17T15:38:42Z)
- Agri-LLaVA: Knowledge-Infused Large Multimodal Assistant on Agricultural Pests and Diseases [49.782064512495495]
We construct the first multimodal instruction-following dataset in the agricultural domain.
This dataset covers over 221 types of pests and diseases with approximately 400,000 data entries.
We propose a knowledge-infused training method to develop Agri-LLaVA, an agricultural multimodal conversation system.
arXiv Detail & Related papers (2024-12-03T04:34:23Z)
- AgriBench: A Hierarchical Agriculture Benchmark for Multimodal Large Language Models [4.12825661607328]
AgriBench is the first benchmark designed to evaluate MultiModal Large Language Models (MM-LLMs) for agriculture applications.
We propose MM-LUCAS, a multimodal agriculture dataset that includes 1,784 landscape images, segmentation masks, depth maps, and detailed annotations.
This work presents a groundbreaking perspective on advancing agricultural MM-LLMs; it is still in progress and offers valuable insights for future developments and innovations in expert knowledge-based MM-LLMs.
arXiv Detail & Related papers (2024-11-30T12:59:03Z)
- LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models [55.903148392998965]
We introduce LOKI, a novel benchmark designed to evaluate the ability of LMMs to detect synthetic data across multiple modalities.
The benchmark includes coarse-grained judgment and multiple-choice questions, as well as fine-grained anomaly selection and explanation tasks.
We evaluate 22 open-source LMMs and 6 closed-source models on LOKI, highlighting their potential as synthetic data detectors and also revealing some limitations in the development of LMM capabilities.
arXiv Detail & Related papers (2024-10-13T05:26:36Z)
- AgroGPT: Efficient Agricultural Vision-Language Model with Expert Tuning [30.034193330398292]
We propose an approach to construct instruction-tuning data that harnesses vision-only data for the agriculture domain.
We utilize diverse agricultural datasets spanning multiple domains, curate class-specific information, and employ large language models (LLMs) to construct an expert-tuning set.
With this set, we expert-tune AgroGPT, an efficient LMM that can hold complex agriculture-related conversations and provide useful insights.
arXiv Detail & Related papers (2024-10-10T22:38:26Z)
- UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios [60.492736455572015]
We present UrBench, a benchmark designed for evaluating LMMs in complex multi-view urban scenarios.
UrBench contains 11.6K meticulously curated questions at both region-level and role-level.
Our evaluations on 21 LMMs show that current LMMs struggle with urban environments in several aspects.
arXiv Detail & Related papers (2024-08-30T13:13:35Z)
- R-Eval: A Unified Toolkit for Evaluating Domain Knowledge of Retrieval Augmented Large Language Models [51.468732121824125]
Large language models have achieved remarkable success on general NLP tasks, but they may fall short for domain-specific problems.
Existing evaluation tools only provide a few baselines and evaluate them on various domains without mining the depth of domain knowledge.
In this paper, we address the challenges of evaluating RALLMs by introducing R-Eval, a Python toolkit designed to streamline the evaluation of different RAG workflows.
arXiv Detail & Related papers (2024-06-17T15:59:49Z)
- PhenoBench -- A Large Dataset and Benchmarks for Semantic Image Interpretation in the Agricultural Domain [29.395926321984565]
We present an annotated dataset and benchmarks for the semantic interpretation of real agricultural fields.
Our dataset, recorded with a UAV, provides high-quality, pixel-wise annotations of crops and weeds, as well as crop leaf instances.
We provide benchmarks for various tasks on a hidden test set comprised of different fields.
arXiv Detail & Related papers (2023-06-07T16:04:08Z)