Related papers: AgriCoT: A Chain-of-Thought Benchmark for Evaluating Reasoning in Vision-Language Models for Agriculture

URL: http://arxiv.org/abs/2511.23253v1
Date: Fri, 28 Nov 2025 15:02:19 GMT
Title: AgriCoT: A Chain-of-Thought Benchmark for Evaluating Reasoning in Vision-Language Models for Agriculture
Authors: Yibin Wen, Qingmei Li, Zi Ye, Jiarui Zhang, Jing Wu, Zurong Mai, Shuohong Lou, Yuhang Chen, Henglian Huang, Xiaoya Fan, Yang Zhang, Lingyuan Zhao, Haohuan Fu, Huang Jianxi, Juepeng Zheng
Abstract summary: We introduce AgriCoT, a VQA dataset that incorporates Chain-of-Thought (CoT) reasoning, specifically designed to evaluate the reasoning capabilities of Vision-Language Models (VLMs). With 4,535 carefully curated samples, AgriCoT offers a comprehensive and robust evaluation of reasoning abilities. Our evaluations, conducted with 26 representative VLMs, reveal a notable and significant gap in their reasoning capabilities.
Score: 20.836370409464916
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Recent advancements in Vision-Language Models (VLMs) have significantly transformed various industries. In agriculture, these dual-modal capabilities offer promising applications such as precision farming, crop monitoring, pest detection, and environmental sustainability. While several Visual Question Answering (VQA) datasets and benchmarks have been developed to evaluate VLM performance, they often fail to adequately assess the critical reasoning and problem-solving skills required in complex agricultural contexts. To address this gap, we introduce AgriCoT, a VQA dataset that incorporates Chain-of-Thought (CoT) reasoning, specifically designed to evaluate the reasoning capabilities of VLMs. With 4,535 carefully curated samples, AgriCoT offers a comprehensive and robust evaluation of reasoning abilities for VLMs, particularly in zero-shot scenarios, by focusing on their capacity to engage in logical reasoning and effective problem-solving. Our evaluations, conducted with 26 representative VLMs, including both proprietary and open-source models, reveal that while some proprietary models excel at answering questions, there is a notable and significant gap in their reasoning capabilities. This underscores the importance of incorporating CoT for more precise and effective assessments. Our dataset is available at https://huggingface.co/datasets/wenyb/AgriCoT.

Related papers

VeriSciQA: An Auto-Verified Dataset for Scientific Visual Question Answering [53.662676566188175]
A key bottleneck lies in the lack of public, large-scale, high-quality Scientific Visual Question Answering (SVQA) datasets. We propose a verification-centric Generate-then-Verify framework that first generates QA pairs with figure-associated textual context. We instantiate this framework to curate VeriSciQA, a dataset of 20,351 QA pairs spanning 20 scientific domains and 12 figure types.
arXiv Detail & Related papers (2025-11-25T04:14:52Z)
Long Grounded Thoughts: Distilling Compositional Visual Reasoning Chains at Scale [70.23466957404891]
We introduce a new reasoning data generation framework spanning diverse skills and levels of complexity with over 1M high-quality synthetic vision-centric questions. We show that finetuning Qwen2.5-VL-7B on our data outperforms all open-data baselines across all evaluated vision-centric benchmarks.
arXiv Detail & Related papers (2025-11-07T20:50:54Z)
Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind [16.96145027280737]
We introduce AgroMind, a benchmark for agricultural remote sensing (RS). AgroMind covers four task dimensions: spatial perception, object understanding, scene understanding, and scene reasoning. We evaluate 20 open-source LMMs and 4 closed-source models on AgroMind.
arXiv Detail & Related papers (2025-05-18T02:45:19Z)
Multimodal Agricultural Agent Architecture (MA3): A New Paradigm for Intelligent Agricultural Decision-Making [32.62816270192696]
Modern agriculture faces dual challenges: optimizing production efficiency and achieving sustainable development. To address these challenges, this study proposes an innovative Multimodal Agricultural Agent Architecture (MA3). This study constructs a multimodal agricultural agent dataset encompassing five major tasks: classification, detection, Visual Question Answering (VQA), tool selection, and agent evaluation.
arXiv Detail & Related papers (2025-04-07T07:32:41Z)
Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving [45.35559773691414]
VLADBench spans 5 key domains: Traffic Knowledge Understanding, General Element Recognition, Traffic Graph Generation, Target Attribute, and Decision-Making and Planning. A thorough assessment of general and domain-specific (DS) VLMs on this benchmark reveals both their strengths and critical limitations in AD contexts. The experimental results demonstrate that the proposed benchmark provides a crucial step toward a more comprehensive assessment of VLMs in AD, paving the way for the development of more cognitively sophisticated and reasoning-capable AD systems.
arXiv Detail & Related papers (2025-03-27T13:45:47Z)
Adaptive Distraction: Probing LLM Contextual Robustness with Automated Tree Search [76.54475437069395]
Large Language Models (LLMs) often struggle to maintain their original performance when faced with semantically coherent but task-irrelevant contextual information. We propose a dynamic distraction generation framework based on tree search, where the generation process is guided by model behavior.
arXiv Detail & Related papers (2025-02-03T18:43:36Z)
Retrieval-Based Interleaved Visual Chain-of-Thought in Real-World Driving Scenarios [69.00444996464662]
We propose RIV-CoT, a Retrieval-Based Interleaved Visual Chain-of-Thought method that enables vision-language models to reason using visual crops corresponding to relevant entities. Our experiments demonstrate that RIV-CoT improves answer accuracy by 3.1% and reasoning accuracy by 4.6% over vanilla CoT prompting.
arXiv Detail & Related papers (2025-01-08T18:31:16Z)
OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain [62.89809156574998]
We introduce an omnidirectional and automatic RAG benchmark, OmniEval, in the financial domain. Our benchmark is characterized by its multi-dimensional evaluation framework. Our experiments demonstrate the comprehensiveness of OmniEval, which includes extensive test datasets.
arXiv Detail & Related papers (2024-12-17T15:38:42Z)
AutoBench-V: Can Large Vision-Language Models Benchmark Themselves? [65.92331309449015]
We introduce AutoBench-V, an automated framework for serving evaluation on demand, i.e., benchmarking LVLMs based on specific aspects of model capability. Through an extensive evaluation of nine popular LVLMs across five demanded user inputs, the framework shows effectiveness and reliability.
arXiv Detail & Related papers (2024-10-28T17:55:08Z)
Context is Key: A Benchmark for Forecasting with Essential Textual Information [87.3175915185287]
"Context is Key" (CiK) is a forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context. We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters. We propose a simple yet effective LLM prompting method that outperforms all other tested methods on our benchmark.
arXiv Detail & Related papers (2024-10-24T17:56:08Z)
Towards Flexible Evaluation for Generative Visual Question Answering [17.271448204525612]
This paper proposes the use of semantics-based evaluators for assessing unconstrained open-ended responses on Visual Question Answering (VQA) datasets. In addition, this paper proposes a Semantically Flexible VQA Evaluator (SFVE) with meticulous design based on the unique features of VQA evaluation.
arXiv Detail & Related papers (2024-08-01T05:56:34Z)
Leveraging Vision Language Models for Specialized Agricultural Tasks [19.7240633020344]
We present AgEval, a benchmark for assessing Vision Language Models' capabilities in plant stress phenotyping. Our study explores how general-purpose VLMs can be leveraged for domain-specific tasks with only a few annotated examples. Our results demonstrate VLMs' rapid adaptability to specialized tasks, with the best-performing model showing an increase in F1 scores from 46.24% to 73.37% in 8-shot identification.
arXiv Detail & Related papers (2024-07-29T00:39:51Z)
KNVQA: A Benchmark for evaluation knowledge-based VQA [8.602776661652083]
Large vision-language models (LVLMs) have made significant progress due to their strong perception and reasoning capabilities in the visual and language systems. LVLMs are still plagued by the two critical issues of object hallucination and factual accuracy, which limit the practicality of LVLMs in different scenarios. We propose a novel KNVQA-Eval, which is devoted to knowledge-based VQA task evaluation to reflect the factuality of multimodal LVLMs.
arXiv Detail & Related papers (2023-11-21T14:39:18Z)
Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models [61.28463542324576]
Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can generate human-like outputs. We evaluate existing state-of-the-art VLMs and find that even the best-performing model is unable to demonstrate strong visual reasoning capabilities and consistency. We propose a two-stage training framework aimed at improving both the reasoning performance and consistency of VLMs.
arXiv Detail & Related papers (2023-09-08T17:49:44Z)

This list is automatically generated from the titles and abstracts of the papers on this site.