GPT-4 as Evaluator: Evaluating Large Language Models on Pest Management in Agriculture
- URL: http://arxiv.org/abs/2403.11858v1
- Date: Mon, 18 Mar 2024 15:08:01 GMT
- Title: GPT-4 as Evaluator: Evaluating Large Language Models on Pest Management in Agriculture
- Authors: Shanglong Yang, Zhipeng Yuan, Shunbao Li, Ruoling Peng, Kang Liu, Po Yang
- Abstract summary: The application of large language models (LLMs) in agriculture, particularly in pest management, remains nascent.
We aimed to prove the feasibility by evaluating the content of the pest management advice generated by LLMs, including the Generative Pre-trained Transformer (GPT) series from OpenAI and the FLAN series from Google.
We proposed an innovative approach, using GPT-4 as an evaluator, to score the generated content on Coherence, Logical Consistency, Fluency, Relevance, Comprehensibility, and Exhaustiveness.
- Score: 7.458004824488893
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the rapidly evolving field of artificial intelligence (AI), the application of large language models (LLMs) in agriculture, particularly in pest management, remains nascent. We aimed to prove the feasibility by evaluating the content of the pest management advice generated by LLMs, including the Generative Pre-trained Transformer (GPT) series from OpenAI and the FLAN series from Google. Considering the context-specific properties of agricultural advice, automatically measuring or quantifying the quality of text generated by LLMs becomes a significant challenge. We proposed an innovative approach, using GPT-4 as an evaluator, to score the generated content on Coherence, Logical Consistency, Fluency, Relevance, Comprehensibility, and Exhaustiveness. Additionally, we integrated an expert system based on crop threshold data as a baseline to obtain scores for Factual Accuracy on whether management action should be taken against pests found in crop fields. Each model's score was weighted by percentage to obtain a final score. The results showed that GPT-3.5 and GPT-4 outperform the FLAN models in most evaluation categories. Furthermore, the use of instruction-based prompting containing domain-specific knowledge proved the feasibility of LLMs as an effective tool in agriculture, with an accuracy rate of 72%, demonstrating LLMs' effectiveness in providing pest management suggestions.
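The weighting step described in the abstract can be illustrated with a minimal sketch. The abstract states only that each model's scores were "weighted by percentage"; it gives neither the percentages nor the score scale, so the weights, the function name `final_score`, and the criterion keys below are all hypothetical placeholders rather than the authors' actual scheme.

```python
# Hypothetical percentage weights (must sum to 100). The abstract does not
# publish the real percentages, so these values are placeholders only.
CRITERIA_WEIGHTS = {
    "coherence": 10,
    "logical_consistency": 10,
    "fluency": 10,
    "relevance": 10,
    "comprehensibility": 10,
    "exhaustiveness": 10,
    "factual_accuracy": 40,
}


def final_score(scores, weights=CRITERIA_WEIGHTS):
    """Combine per-criterion scores (all on one shared scale) into a weighted average."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    # Weighted average: each criterion contributes in proportion to its percentage.
    return sum(weights[name] * scores[name] for name in weights) / total_weight


# Example call with made-up scores on an assumed 1-5 scale.
example_scores = {name: 4 for name in CRITERIA_WEIGHTS}
example_scores["factual_accuracy"] = 1
print(final_score(example_scores))
```

In this sketch a low Factual Accuracy score pulls the final score down sharply because it carries the largest assumed weight; a different weighting would change that behaviour.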
Related papers
- Automatic Evaluation for Text-to-image Generation: Task-decomposed Framework, Distilled Training, and Meta-evaluation Benchmark [62.58869921806019]
We propose a task decomposition evaluation framework based on GPT-4o to automatically construct a new training dataset.
We design innovative training strategies to effectively distill GPT-4o's evaluation capabilities into a 7B open-source MLLM, MiniCPM-V-2.6.
Experimental results demonstrate that our distilled open-source MLLM significantly outperforms the current state-of-the-art GPT-4o-base baseline.
arXiv Detail & Related papers (2024-11-23T08:06:06Z)
- A Large-Scale Study of Relevance Assessments with Large Language Models: An Initial Look [52.114284476700874]
This paper reports on the results of a large-scale evaluation (the TREC 2024 RAG Track) where four different relevance assessment approaches were deployed.
We find that automatically generated UMBRELA judgments can replace fully manual judgments to accurately capture run-level effectiveness.
Surprisingly, we find that LLM assistance does not appear to increase correlation with fully manual assessments, suggesting that costs associated with human-in-the-loop processes do not bring obvious tangible benefits.
arXiv Detail & Related papers (2024-11-13T01:12:35Z)
- LLMs for Enhanced Agricultural Meteorological Recommendations [0.0]
Agricultural meteorological recommendations are crucial for enhancing crop productivity and sustainability by providing farmers with actionable insights based on weather forecasts, soil conditions, and crop-specific data.
This paper presents a novel approach that leverages large language models (LLMs) and prompt engineering to improve the accuracy and relevance of these recommendations.
arXiv Detail & Related papers (2024-07-30T18:10:49Z)
- Enhancing Agricultural Machinery Management through Advanced LLM Integration [0.7366405857677226]
The integration of artificial intelligence into agricultural practices has the potential to revolutionize efficiency and sustainability in farming.
This paper introduces a novel approach that leverages large language models (LLMs), particularly GPT-4, to enhance decision-making processes in agricultural machinery management.
arXiv Detail & Related papers (2024-07-30T06:49:55Z)
- SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors [64.9938658716425]
Existing evaluations of large language models' (LLMs) ability to recognize and reject unsafe user requests face three limitations.
First, existing methods often use coarse-grained taxonomies of unsafe topics and over-represent some fine-grained topics.
Second, linguistic characteristics and formatting of prompts are often overlooked, like different languages, dialects, and more -- which are only implicitly considered in many evaluations.
Third, existing evaluations rely on large LLMs for evaluation, which can be expensive.
arXiv Detail & Related papers (2024-06-20T17:56:07Z)
- KGPA: Robustness Evaluation for Large Language Models via Cross-Domain Knowledge Graphs [5.798411590796167]
This paper proposes a framework that systematically evaluates the robustness of large language models under adversarial attack scenarios.
Our framework generates original prompts from the triplets of knowledge graphs and creates adversarial prompts by poisoning.
Experiments show that adversarial robustness of the ChatGPT family ranks as GPT-4-turbo > GPT-4o > GPT-3.5-turbo, and the robustness of large language models is influenced by the professional domains in which they operate.
arXiv Detail & Related papers (2024-06-16T04:48:43Z)
- KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models [53.84677081899392]
KIEval is a Knowledge-grounded Interactive Evaluation framework for large language models.
It incorporates an LLM-powered "interactor" role for the first time to accomplish a dynamic contamination-resilient evaluation.
Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization.
arXiv Detail & Related papers (2024-02-23T01:30:39Z)
- Biomedical knowledge graph-optimized prompt generation for large language models [1.6658478064349376]
Large Language Models (LLMs) are being adopted at an unprecedented rate, yet still face challenges in knowledge-intensive domains like biomedicine.
Here, we introduce a token-optimized and robust Knowledge Graph-based Retrieval Augmented Generation framework.
arXiv Detail & Related papers (2023-11-29T03:07:00Z)
- Large Language Models as Automated Aligners for benchmarking Vision-Language Models [48.4367174400306]
Vision-Language Models (VLMs) have reached a new level of sophistication, showing notable competence in executing intricate cognition and reasoning tasks.
Existing evaluation benchmarks, primarily relying on rigid, hand-crafted datasets, face significant limitations in assessing the alignment of these increasingly anthropomorphic models with human intelligence.
In this work, we address these limitations via Auto-Bench, which explores LLMs as proficient curators, measuring the alignment between VLMs and human intelligence and values through automatic data curation and assessment.
arXiv Detail & Related papers (2023-11-24T16:12:05Z)
- GPT-4 as an Agronomist Assistant? Answering Agriculture Exams Using Large Language Models [1.3999521658236698]
Large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding across various domains.
We present a comprehensive evaluation of popular LLMs, such as Llama 2 and GPT, on their ability to answer agriculture-related questions.
We selected agriculture exams and benchmark datasets from three of the largest agriculture producer countries: Brazil, India, and the USA.
arXiv Detail & Related papers (2023-10-10T00:39:04Z)
- LLMs for Knowledge Graph Construction and Reasoning: Recent Capabilities and Future Opportunities [66.36633042421387]
This paper evaluates Large Language Models (LLMs) for Knowledge Graph (KG) construction and reasoning.
We propose AutoKG, a multi-agent-based approach employing LLMs and external sources for KG construction and reasoning.
arXiv Detail & Related papers (2023-05-22T15:56:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.