Related papers: MIRAGE: A Benchmark for Multimodal Information-Seeking and Reasoning in Agricultural Expert-Guided Conversations

MIRAGE: A Benchmark for Multimodal Information-Seeking and Reasoning in Agricultural Expert-Guided Conversations

URL: http://arxiv.org/abs/2506.20100v1
Date: Wed, 25 Jun 2025 03:07:54 GMT
Title: MIRAGE: A Benchmark for Multimodal Information-Seeking and Reasoning in Agricultural Expert-Guided Conversations
Authors: Vardhan Dongre, Chi Gui, Shubham Garg, Hooshang Nayyeri, Gokhan Tur, Dilek Hakkani-Tür, Vikram S. Adve,
Abstract summary: MIRAGE captures the full complexity of expert consultations by combining natural user queries, expert-authored responses, and image-based context.<n>Grounded in over 35,000 real user-expert interactions, MIRAGE spans diverse crop health, pest diagnosis, and crop management scenarios.<n>The benchmark includes more than 7,000 unique biological entities, covering plant species, pests, and diseases.
Score: 9.649908672930815
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce MIRAGE, a new benchmark for multimodal expert-level reasoning and decision-making in consultative interaction settings. Designed for the agriculture domain, MIRAGE captures the full complexity of expert consultations by combining natural user queries, expert-authored responses, and image-based context, offering a high-fidelity benchmark for evaluating models on grounded reasoning, clarification strategies, and long-form generation in a real-world, knowledge-intensive domain. Grounded in over 35,000 real user-expert interactions and curated through a carefully designed multi-step pipeline, MIRAGE spans diverse crop health, pest diagnosis, and crop management scenarios. The benchmark includes more than 7,000 unique biological entities, covering plant species, pests, and diseases, making it one of the most taxonomically diverse benchmarks available for vision-language models, grounded in the real world. Unlike existing benchmarks that rely on well-specified user inputs and closed-set taxonomies, MIRAGE features underspecified, context-rich scenarios with open-world settings, requiring models to infer latent knowledge gaps, handle rare entities, and either proactively guide the interaction or respond. Project Page: https://mirage-benchmark.github.io

Related papers

IM-Chat: A Multi-agent LLM-based Framework for Knowledge Transfer in Injection Molding Industry [1.3369318110920576]
This study introduces IM-Chat, a multi-agent framework based on large language models (LLMs)<n> IM-Chat integrates both limited documented knowledge (e.g., troubleshooting tables, manuals) and extensive field data modeled through a data-driven process condition generator.<n>Performance was assessed across 100 single-tool and 60 hybrid tasks for GPT-4o, GPT-4o-mini, and GPT-3.5-turboizable.
arXiv Detail & Related papers (2025-07-21T06:13:53Z)
MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation [56.87891213797931]
We present MTR-Bench for Large Language Models' Multi-Turn Reasoning evaluation.<n>Comprising 4 classes, 40 tasks, and 3600 instances, MTR-Bench covers diverse reasoning capabilities.<n>MTR-Bench features fully-automated framework spanning both dataset constructions and model evaluations.
arXiv Detail & Related papers (2025-05-21T17:59:12Z)
Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind [16.96145027280737]
We introduce AgroMind, a benchmark for agricultural remote sensing (RS)<n>AgroMind covers four task dimensions: spatial perception, object understanding, scene understanding, and scene reasoning.<n>We evaluate 18 open-source LMMs and 3 closed-source models on AgroMind.
arXiv Detail & Related papers (2025-05-18T02:45:19Z)
A Multimodal Benchmark Dataset and Model for Crop Disease Diagnosis [5.006697347461899]
We present the crop disease domain multimodal dataset, a pioneering resource designed to advance the field of agricultural research.<n>The dataset comprises 137,000 images of various crop diseases, accompanied by 1 million question-answer pairs that span a broad spectrum of agricultural knowledge.<n>We demonstrate the utility of the dataset by finetuning state-of-the-art multimodal models, showcasing significant improvements in crop disease diagnosis.
arXiv Detail & Related papers (2025-03-10T06:37:42Z)
The BrowserGym Ecosystem for Web Agent Research [151.90034093362343]
BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents.<n>We propose an extended BrowserGym-based ecosystem for web agent research, which unifies existing benchmarks from the literature.<n>We conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across 6 popular web agent benchmarks.
arXiv Detail & Related papers (2024-12-06T23:43:59Z)
MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs)<n>MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.<n>It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z)
Multi-Source Knowledge Pruning for Retrieval-Augmented Generation: A Benchmark and Empirical Study [46.55831783809377]
Retrieval-augmented generation (RAG) is increasingly recognized as an effective approach to mitigating the hallucination of large language models (LLMs)<n>We develop PruningRAG, a plug-and-play RAG framework that uses multi-granularity pruning strategies to more effectively incorporate relevant context and mitigate the negative impact of misleading information.
arXiv Detail & Related papers (2024-09-03T03:31:37Z)
GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models [56.63218531256961]
We introduce GenBench, a benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models. GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies. We provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance.
arXiv Detail & Related papers (2024-06-01T08:01:05Z)
Retrieval Meets Reasoning: Even High-school Textbook Knowledge Benefits Multimodal Reasoning [49.3242278912771]
We introduce a novel multimodal RAG framework named RMR (Retrieval Meets Reasoning) The RMR framework employs a bi-modal retrieval module to identify the most relevant question-answer pairs. It significantly boosts the performance of various vision-language models across a spectrum of benchmark datasets.
arXiv Detail & Related papers (2024-05-31T14:23:49Z)
Multiple Expert Brainstorming for Domain Adaptive Person Re-identification [140.3998019639158]
We propose a multiple expert brainstorming network (MEB-Net) for domain adaptive person re-ID. MEB-Net adopts a mutual learning strategy, where multiple networks with different architectures are pre-trained within a source domain. Experiments on large-scale datasets demonstrate the superior performance of MEB-Net over the state-of-the-arts.
arXiv Detail & Related papers (2020-07-03T08:16:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.