AgriGPT-Omni: A Unified Speech-Vision-Text Framework for Multilingual Agricultural Intelligence
- URL: http://arxiv.org/abs/2512.10624v1
- Date: Thu, 11 Dec 2025 13:24:40 GMT
- Title: AgriGPT-Omni: A Unified Speech-Vision-Text Framework for Multilingual Agricultural Intelligence
- Authors: Bo Yang, Lanfei Feng, Yunkui Chen, Yu Zhang, Jianyu Zhang, Xiao Xu, Nueraili Aierken, Shijian Li,
- Abstract summary: AgriGPT- Omni is an agricultural omni-framework that integrates speech, vision, and text in a unified framework.<n>We construct a scalable data synthesis and collection pipeline that converts agricultural texts and images into training data.<n>We train the first agricultural omni-model via a three-stage paradigm: textual knowledge injection, progressive multimodal alignment, and GRPO-based reinforcement learning.
- Score: 13.233457989297358
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite rapid advances in multimodal large language models, agricultural applications remain constrained by the lack of multilingual speech data, unified multimodal architectures, and comprehensive evaluation benchmarks. To address these challenges, we present AgriGPT-Omni, an agricultural omni-framework that integrates speech, vision, and text in a unified framework. First, we construct a scalable data synthesis and collection pipeline that converts agricultural texts and images into training data, resulting in the largest agricultural speech dataset to date, including 492K synthetic and 1.4K real speech samples across six languages. Second, based on this, we train the first agricultural omni-model via a three-stage paradigm: textual knowledge injection, progressive multimodal alignment, and GRPO-based reinforcement learning, enabling unified reasoning across languages and modalities. Third, we propose AgriBench-Omni-2K, the first tri-modal benchmark for agriculture, covering diverse speech-vision-text tasks and multilingual slices, with standardized protocols and reproducible tools. Experiments show that AgriGPT-Omni significantly outperforms general-purpose baselines on multilingual and multimodal reasoning as well as real-world speech understanding. All models, data, benchmarks, and code will be released to promote reproducible research, inclusive agricultural intelligence, and sustainable AI development for low-resource regions.
Related papers
- AgriGPT-VL: Agricultural Vision-Language Understanding Suite [12.521000582108888]
AgriGPT-VL Suite is a unified multimodal framework for agriculture.<n>We introduce Agri-3M-VL, the largest vision-language corpus for agriculture to our knowledge.<n>Second, we develop AgriGPT-VL, an agriculture-specialized vision-language model trained via a progressive curriculum of textual grounding.<n>Third, we establish AgriBench-VL-4K, a compact yet challenging evaluation suite with open-ended and image-grounded questions.
arXiv Detail & Related papers (2025-10-05T02:30:11Z) - A Multimodal Conversational Assistant for the Characterization of Agricultural Plots from Geospatial Open Data [0.0]
This study presents an open-source conversational assistant that integrates multimodal retrieval and large language models (LLMs)<n>The proposed architecture combines orthophotos, Sentinel-2 vegetation indices, and user-provided documents through retrieval-augmented generation (RAG)<n>Preliminary results show that the system is capable of generating clear, relevant, and context-aware responses to agricultural queries.
arXiv Detail & Related papers (2025-09-22T09:02:53Z) - AgriDoctor: A Multimodal Intelligent Assistant for Agriculture [45.77373971125537]
AgriDoctor is a modular and multimodal framework designed for intelligent crop disease diagnosis and agricultural knowledge interaction.<n>To facilitate effective training and evaluation, we construct AgriMM, a benchmark comprising 400000 annotated disease images, 831 expert-curated knowledge entries, and 300000 bilingual prompts for intent-driven tool selection.<n>Experiments demonstrate that AgriDoctor, trained on AgriMM, significantly outperforms state-of-the-art LVLMs on fine-grained agricultural tasks.
arXiv Detail & Related papers (2025-09-21T11:51:57Z) - AgriGPT: a Large Language Model Ecosystem for Agriculture [16.497060004913806]
AgriGPT is a domain-specialized Large Language Models ecosystem for agriculture usage.<n>At its core, we design a scalable data engine that compiles credible data sources into Agri-342K, a high-quality, standardized question-answer dataset.<n>We employ Tri-RAG, a three-channel Retrieval-Augmented Generation framework combining dense retrieval, sparse retrieval, and multi-hop knowledge graph reasoning.
arXiv Detail & Related papers (2025-08-12T04:51:08Z) - Leveraging Synthetic Data for Question Answering with Multilingual LLMs in the Agricultural Domain [1.0144032120138065]
This study generates multilingual (English, Hindi, Punjabi) synthetic datasets from agriculture-specific documents from India.<n> Evaluation on human-created datasets demonstrates significant improvements in factuality, relevance, and agricultural consensus.
arXiv Detail & Related papers (2025-07-22T19:25:10Z) - Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition [57.131546757903834]
Lyra is an efficient MLLM that enhances multimodal abilities, including advanced long-speech comprehension, sound understanding, cross-modality efficiency, and seamless speech interaction.<n>Lyra achieves state-of-the-art performance on various vision-language, vision-speech, and speech-language benchmarks, while also using fewer computational resources and less training data.
arXiv Detail & Related papers (2024-12-12T17:50:39Z) - Multi3WOZ: A Multilingual, Multi-Domain, Multi-Parallel Dataset for
Training and Evaluating Culturally Adapted Task-Oriented Dialog Systems [64.40789703661987]
Multi3WOZ is a novel multilingual, multi-domain, multi-parallel ToD dataset.
It is large-scale and offers culturally adapted dialogs in 4 languages.
We describe a complex bottom-up data collection process that yielded the final dataset.
arXiv Detail & Related papers (2023-07-26T08:29:42Z) - LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset,
Framework, and Benchmark [81.42376626294812]
We present Language-Assisted Multi-Modal instruction tuning dataset, framework, and benchmark.
Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs.
We present a comprehensive dataset and benchmark, which cover a wide range of vision tasks for 2D and 3D vision.
arXiv Detail & Related papers (2023-06-11T14:01:17Z) - MULTI3NLU++: A Multilingual, Multi-Intent, Multi-Domain Dataset for
Natural Language Understanding in Task-Oriented Dialogue [115.32009638844059]
We extend the English only NLU++ dataset to include manual translations into a range of high, medium, and low resource languages.
Because of its multi-intent property, MULTI3NLU++ represents complex and natural user goals.
We use MULTI3NLU++ to benchmark state-of-the-art multilingual models for the Natural Language Understanding tasks of intent detection and slot labelling.
arXiv Detail & Related papers (2022-12-20T17:34:25Z) - Pre-training Language Model Incorporating Domain-specific Heterogeneous Knowledge into A Unified Representation [49.89831914386982]
We propose a unified pre-trained language model (PLM) for all forms of text, including unstructured text, semi-structured text, and well-structured text.
Our approach outperforms the pre-training of plain text using only 1/4 of the data.
arXiv Detail & Related papers (2021-09-02T16:05:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.