Leveraging Synthetic Data for Question Answering with Multilingual LLMs in the Agricultural Domain
- URL: http://arxiv.org/abs/2507.16974v1
- Date: Tue, 22 Jul 2025 19:25:10 GMT
- Title: Leveraging Synthetic Data for Question Answering with Multilingual LLMs in the Agricultural Domain
- Authors: Rishemjit Kaur, Arshdeep Singh Bhankhar, Surangika Ranathunga, Jashanpreet Singh Salh, Sudhir Rajput, Vidhi, Kashish Mahendra, Bhavika Berwal, Ritesh Kumar
- Abstract summary: Large language models (LLMs) in agriculture typically offer generic advisories, lacking precision in local and multilingual contexts. This study generates multilingual synthetic agricultural datasets (English, Hindi, Punjabi) from agriculture-specific documents and fine-tunes language-specific LLMs. Our evaluation on curated multilingual datasets demonstrates significant improvements in factual accuracy, relevance, and agricultural consensus.
- Score: 1.0144032120138065
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Enabling farmers to access accurate agriculture-related information in their native languages in a timely manner is crucial for the success of the agriculture field. Although large language models (LLMs) can be used to implement Question Answering (QA) systems, simply using publicly available general-purpose LLMs in agriculture typically offers generic advisories, lacking precision in local and multilingual contexts due to insufficient domain-specific training and scarcity of high-quality, region-specific datasets. Our study addresses these limitations by generating multilingual synthetic agricultural datasets (English, Hindi, Punjabi) from agriculture-specific documents and fine-tuning language-specific LLMs. Our evaluation on curated multilingual datasets demonstrates significant improvements in factual accuracy, relevance, and agricultural consensus for the fine-tuned models compared to their baseline counterparts. These results highlight the efficacy of synthetic data-driven, language-specific fine-tuning as an effective strategy to improve the performance of LLMs in agriculture, especially in multilingual and low-resource settings. By enabling more accurate and localized agricultural advisory services, this study provides a meaningful step toward bridging the knowledge gap in AI-driven agricultural solutions for diverse linguistic communities.
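The abstract describes the data pipeline only at a high level. The sketch below shows one plausible shape for the synthetic-data step: prompting an instruction-tuned LLM to write grounded question-answer pairs from a document passage in each target language, then converting them into instruction-tuning records. It is a minimal, hypothetical sketch; the model name, prompt wording, and JSON output schema are assumptions, not details taken from the paper.

```python
# Hypothetical sketch: generate multilingual synthetic QA pairs from
# agriculture-specific document passages with an instruction-tuned LLM.
# Model name, prompt template, and output schema are illustrative assumptions.
import json
from transformers import pipeline

generator = pipeline("text-generation",
                     model="Qwen/Qwen2.5-7B-Instruct")  # assumed multilingual instruct model

PROMPT = (
    "You are an agricultural extension expert. From the passage below, write "
    "{n} question-answer pairs that a farmer might ask, in {lang}. Answer only "
    "from the passage. Return a JSON list of objects with keys "
    "'question' and 'answer'.\n\nPassage:\n{passage}\n"
)

def synthesize_qa(passage: str, lang: str, n: int = 3) -> list[dict]:
    """Ask the LLM for n grounded QA pairs in `lang` and parse its JSON reply."""
    raw = generator(PROMPT.format(n=n, lang=lang, passage=passage),
                    max_new_tokens=512, return_full_text=False)[0]["generated_text"]
    start, end = raw.find("["), raw.rfind("]") + 1   # crude extraction of the JSON list
    try:
        return json.loads(raw[start:end])
    except ValueError:
        return []  # drop passages where the model ignored the requested format

def to_sft_records(pairs: list[dict], lang: str) -> list[dict]:
    """Convert QA pairs into instruction-tuning records for a language-specific model."""
    return [{"instruction": p["question"], "output": p["answer"], "lang": lang}
            for p in pairs]

# Example: build records for each target language from one illustrative passage.
passage = "Wheat in Punjab is generally sown from late October to mid-November..."
dataset = [rec for lang in ("English", "Hindi", "Punjabi")
           for rec in to_sft_records(synthesize_qa(passage, lang), lang)]
```

Records produced this way would still need filtering for factuality, and would then feed a per-language fine-tuning run such as the adapter-based sketch at the end of the related-papers list below.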
Related papers
- AgriEval: A Comprehensive Chinese Agricultural Benchmark for Large Language Models [19.265932725554833]
We propose AgriEval, the first comprehensive Chinese agricultural benchmark with three main characteristics. AgriEval covers six major agriculture categories and 29 subcategories within agriculture, addressing four core cognitive scenarios. AgriEval comprises 14,697 multiple-choice questions and 2,167 open-ended question-and-answer questions, establishing it as the most extensive agricultural benchmark available to date.
arXiv Detail & Related papers (2025-07-29T12:58:27Z) - AgriCHN: A Comprehensive Cross-domain Resource for Chinese Agricultural Named Entity Recognition [30.51577375197722]
We present AgriCHN, a comprehensive open-source Chinese resource designed to promote the accuracy of automated agricultural entity annotation. The dataset has been meticulously curated from a wealth of agricultural articles, comprising a total of 4,040 sentences and encapsulating 15,799 agricultural entity mentions. A benchmark task has also been constructed using several state-of-the-art neural NER models.
arXiv Detail & Related papers (2025-06-21T04:21:11Z) - Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models [52.22235443948351]
High-quality multilingual training data is essential for effectively pretraining large language models (LLMs). Here, we introduce JQL, a systematic approach that efficiently curates diverse and high-quality multilingual data at scale. JQL distills LLMs' annotation capabilities into lightweight annotators based on pretrained multilingual embeddings.
arXiv Detail & Related papers (2025-05-28T11:06:54Z) - AgroLLM: Connecting Farmers and Agricultural Practices through Large Language Models for Enhanced Knowledge Transfer and Practical Application [1.9643850583333375]
AgroLLM is designed to enhance knowledge-sharing and education in agriculture using Large Language Models (LLMs) and a Retrieval-Augmented Generation (RAG) framework. A comparative study of three advanced models was conducted to evaluate performance across four key agricultural domains. ChatGPT-4o Mini with RAG achieved the highest accuracy at 93%.
arXiv Detail & Related papers (2025-02-28T04:13:18Z) - Enhancing Code Generation for Low-Resource Languages: No Silver Bullet [55.39571645315926]
Large Language Models (LLMs) rely on large and diverse datasets to learn syntax, semantics, and usage patterns of programming languages. For low-resource languages, the limited availability of such data hampers the models' ability to generalize effectively. We present an empirical study investigating the effectiveness of several approaches for boosting LLMs' performance on low-resource languages.
arXiv Detail & Related papers (2025-01-31T12:23:28Z) - Agri-LLaVA: Knowledge-Infused Large Multimodal Assistant on Agricultural Pests and Diseases [49.782064512495495]
We construct the first multimodal instruction-following dataset in the agricultural domain. This dataset covers over 221 types of pests and diseases with approximately 400,000 data entries. We propose a knowledge-infused training method to develop Agri-LLaVA, an agricultural multimodal conversation system.
arXiv Detail & Related papers (2024-12-03T04:34:23Z) - AgriBench: A Hierarchical Agriculture Benchmark for Multimodal Large Language Models [4.12825661607328]
AgriBench is the first benchmark designed to evaluate MultiModal Large Language Models (MM-LLMs) for agriculture applications. We propose MM-LUCAS, a multimodal agriculture dataset that includes 1,784 landscape images, segmentation masks, depth maps, and detailed annotations. This work presents a groundbreaking perspective in advancing agriculture MM-LLMs and is still in progress, offering valuable insights for future developments and innovations in specific expert knowledge-based MM-LLMs.
arXiv Detail & Related papers (2024-11-30T12:59:03Z) - AgroGPT: Efficient Agricultural Vision-Language Model with Expert Tuning [30.034193330398292]
We propose an approach to construct instruction-tuning data that harnesses vision-only data for the agriculture domain. We utilize diverse agricultural datasets spanning multiple domains, curate class-specific information, and employ large language models (LLMs) to construct an expert-tuning set. We expert-tuned and created AgroGPT, an efficient LMM that can hold complex agriculture-related conversations and provide useful insights.
arXiv Detail & Related papers (2024-10-10T22:38:26Z) - Generating Diverse Agricultural Data for Vision-Based Farming Applications [74.79409721178489]
This model is capable of simulating distinct growth stages of plants, diverse soil conditions, and randomized field arrangements under varying lighting conditions.
Our dataset includes 12,000 images with semantic labels, offering a comprehensive resource for computer vision tasks in precision agriculture.
arXiv Detail & Related papers (2024-03-27T08:42:47Z) - Large Language Models for Data Annotation and Synthesis: A Survey [49.8318827245266]
This survey focuses on the utility of Large Language Models for data annotation and synthesis. It includes an in-depth taxonomy of data types that LLMs can annotate, a review of learning strategies for models utilizing LLM-generated annotations, and a detailed discussion of the primary challenges and limitations associated with using LLMs for data annotation and synthesis.
arXiv Detail & Related papers (2024-02-21T00:44:04Z) - UltraLink: An Open-Source Knowledge-Enhanced Multilingual Supervised Fine-tuning Dataset [69.33424532827608]
Open-source large language models (LLMs) have gained significant strength across diverse fields.
In this work, we construct an open-source multilingual supervised fine-tuning dataset.
The resulting UltraLink dataset comprises approximately 1 million samples across five languages.
arXiv Detail & Related papers (2024-02-07T05:05:53Z) - CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages [86.90220551111096]
Training datasets for large language models (LLMs) are often not fully disclosed.
We present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages.
arXiv Detail & Related papers (2023-09-17T23:49:10Z) - Embedding-based Retrieval with LLM for Effective Agriculture Information Extracting from Unstructured Data [5.573704309892796]
We explore using a domain-agnostic, general pre-trained large language model (LLM) to extract structured data from agricultural documents with minimal or no human intervention.
In comparison to existing methods, our approach achieves consistently better accuracy in the benchmark while maintaining efficiency.
arXiv Detail & Related papers (2023-08-06T13:18:38Z) - QAmeleon: Multilingual QA with Only 5 Examples [71.80611036543633]
We show how to leverage pre-trained language models in a few-shot learning setting.
Our approach, QAmeleon, uses a PLM to automatically generate multilingual data upon which QA models are trained.
Prompt tuning the PLM for data synthesis with only five examples per language delivers accuracy superior to translation-based baselines.
arXiv Detail & Related papers (2022-11-15T16:14:39Z)
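QAmeleon above and the headline study share the same second step: a model is fine-tuned on the synthetic QA data. Below is a minimal, hypothetical sketch of that step using LoRA adapters on a small causal LM; the base model, hyperparameters, and prompt format are illustrative assumptions rather than details from either paper.

```python
# Hypothetical sketch: fine-tune a small causal LM with LoRA adapters on
# synthetic QA records (e.g. those produced by the generation sketch above).
# Base model, LoRA hyperparameters, and prompt format are assumptions.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

base = "Qwen/Qwen2.5-0.5B-Instruct"            # assumed small multilingual base model
tok = AutoTokenizer.from_pretrained(base)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token              # ensure the collator can pad

model = AutoModelForCausalLM.from_pretrained(base)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32,
                                         target_modules=["q_proj", "v_proj"],
                                         task_type="CAUSAL_LM"))

records = [  # synthetic QA pairs; in practice thousands per language
    {"instruction": "When should wheat be sown in Punjab?",
     "output": "Sowing is generally recommended from late October to mid-November."},
]

def tokenize(rec):
    # One causal-LM training sequence: the question followed by the grounded answer.
    text = (f"### Question:\n{rec['instruction']}\n"
            f"### Answer:\n{rec['output']}{tok.eos_token}")
    return tok(text, truncation=True, max_length=512)

train_ds = Dataset.from_list(records).map(tokenize,
                                          remove_columns=["instruction", "output"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="agri-qa-lora", num_train_epochs=1,
                           per_device_train_batch_size=1, logging_steps=1),
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),  # causal-LM labels
)
trainer.train()
```

One reason to sketch this with adapters is that repeating the recipe for each language-specific model (English, Hindi, Punjabi) stays cheap, since only the small adapter weights are trained per language.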