Embedding-based Retrieval with LLM for Effective Agriculture Information
Extracting from Unstructured Data
- URL: http://arxiv.org/abs/2308.03107v1
- Date: Sun, 6 Aug 2023 13:18:38 GMT
- Title: Embedding-based Retrieval with LLM for Effective Agriculture Information
Extracting from Unstructured Data
- Authors: Ruoling Peng, Kang Liu, Po Yang, Zhipeng Yuan, Shunbao Li
- Abstract summary: We explore using domain-agnostic general pre-trained large language model(LLM) to extract structured data from agricultural documents with minimal or no human intervention.
In comparison to existing methods, our approach achieves consistently better accuracy in the benchmark while maintaining efficiency.
- Score: 5.573704309892796
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pest identification is a crucial aspect of pest control in agriculture.
However, most farmers are not capable of accurately identifying pests in the
field, and there is a limited number of structured data sources available for
rapid querying. In this work, we explored using domain-agnostic general
pre-trained large language model(LLM) to extract structured data from
agricultural documents with minimal or no human intervention. We propose a
methodology that involves text retrieval and filtering using embedding-based
retrieval, followed by LLM question-answering to automatically extract entities
and attributes from the documents, and transform them into structured data. In
comparison to existing methods, our approach achieves consistently better
accuracy in the benchmark while maintaining efficiency.
Related papers
- Few-Shot Adaptation of Grounding DINO for Agricultural Domain [0.29998889086656577]
Open-set object detection models like Grounding-DINO offer a potential solution to detect regions of interests based on text prompt input.
We propose an efficient few-shot adaptation method that simplifies the Grounding-DINO architecture by removing the text encoder module.
This method achieves superior performance across multiple agricultural datasets, including plant-weed detection, plant counting, insect identification, fruit counting, and remote sensing tasks.
arXiv Detail & Related papers (2025-04-09T19:57:25Z) - Dataset Protection via Watermarked Canaries in Retrieval-Augmented LLMs [67.0310240737424]
We introduce a novel approach to safeguard the ownership of text datasets and effectively detect unauthorized use by the RA-LLMs.
Our approach preserves the original data completely unchanged while protecting it by inserting specifically designed canary documents into the IP dataset.
During the detection process, unauthorized usage is identified by querying the canary documents and analyzing the responses of RA-LLMs.
arXiv Detail & Related papers (2025-02-15T04:56:45Z) - Large Language Models and Synthetic Data for Monitoring Dataset Mentions in Research Papers [0.0]
This paper presents a machine learning framework that automates dataset mention detection across research domains.
We employ zero-shot extraction from research papers, an LLM-as-a-Judge for quality assessment, and a reasoning agent for refinement to generate a weakly supervised synthetic dataset.
At inference, a ModernBERT-based classifier efficiently filters dataset mentions, reducing computational overhead while maintaining high recall.
arXiv Detail & Related papers (2025-02-14T16:16:02Z) - Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z) - Value Alignment from Unstructured Text [32.9140028463247]
We introduce a systematic end-to-end methodology for aligning large language models (LLMs) to the implicit and explicit values represented in unstructured text data.
Our proposed approach leverages the use of scalable synthetic data generation techniques to effectively align the model to the values present in the unstructured data.
Our approach credibly aligns LLMs to the values embedded within documents, and shows improved performance against other approaches.
arXiv Detail & Related papers (2024-08-19T20:22:08Z) - CaLM: Contrasting Large and Small Language Models to Verify Grounded Generation [76.31621715032558]
Grounded generation aims to equip language models (LMs) with the ability to produce more credible and accountable responses.
We introduce CaLM, a novel verification framework.
Our framework empowers smaller LMs, which rely less on parametric memory, to validate the output of larger LMs.
arXiv Detail & Related papers (2024-06-08T06:04:55Z) - Aligning Large Language Models with Self-generated Preference Data [72.99676237703099]
We propose a new framework that boosts the alignment of large language models (LLMs) with human preferences.
Our key idea is leveraging the human prior knowledge within the small (seed) data.
We introduce a noise-aware preference learning algorithm to mitigate the risk of low quality within generated preference data.
arXiv Detail & Related papers (2024-06-06T18:01:02Z) - Revisiting Sparse Retrieval for Few-shot Entity Linking [33.15662306409253]
We propose an ELECTRA-based keyword extractor to denoise the mention context and construct a better query expression.
For training the extractor, we propose a distant supervision method to automatically generate training data based on overlapping tokens between mention contexts and entity descriptions.
Experimental results on the ZESHEL dataset demonstrate that the proposed method outperforms state-of-the-art models by a significant margin across all test domains.
arXiv Detail & Related papers (2023-10-19T03:51:10Z) - Information Extraction in Domain and Generic Documents: Findings from
Heuristic-based and Data-driven Approaches [0.0]
Information extraction plays important role in natural language processing.
Document genre and length influence on IE tasks.
No single method demonstrated overwhelming performance in both tasks.
arXiv Detail & Related papers (2023-06-30T20:43:27Z) - Interpretable Medical Diagnostics with Structured Data Extraction by
Large Language Models [59.89454513692417]
Tabular data is often hidden in text, particularly in medical diagnostic reports.
We propose a novel, simple, and effective methodology for extracting structured tabular data from textual medical reports, called TEMED-LLM.
We demonstrate that our approach significantly outperforms state-of-the-art text classification models in medical diagnostics.
arXiv Detail & Related papers (2023-06-08T09:12:28Z) - STAR: Boosting Low-Resource Information Extraction by Structure-to-Text
Data Generation with Large Language Models [56.27786433792638]
STAR is a data generation method that leverages Large Language Models (LLMs) to synthesize data instances.
We design fine-grained step-by-step instructions to obtain the initial data instances.
Our experiments show that the data generated by STAR significantly improve the performance of low-resource event extraction and relation extraction tasks.
arXiv Detail & Related papers (2023-05-24T12:15:19Z) - Productive Crop Field Detection: A New Dataset and Deep Learning
Benchmark Results [1.2233362977312945]
In precision agriculture, detecting productive crop fields is an essential practice that allows the farmer to evaluate operating performance.
Previous studies explore different methods to detect crop fields using advanced machine learning algorithms.
We propose a high-quality dataset generated by machine operation combined with Sentinel-2 images.
arXiv Detail & Related papers (2023-05-19T20:30:59Z) - Mixture of Soft Prompts for Controllable Data Generation [21.84489422361048]
Mixture of Soft Prompts (MSP) is proposed as a tool for data augmentation rather than direct prediction.
Our method achieves state-of-the-art results on three benchmarks when compared against strong baselines.
arXiv Detail & Related papers (2023-03-02T21:13:56Z) - Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG)
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models as well as verify the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.