PlantDeBERTa: An Open Source Language Model for Plant Science
- URL: http://arxiv.org/abs/2506.08897v4
- Date: Tue, 19 Aug 2025 13:22:49 GMT
- Title: PlantDeBERTa: An Open Source Language Model for Plant Science
- Authors: Hiba Khey, Amine Lakhder, Salma Rouichi, Imane El Ghabi, Kamal Hejjaoui, Younes En-nahli, Fahd Kalloubi, Moez Amri,
- Abstract summary: We present PlantDeBERTa, a high-performance, open-source language model for extracting structured knowledge from plant stress-response literature.<n>Our methodology combines transformer-based modeling with rule-enhanced linguistic post-processing and ontology-grounded entity normalization.<n>Our model is publicly released to promote transparency and accelerate cross-disciplinary innovation in computational plant science.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid advancement of transformer-based language models has catalyzed breakthroughs in biomedical and clinical natural language processing; however, plant science remains markedly underserved by such domain-adapted tools. In this work, we present PlantDeBERTa, a high-performance, open-source language model specifically tailored for extracting structured knowledge from plant stress-response literature. Built upon the DeBERTa architecture-known for its disentangled attention and robust contextual encoding-PlantDeBERTa is fine-tuned on a meticulously curated corpus of expert-annotated abstracts, with a primary focus on lentil (Lens culinaris) responses to diverse abiotic and biotic stressors. Our methodology combines transformer-based modeling with rule-enhanced linguistic post-processing and ontology-grounded entity normalization, enabling PlantDeBERTa to capture biologically meaningful relationships with precision and semantic fidelity. The underlying corpus is annotated using a hierarchical schema aligned with the Crop Ontology, encompassing molecular, physiological, biochemical, and agronomic dimensions of plant adaptation. PlantDeBERTa exhibits strong generalization capabilities across entity types and demonstrates the feasibility of robust domain adaptation in low-resource scientific fields.By providing a scalable and reproducible framework for high-resolution entity recognition, PlantDeBERTa bridges a critical gap in agricultural NLP and paves the way for intelligent, data-driven systems in plant genomics, phenomics, and agronomic knowledge discovery. Our model is publicly released to promote transparency and accelerate cross-disciplinary innovation in computational plant science.
Related papers
- Agentic reinforcement learning empowers next-generation chemical language models for molecular design and synthesis [51.83339196548892]
ChemCraft is a novel framework that decouples chemical reasoning from knowledge storage.<n>ChemCraft achieves superior performance with minimal inference costs.<n>This work establishes a cost-effective and privacy-preserving paradigm for AI-aided chemistry.
arXiv Detail & Related papers (2026-01-25T04:23:34Z) - AI Co-Scientist for Knowledge Synthesis in Medical Contexts: A Proof of Concept [0.0]
We present an AI for scalable and transparent knowledge synthesis based on explicit formalization of Population, Intervention, Comparator, Outcome, and Study design (PICOS)<n>The platform integrates relational storage, vector-based semantic retrieval, and a Neo4j knowledge graph.<n>Results show that PICOS-aware and explainable natural language processing can improve the scalability, transparency, and efficiency of evidence synthesis.
arXiv Detail & Related papers (2026-01-16T23:07:58Z) - Optimizing Agricultural Research: A RAG-Based Approach to Mycorrhizal Fungi Information [1.2349443032034277]
Retrieval-Augmented Generation (RAG) represents a transformative approach within natural language processing (NLP)<n>We present the design and evaluation of a RAG-enabled system tailored for Mycophyto.<n>The framework underscores the potential of AI-driven knowledge discovery to accelerate agroecological innovation and enhance decision-making in sustainable farming systems.
arXiv Detail & Related papers (2025-09-16T20:21:55Z) - CellPainTR: Generalizable Representation Learning for Cross-Dataset Cell Painting Analysis [51.56484100374058]
We introduce CellPainTR, a Transformer-based architecture designed to learn foundational representations of cellular morphology.<n>Our work represents a significant step towards creating truly foundational models for image-based profiling, enabling more reliable and scalable cross-study biological analysis.
arXiv Detail & Related papers (2025-09-02T03:30:07Z) - Generative Artificial Intelligence Extracts Structure-Function Relationships from Plants for New Materials [0.0]
Large language models (LLMs) have reshaped the research landscape by enabling new approaches to knowledge retrieval and creative ideation.<n>We present a first-of-its-kind framework that integrates generative AI with literature from hitherto-unconnected fields such as plant science, biomimetics, and materials engineering.<n>We focus on humidity-responsive systems such as pollen-based materials and Rhapis excelsa leaves, which exhibit self-actuation and adaptive performance.
arXiv Detail & Related papers (2025-08-08T10:41:03Z) - Bridging the Plausibility-Validity Gap by Fine-Tuning a Reasoning-Enhanced LLM for Chemical Synthesis and Discovery [0.0]
Large Language Models often generate scientifically plausible but factually invalid information.<n>This paper presents a systematic methodology to bridge this gap by developing a specialized scientific assistant.
arXiv Detail & Related papers (2025-07-09T23:05:23Z) - Biological Sequence with Language Model Prompting: A Survey [14.270959261105968]
Large Language models (LLMs) have emerged as powerful tools for addressing challenges across diverse domains.<n>This paper systematically investigates the application of prompt-based methods with LLMs to biological sequences.
arXiv Detail & Related papers (2025-03-06T06:28:36Z) - GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters.<n>Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks.<n>It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z) - BioMNER: A Dataset for Biomedical Method Entity Recognition [25.403593761614424]
We propose a novel dataset for biomedical method entity recognition.
We employ an automated BioMethod entity recognition and information retrieval system to assist human annotation.
Our empirical findings reveal that the large parameter counts of language models surprisingly inhibit the effective assimilation of entity extraction patterns.
arXiv Detail & Related papers (2024-06-28T16:34:24Z) - Synthesizing Proteins on the Graphics Card. Protein Folding and the Limits of Critical AI Studies [0.8192907805418581]
This paper investigates the application of the transformer architecture in protein folding.<n>We contend that our search for intelligent machines has to begin with the shape, rather than the place, of intelligence.
arXiv Detail & Related papers (2024-05-16T03:24:05Z) - Leveraging Biomolecule and Natural Language through Multi-Modal
Learning: A Survey [75.47055414002571]
The integration of biomolecular modeling with natural language (BL) has emerged as a promising interdisciplinary area at the intersection of artificial intelligence, chemistry and biology.
We provide an analysis of recent advancements achieved through cross modeling of biomolecules and natural language.
arXiv Detail & Related papers (2024-03-03T14:59:47Z) - BonnBeetClouds3D: A Dataset Towards Point Cloud-based Organ-level
Phenotyping of Sugar Beet Plants under Field Conditions [30.27773980916216]
Agricultural production is facing severe challenges in the next decades induced by climate change and the need for sustainability.
Advancements in field management through non-chemical weeding by robots in combination with monitoring of crops by autonomous unmanned aerial vehicles (UAVs) are helpful to address these challenges.
The analysis of plant traits, called phenotyping, is an essential activity in plant breeding, it however involves a great amount of manual labor.
arXiv Detail & Related papers (2023-12-22T14:06:44Z) - Diversifying Knowledge Enhancement of Biomedical Language Models using
Adapter Modules and Knowledge Graphs [54.223394825528665]
We develop an approach that uses lightweight adapter modules to inject structured biomedical knowledge into pre-trained language models.
We use two large KGs, the biomedical knowledge system UMLS and the novel biochemical OntoChem, with two prominent biomedical PLMs, PubMedBERT and BioLinkBERT.
We show that our methodology leads to performance improvements in several instances while keeping requirements in computing power low.
arXiv Detail & Related papers (2023-12-21T14:26:57Z) - High-throughput Biomedical Relation Extraction for Semi-Structured Web Articles Empowered by Large Language Models [1.9665865095034865]
We formulate the relation extraction task as binary classifications for large language models.
We designate the main title as the tail entity and explicitly incorporate it into the context.
Longer contents are sliced into text chunks, embedded, and retrieved with additional embedding models.
arXiv Detail & Related papers (2023-12-13T16:43:41Z) - Improving Biomedical Abstractive Summarisation with Knowledge
Aggregation from Citation Papers [24.481854035628434]
Existing language models struggle to generate technical summaries that are on par with those produced by biomedical experts.
We propose a novel attention-based citation aggregation model that integrates domain-specific knowledge from citation papers.
Our model outperforms state-of-the-art approaches and achieves substantial improvements in abstractive biomedical text summarisation.
arXiv Detail & Related papers (2023-10-24T09:56:46Z) - Semantic Image Segmentation with Deep Learning for Vine Leaf Phenotyping [59.0626764544669]
In this study, we use Deep Learning methods to semantically segment grapevine leaves images in order to develop an automated object detection system for leaf phenotyping.
Our work contributes to plant lifecycle monitoring through which dynamic traits such as growth and development can be captured and quantified.
arXiv Detail & Related papers (2022-10-24T14:37:09Z) - Sparse*BERT: Sparse Models Generalize To New tasks and Domains [79.42527716035879]
This paper studies how models pruned using Gradual Unstructured Magnitude Pruning can transfer between domains and tasks.
We demonstrate that our general sparse model Sparse*BERT can become SparseBioBERT simply by pretraining the compressed architecture on unstructured biomedical text.
arXiv Detail & Related papers (2022-05-25T02:51:12Z) - Fine-Tuning Large Neural Language Models for Biomedical Natural Language
Processing [55.52858954615655]
We conduct a systematic study on fine-tuning stability in biomedical NLP.
We show that finetuning performance may be sensitive to pretraining settings, especially in low-resource domains.
We show that these techniques can substantially improve fine-tuning performance for lowresource biomedical NLP applications.
arXiv Detail & Related papers (2021-12-15T04:20:35Z) - Multimodal Graph-based Transformer Framework for Biomedical Relation
Extraction [21.858440542249934]
We introduce a novel framework that enables the model to learn multi-omnics biological information about entities (proteins) with the help of additional multi-modal cues like molecular structure.
We evaluate our proposed method on ProteinProtein Interaction task from the biomedical corpus.
arXiv Detail & Related papers (2021-07-01T16:37:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.