Analise Semantica Automatizada com LLM e RAG para Bulas Farmaceuticas
- URL: http://arxiv.org/abs/2507.21103v1
- Date: Mon, 07 Jul 2025 17:48:15 GMT
- Title: Analise Semantica Automatizada com LLM e RAG para Bulas Farmaceuticas
- Authors: Daniel Meireles do Rego,
- Abstract summary: This work investigates the use of RAG (Retrieval-Augmented Generation) architectures combined with Large-Scale Language Models (LLMs) to automate the analysis of documents in PDF format.<n>The proposal integrates vector search techniques by embeddings, semantic data extraction and generation of contextualized natural language responses.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The production of digital documents has been growing rapidly in academic, business, and health environments, presenting new challenges in the efficient extraction and analysis of unstructured information. This work investigates the use of RAG (Retrieval-Augmented Generation) architectures combined with Large-Scale Language Models (LLMs) to automate the analysis of documents in PDF format. The proposal integrates vector search techniques by embeddings, semantic data extraction and generation of contextualized natural language responses. To validate the approach, we conducted experiments with drug package inserts extracted from official public sources. The semantic queries applied were evaluated by metrics such as accuracy, completeness, response speed and consistency. The results indicate that the combination of RAG with LLMs offers significant gains in intelligent information retrieval and interpretation of unstructured technical texts.
Related papers
- Harnessing Large Language Models for Precision Querying and Retrieval-Augmented Knowledge Extraction in Clinical Data Science [3.4325249294405555]
This study applies Large Language Models (LLMs) to two foundational Electronic Health Record (EHR) data science tasks.<n>We test the ability of LLMs to interact accurately with large structured datasets for analytics.<n>We present a flexible evaluation framework that automatically generates synthetic question and answer pairs tailored to the characteristics of each dataset or task.
arXiv Detail & Related papers (2026-01-28T14:57:36Z) - DeepSynth-Eval: Objectively Evaluating Information Consolidation in Deep Survey Writing [53.85037373860246]
We introduce Deep Synth-Eval, a benchmark designed to objectively evaluate information consolidation capabilities.<n>We propose a fine-grained evaluation protocol using General Checklists (for factual coverage) and Constraint Checklists (for structural organization)<n>Our results demonstrate that agentic plan-and-write significantly outperform single-turn generation.
arXiv Detail & Related papers (2026-01-07T03:07:52Z) - LeMat-Synth: a multi-modal toolbox to curate broad synthesis procedure databases from scientific literature [60.879220305044726]
We propose a multi-modal toolbox that employs large language models (LLMs) and vision language models (VLMs) to automatically extract and organize synthesis procedures and performance data.<n>We curated 81k open-access papers, yielding LeMat- Synth (v 1.0): a dataset containing synthesis procedures spanning 35 synthesis methods and 16 material classes.<n>We release a modular, open-source library designed to support community-driven extension to new corpora and synthesis domains.
arXiv Detail & Related papers (2025-10-28T17:58:18Z) - Who Gets Cited Most? Benchmarking Long-Context Language Models on Scientific Articles [81.89404347890662]
SciTrek is a novel question-answering benchmark designed to evaluate the long-context reasoning capabilities of large language models (LLMs) using scientific articles.<n>Our analysis reveals systematic shortcomings in models' abilities to perform basic numerical operations and accurately locate specific information in long contexts.
arXiv Detail & Related papers (2025-09-25T11:36:09Z) - LLM Agents for Interactive Workflow Provenance: Reference Architecture and Evaluation Methodology [3.470217255779291]
We introduce an evaluation methodology, reference architecture, and open-source implementation that leverages interactive Large Language Model (LLM) agents for runtime data analysis.<n>Our approach uses a lightweight, metadata-driven design that translates natural language into structured provenance queries.<n> Evaluations across LLaMA, GPT, Gemini, and Claude, covering diverse query classes and a real-world chemistry workflow, show that modular design, prompt tuning, and Retrieval-Augmented Generation (RAG) enable accurate and insightful agent responses.
arXiv Detail & Related papers (2025-09-17T13:51:29Z) - HySemRAG: A Hybrid Semantic Retrieval-Augmented Generation Framework for Automated Literature Synthesis and Methodological Gap Analysis [55.2480439325792]
HySemRAG is a framework that combines Extract, Transform, Load (ETL) pipelines with Retrieval-Augmented Generation (RAG)<n>System addresses limitations in existing RAG architectures through a multi-layered approach.
arXiv Detail & Related papers (2025-08-01T20:30:42Z) - StructText: A Synthetic Table-to-Text Approach for Benchmark Generation with Multi-Dimensional Evaluation [8.251302684712773]
StructText is an end-to-end framework for automatically generating high-fidelity benchmarks for key-value extraction from text.<n>We evaluate the proposed method on 71,539 examples across 49 documents.
arXiv Detail & Related papers (2025-07-28T21:20:44Z) - ChemActor: Enhancing Automated Extraction of Chemical Synthesis Actions with LLM-Generated Data [53.78763789036172]
We present ChemActor, a fully fine-tuned large language model (LLM) as a chemical executor to convert between unstructured experimental procedures and structured action sequences.<n>This framework integrates a data selection module that selects data based on distribution divergence, with a general-purpose LLM, to generate machine-executable actions from a single molecule input.<n>Experiments on reaction-to-description (R2D) and description-to-action (D2A) tasks demonstrate that ChemActor achieves state-of-the-art performance, outperforming the baseline model by 10%.
arXiv Detail & Related papers (2025-06-30T05:11:19Z) - MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs [54.5729817345543]
MOLE is a framework that automatically extracts metadata attributes from scientific papers covering datasets of languages other than Arabic.<n>Our methodology processes entire documents across multiple input formats and incorporates robust validation mechanisms for consistent output.
arXiv Detail & Related papers (2025-05-26T10:31:26Z) - IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis [60.32962597618861]
IDA-Bench is a novel benchmark evaluating large language models in multi-round interactive scenarios.<n>Agent performance is judged by comparing its final numerical output to the human-derived baseline.<n>Even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on 50% of the tasks, highlighting limitations not evident in single-turn tests.
arXiv Detail & Related papers (2025-05-23T09:37:52Z) - Zero-Shot Document-Level Biomedical Relation Extraction via Scenario-based Prompt Design in Two-Stage with LLM [7.808231572590279]
We propose a novel approach to achieve the same results from unannotated full documents using general large language models (LLMs) with lower hardware and labor costs.<n>Our approach combines two major stages: named entity recognition (NER) and relation extraction (RE)<n>To enhance the effectiveness of prompt, we propose a five-part template structure and a scenario-based prompt design principles.
arXiv Detail & Related papers (2025-05-02T07:33:20Z) - Can LLMs Help Uncover Insights about LLMs? A Large-Scale, Evolving Literature Analysis of Frontier LLMs [32.48924329288906]
This study presents a semi-automated approach for literature analysis that accelerates data extraction using LLMs.<n>It automatically identifies relevant arXiv papers, extracts experimental results and related attributes, and organizes them into a structured dataset, LLMEvalDB.<n>We then conduct an automated literature analysis of frontier LLMs, reducing the effort of paper surveying and data extraction by more than 93% compared to manual approaches.
arXiv Detail & Related papers (2025-02-26T03:56:34Z) - GeAR: Generation Augmented Retrieval [82.20696567697016]
This paper introduces a novel method, $textbfGe$neration.<n>It improves the global document-Query similarity through contrastive learning, but also integrates well-designed fusion and decoding modules.<n>When used as a retriever, GeAR does not incur any additional computational cost over bi-encoders.
arXiv Detail & Related papers (2025-01-06T05:29:00Z) - Enhancing Spectral Knowledge Interrogation: A Reliable Retrieval-Augmented Generative Framework on Large Language Models [0.0]
Large Language Model (LLM) has demonstrated significant success in a range of natural language processing (NLP) tasks within general domain.<n>We introduce the Spectral Detection and Analysis Based Paper (SDAAP) dataset, which is the first open-source textual knowledge dataset for spectral analysis and detection.<n>We also designed an automated Q&A framework based on the SDAAP dataset, which can retrieve relevant knowledge and generate high-quality responses.
arXiv Detail & Related papers (2024-08-21T12:09:37Z) - Integrating Large Language Models and Knowledge Graphs for Extraction and Validation of Textual Test Data [3.114910206366326]
Aerospace manufacturing companies, such as Thales Alenia Space, design, develop, integrate, verify, and validate products.
We propose a hybrid methodology that leverages Knowledge Graphs (KGs) in conjunction with Large Language Models (LLMs) to extract and validate data.
arXiv Detail & Related papers (2024-08-03T07:42:53Z) - Attribute Structuring Improves LLM-Based Evaluation of Clinical Text Summaries [56.31117605097345]
Large language models (LLMs) have shown the potential to generate accurate clinical text summaries, but still struggle with issues regarding grounding and evaluation.<n>Here, we explore a general mitigation framework using Attribute Structuring (AS), which structures the summary evaluation process.<n>AS consistently improves the correspondence between human annotations and automated metrics in clinical text summarization.
arXiv Detail & Related papers (2024-03-01T21:59:03Z) - Interpretable Medical Diagnostics with Structured Data Extraction by
Large Language Models [59.89454513692417]
Tabular data is often hidden in text, particularly in medical diagnostic reports.
We propose a novel, simple, and effective methodology for extracting structured tabular data from textual medical reports, called TEMED-LLM.
We demonstrate that our approach significantly outperforms state-of-the-art text classification models in medical diagnostics.
arXiv Detail & Related papers (2023-06-08T09:12:28Z) - Self-Verification Improves Few-Shot Clinical Information Extraction [73.6905567014859]
Large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning.
They still struggle with issues regarding accuracy and interpretability, especially in mission-critical domains such as health.
Here, we explore a general mitigation framework using self-verification, which leverages the LLM to provide provenance for its own extraction and check its own outputs.
arXiv Detail & Related papers (2023-05-30T22:05:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.