Using text embedding models and vector databases as text classifiers with the example of medical data
- URL: http://arxiv.org/abs/2402.16886v1
- Date: Wed, 7 Feb 2024 22:15:15 GMT
- Title: Using text embedding models and vector databases as text classifiers with the example of medical data
- Authors: Rishabh Goel
- Abstract summary: We explore the use of vector databases and embedding models as a means of encoding and classifying text, with an example application in the field of medicine.
We show that the robustness of these tools depends heavily on the sparsity of the data presented; even with small amounts of data in the vector database itself, the vector database classifies data well.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The advent of Large Language Models (LLMs) is promising, and they have
found application in numerous fields, but, as is often the case in the medical
field, the bar is typically quite high [5]. In tandem with LLMs, vector
embedding models and vector databases provide a robust way of expressing
numerous modes of data that are easily digestible by typical machine learning
models. Together with the ease of adding information, knowledge, and data to
these vector databases, this makes a compelling case for applying them in
numerous fields where the task of retrieving information is typically done by
humans. Researchers at Google have developed a clear alternative model,
Med-PaLM [6], specifically designed to match a clinician's level of accuracy
in medical knowledge. When training classifiers and developing models, it is
imperative to maintain factuality and reduce bias [4]. Here, we explore the
use of vector databases and embedding models as a means of encoding and
classifying text, with an example application in the field of medicine. We
show that the robustness of these tools depends heavily on the sparsity of
the data presented; even with small amounts of data in the vector database
itself, the vector database classifies data well [9]. Using various LLMs to
generate the medical data, we also probe the limits of these models' medical
knowledge and encourage further expert medical review of our testing data. By
using vector databases to classify a clinician's notes on a patient
presenting with a certain ailment, we come to understand the limitations of
such methods, but also the promise of their prospective use; with continued
testing and experimentation, we hope to explore a unique use case of vector
databases and embedding models.
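The classification scheme the abstract describes — embed a clinician's note, look up its nearest stored vectors, and vote over their labels — can be sketched in a few lines. This is a minimal illustration under assumptions, not the paper's actual pipeline: the toy 3-d vectors stand in for real embedding-model output, the specialty labels are hypothetical, and a production system would use a real embedding model and a vector database rather than a Python list.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def classify(query, database, k=3):
    # database: list of (embedding, label) pairs -- a stand-in for the
    # vector database. Rank stored vectors by similarity to the query
    # and majority-vote over the top-k labels (k-nearest-neighbour
    # classification).
    ranked = sorted(database,
                    key=lambda item: cosine_similarity(query, item[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy 3-d "embeddings"; in practice these come from a text embedding model.
db = [
    ([0.9, 0.1, 0.0], "cardiology"),
    ([0.8, 0.2, 0.1], "cardiology"),
    ([0.1, 0.9, 0.2], "dermatology"),
    ([0.0, 0.8, 0.3], "dermatology"),
]
print(classify([0.85, 0.15, 0.05], db, k=3))  # -> cardiology
```

The same lookup-and-vote step is what a vector database's similarity search performs at scale, which is why classification quality degrades gracefully even when few labeled vectors are stored.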
Related papers
- Idiosyncrasies in Large Language Models [54.26923012617675]
We unveil and study idiosyncrasies in Large Language Models (LLMs).
We find that fine-tuning existing text embedding models on LLM-generated texts yields excellent classification accuracy.
We leverage LLM as judges to generate detailed, open-ended descriptions of each model's idiosyncrasies.
arXiv Detail & Related papers (2025-02-17T18:59:02Z) - When Raw Data Prevails: Are Large Language Model Embeddings Effective in Numerical Data Representation for Medical Machine Learning Applications? [8.89829757177796]
We examine the effectiveness of vector representations from last hidden states of Large Language Models for medical diagnostics and prognostics.
We focus on instruction-tuned LLMs in a zero-shot setting to represent abnormal physiological data and evaluate their utilities as feature extractors.
Although findings suggest that raw data features still prevail in medical ML tasks, zero-shot LLM embeddings demonstrate competitive results.
arXiv Detail & Related papers (2024-08-15T03:56:40Z) - Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z) - Enhancing Medical Specialty Assignment to Patients using NLP Techniques [0.0]
We propose an alternative approach that achieves superior performance while being computationally efficient.
Specifically, we utilize keywords to train a deep learning architecture that outperforms a language model pretrained on a large corpus of text.
Our results demonstrate that utilizing keywords for text classification significantly improves classification performance.
arXiv Detail & Related papers (2023-12-09T14:13:45Z) - Sample Size in Natural Language Processing within Healthcare Research [0.14865681381012494]
Lack of sufficient corpora of previously collected data can be a limiting factor when determining sample sizes for new studies.
This paper tries to address the issue by making recommendations on sample sizes for text classification tasks in the healthcare domain.
arXiv Detail & Related papers (2023-09-05T13:42:43Z) - Interpretable Medical Diagnostics with Structured Data Extraction by Large Language Models [59.89454513692417]
Tabular data is often hidden in text, particularly in medical diagnostic reports.
We propose a novel, simple, and effective methodology for extracting structured tabular data from textual medical reports, called TEMED-LLM.
We demonstrate that our approach significantly outperforms state-of-the-art text classification models in medical diagnostics.
arXiv Detail & Related papers (2023-06-08T09:12:28Z) - An Iterative Optimizing Framework for Radiology Report Summarization with ChatGPT [80.33783969507458]
The 'Impression' section of a radiology report is a critical basis for communication between radiologists and other physicians.
Recent studies have achieved promising results in automatic impression generation using large-scale medical text data.
These models often require substantial amounts of medical text data and have poor generalization performance.
arXiv Detail & Related papers (2023-04-17T17:13:42Z) - Towards Understanding the Generalization of Medical Text-to-SQL Models and Datasets [46.12592636378064]
We show that there is still a long way to go before text-to-SQL generation is solved in the medical domain.
We evaluate state-of-the-art language models, showing substantial drops in performance, with accuracy falling from as high as 92% to 28%.
We introduce a novel data augmentation approach to improve the generalizability of relational language models.
arXiv Detail & Related papers (2023-03-22T20:26:30Z) - RuMedBench: A Russian Medical Language Understanding Benchmark [58.99199480170909]
The paper describes the open Russian medical language understanding benchmark covering several task types.
We prepare the unified format labeling, data split, and evaluation metrics for new tasks.
A single-number metric expresses a model's ability to cope with the benchmark.
arXiv Detail & Related papers (2022-01-17T16:23:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented (including all listed content) and is not responsible for any consequences arising from its use.