Using text embedding models and vector databases as text classifiers
with the example of medical data
- URL: http://arxiv.org/abs/2402.16886v1
- Date: Wed, 7 Feb 2024 22:15:15 GMT
- Title: Using text embedding models and vector databases as text classifiers
with the example of medical data
- Authors: Rishabh Goel
- Abstract summary: We explore the use of vector databases and embedding models as a means of encoding and classifying text, with an example application in the field of medicine.
We show that the robustness of these tools depends heavily on the sparsity of the data presented; even with low amounts of data in the vector database itself, the database does a good job of classifying data.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The advent of Large Language Models (LLMs) is promising, and LLMs have
found application in numerous fields, but, as is often the case in the medical
field, the bar is typically quite high [5]. In tandem with LLMs, vector
embedding models and vector databases provide a robust way of expressing
numerous modes of data in a form easily digestible by typical machine learning
models. Together with the ease of adding information, knowledge, and data to
these vector databases, this makes a compelling case for applying them in the
many fields where the task of retrieving information is typically done by
humans. Researchers at Google have developed a clear alternative model,
Med-PaLM [6], specifically designed to match a clinician's level of accuracy on
medical knowledge. When training classifiers and developing models, it is
imperative to maintain factuality and reduce bias [4]. Here, we explore the use
of vector databases and embedding models as a means of encoding and classifying
text, with an example application in the field of medicine. We show that the
robustness of these tools depends heavily on the sparsity of the data
presented; even with low amounts of data in the vector database itself, the
database does a good job of classifying data [9]. Using various LLMs to
generate the medical data, we also probe the limits of these models' medical
knowledge and encourage further expert medical review of our test data. By
using vector databases to classify a clinician's notes on a patient presenting
with a certain ailment, we come to understand the limitations of such methods,
but also the promise of their prospective use; with continued testing and
experimentation, we hope to explore a unique use case for vector databases and
embedding models.
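The classification scheme the abstract describes (embed labelled clinical notes, store them in a vector database, and classify new notes by nearest-neighbour similarity) can be sketched as below. This is a minimal illustration, not the paper's implementation: the bag-of-words embedder is a toy stand-in for a real text embedding model, and the in-memory store stands in for an actual vector database; the vocabulary and sample notes are invented for the example.

```python
# Minimal sketch of vector-database text classification via k-nearest
# neighbours. The toy bag-of-words embedder stands in for a real text
# embedding model; the in-memory list stands in for a vector database.
import numpy as np

VOCAB = ["fever", "cough", "chest", "pain", "rash", "itch", "skin"]

def embed(text: str) -> np.ndarray:
    """Toy embedding: bag-of-words counts over a fixed vocabulary."""
    tokens = text.lower().split()
    return np.array([tokens.count(w) for w in VOCAB], dtype=float)

class VectorStore:
    """In-memory stand-in for a vector database of labelled entries."""
    def __init__(self):
        self.vectors, self.labels = [], []

    def add(self, text: str, label: str):
        self.vectors.append(embed(text))
        self.labels.append(label)

    def classify(self, text: str, k: int = 1) -> str:
        """Label a new note by majority vote among its k nearest neighbours."""
        q = embed(text)
        sims = [
            float(v @ q) / ((np.linalg.norm(v) * np.linalg.norm(q)) or 1.0)
            for v in self.vectors
        ]
        top = np.argsort(sims)[::-1][:k]
        votes = [self.labels[i] for i in top]
        return max(set(votes), key=votes.count)

store = VectorStore()
store.add("patient presents with fever and cough", "respiratory")
store.add("chest pain radiating to the arm", "cardiac")
store.add("itch and rash on the skin", "dermatological")

print(store.classify("persistent cough with mild fever"))  # respiratory
```

A real deployment would replace `embed` with calls to an embedding model and `VectorStore` with a vector database offering approximate nearest-neighbour search, but the classification logic is the same: the label travels with the stored vector, so adding knowledge is just inserting more labelled embeddings.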
Related papers
- When Raw Data Prevails: Are Large Language Model Embeddings Effective in Numerical Data Representation for Medical Machine Learning Applications? [8.89829757177796]
We examine the effectiveness of vector representations from last hidden states of Large Language Models for medical diagnostics and prognostics.
We focus on instruction-tuned LLMs in a zero-shot setting to represent abnormal physiological data and evaluate their utilities as feature extractors.
Although the findings suggest that raw data features still prevail in medical ML tasks, zero-shot LLM embeddings demonstrate competitive results.
arXiv Detail & Related papers (2024-08-15T03:56:40Z) - Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z) - Generalist embedding models are better at short-context clinical
semantic search than specialized embedding models [0.9296448006507203]
We construct a dataset based on the ICD-10-CM code descriptions and their easily reproducible rephrasing.
We benchmarked existing embedding models, either generalist or specialized in the clinical domain, in a semantic search task.
Results showed that generalist models performed better than clinical models, suggesting that existing specialized clinical models are more sensitive to small input changes that confuse them.
arXiv Detail & Related papers (2024-01-03T19:03:32Z) - Enhancing Medical Specialty Assignment to Patients using NLP Techniques [0.0]
We propose an alternative approach that achieves superior performance while being computationally efficient.
Specifically, we utilize keywords to train a deep learning architecture that outperforms a language model pretrained on a large corpus of text.
Our results demonstrate that utilizing keywords for text classification significantly improves classification performance.
arXiv Detail & Related papers (2023-12-09T14:13:45Z) - Interpretable Medical Diagnostics with Structured Data Extraction by
Large Language Models [59.89454513692417]
Tabular data is often hidden in text, particularly in medical diagnostic reports.
We propose a novel, simple, and effective methodology for extracting structured tabular data from textual medical reports, called TEMED-LLM.
We demonstrate that our approach significantly outperforms state-of-the-art text classification models in medical diagnostics.
arXiv Detail & Related papers (2023-06-08T09:12:28Z) - An Iterative Optimizing Framework for Radiology Report Summarization with ChatGPT [80.33783969507458]
The 'Impression' section of a radiology report is a critical basis for communication between radiologists and other physicians.
Recent studies have achieved promising results in automatic impression generation using large-scale medical text data.
These models often require substantial amounts of medical text data and have poor generalization performance.
arXiv Detail & Related papers (2023-04-17T17:13:42Z) - A Meta-embedding-based Ensemble Approach for ICD Coding Prediction [64.42386426730695]
International Classification of Diseases (ICD) codes are the de facto standard used globally for clinical coding.
These codes enable healthcare providers to claim reimbursement and facilitate efficient storage and retrieval of diagnostic information.
Our proposed approach enhances the performance of neural models by effectively training word vectors using routine medical data as well as external knowledge from scientific articles.
arXiv Detail & Related papers (2021-02-26T17:49:58Z) - PharmKE: Knowledge Extraction Platform for Pharmaceutical Texts using
Transfer Learning [0.0]
PharmKE is a text analysis platform that applies deep learning through several stages for thorough semantic analysis of pharmaceutical articles.
The methodology is used to create accurately labeled training and test datasets, which are then used to train models for custom entity labeling tasks.
The obtained results are compared to the fine-tuned BERT and BioBERT models trained on the same dataset.
arXiv Detail & Related papers (2021-02-25T19:36:35Z) - Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype
Prediction [55.94378672172967]
We focus on few-shot disease subtype prediction problem, identifying subgroups of similar patients.
We introduce meta learning techniques to develop a new model, which can extract the common experience or knowledge from interrelated clinical tasks.
Our new model is built upon a carefully designed meta-learner, called Prototypical Network, a simple yet effective meta-learning machine for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z) - Self-Training with Improved Regularization for Sample-Efficient Chest
X-Ray Classification [80.00316465793702]
We present a deep learning framework that enables robust modeling in challenging scenarios.
Our results show that using 85% less labeled data, we can build predictive models that match the performance of classifiers trained in a large-scale data setting.
arXiv Detail & Related papers (2020-05-03T02:36:00Z)
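The short-context clinical semantic-search setup described in the related papers above (embedding ICD-10-CM code descriptions and checking whether a rephrased query still retrieves the matching description) can be sketched as follows. This is a toy illustration under stated assumptions: the character-trigram embedder stands in for a real generalist or clinical embedding model, and the sample descriptions are abbreviated examples, not the benchmark dataset.

```python
# Sketch of embedding-based semantic search: rank stored descriptions by
# cosine similarity to a query. Character-trigram counts are a toy
# stand-in for a real embedding model.
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy embedding: character-trigram counts of the lowercased text."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse trigram-count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm or 1.0)

def top1(query: str, corpus: list) -> str:
    """Return the corpus entry ranked most similar to the query."""
    q = embed(query)
    return max(corpus, key=lambda d: cosine(embed(d), q))

descriptions = [
    "Acute upper respiratory infection, unspecified",
    "Fracture of shaft of left femur",
    "Type 2 diabetes mellitus without complications",
]
# A rephrased query should still retrieve the matching description.
print(top1("unspecified acute infection of the upper respiratory tract",
           descriptions))
```

Benchmarks of the kind cited above then measure, over many such rephrasings, how often `top1` returns the correct description for each candidate embedding model.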
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.