Using text embedding models and vector databases as text classifiers with the example of medical data
- URL: http://arxiv.org/abs/2402.16886v1
- Date: Wed, 7 Feb 2024 22:15:15 GMT
- Title: Using text embedding models and vector databases as text classifiers with the example of medical data
- Authors: Rishabh Goel
- Abstract summary: We explore the use of vector databases and embedding models as a means of encoding and classifying text, with an example application in the field of medicine.
We show that the robustness of these tools depends heavily on the sparsity of the data presented; even with small amounts of data in the vector database itself, the vector database classifies data well.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The advent of Large Language Models (LLMs) is promising, and they have
found application in numerous fields, but, as is often the case in the medical
field, the bar is typically quite high [5]. In tandem with LLMs, vector
embedding models and vector databases provide a robust way of expressing
numerous modes of data that are easily digestible by typical machine learning
models. Together with the ease of adding information, knowledge, and data to
these vector databases, this makes a compelling case for applying them in
numerous fields where the task of retrieving information is typically done by
humans. Researchers at Google have developed a clear alternative model,
Med-PaLM [6], specifically designed to match a clinician's level of accuracy
in medical knowledge. When training classifiers and developing models, it is
imperative to maintain factuality and reduce bias [4]. Here, we explore the
use of vector databases and embedding models as a means of encoding and
classifying text, with an example application in the field of medicine. We
show that the robustness of these tools depends heavily on the sparsity of
the data presented; even with small amounts of data in the vector database
itself, the vector database classifies data well [9]. Using various LLMs to
generate the medical data, we also probe the limits of these models' medical
knowledge and encourage further expert medical review of our testing data. By
using vector databases to classify a clinician's notes on a patient
presenting with a certain ailment, we come to understand the limitations of
such methods, but also the promise of their prospective use; with continued
testing and experimentation, we hope to explore a unique use case of vector
databases and embedding models.
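The classification scheme the abstract describes — embed a clinician's note, look up its nearest stored vectors, and vote over their labels — can be sketched in a few lines. This is a minimal illustration under assumptions, not the paper's actual pipeline: the toy 3-d vectors stand in for real embedding-model output, the specialty labels are hypothetical, and a production system would use a real embedding model and a vector database rather than a Python list.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def classify(query, database, k=3):
    # database: list of (embedding, label) pairs -- a stand-in for the
    # vector database. Rank stored vectors by similarity to the query
    # and majority-vote over the top-k labels (k-nearest-neighbour
    # classification).
    ranked = sorted(database,
                    key=lambda item: cosine_similarity(query, item[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy 3-d "embeddings"; in practice these come from a text embedding model.
db = [
    ([0.9, 0.1, 0.0], "cardiology"),
    ([0.8, 0.2, 0.1], "cardiology"),
    ([0.1, 0.9, 0.2], "dermatology"),
    ([0.0, 0.8, 0.3], "dermatology"),
]
print(classify([0.85, 0.15, 0.05], db, k=3))  # -> cardiology
```

The same lookup-and-vote step is what a vector database's similarity search performs at scale, which is why classification quality degrades gracefully even when few labeled vectors are stored.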
Related papers
- Idiosyncrasies in Large Language Models [54.26923012617675]
We unveil and study idiosyncrasies in Large Language Models (LLMs).
We find that fine-tuning existing text embedding models on LLM-generated texts yields excellent classification accuracy.
We leverage LLM as judges to generate detailed, open-ended descriptions of each model's idiosyncrasies.
arXiv Detail & Related papers (2025-02-17T18:59:02Z) - When Raw Data Prevails: Are Large Language Model Embeddings Effective in Numerical Data Representation for Medical Machine Learning Applications? [8.89829757177796]
We examine the effectiveness of vector representations from last hidden states of Large Language Models for medical diagnostics and prognostics.
We focus on instruction-tuned LLMs in a zero-shot setting to represent abnormal physiological data and evaluate their utilities as feature extractors.
Although findings suggest that raw data features still prevail in medical ML tasks, zero-shot LLM embeddings demonstrate competitive results.
arXiv Detail & Related papers (2024-08-15T03:56:40Z) - Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z) - Enhancing Medical Specialty Assignment to Patients using NLP Techniques [0.0]
We propose an alternative approach that achieves superior performance while being computationally efficient.
Specifically, we utilize keywords to train a deep learning architecture that outperforms a language model pretrained on a large corpus of text.
Our results demonstrate that utilizing keywords for text classification significantly improves classification performance.
arXiv Detail & Related papers (2023-12-09T14:13:45Z) - Sample Size in Natural Language Processing within Healthcare Research [0.14865681381012494]
Lack of sufficient corpora of previously collected data can be a limiting factor when determining sample sizes for new studies.
This paper tries to address the issue by making recommendations on sample sizes for text classification tasks in the healthcare domain.
arXiv Detail & Related papers (2023-09-05T13:42:43Z) - Interpretable Medical Diagnostics with Structured Data Extraction by Large Language Models [59.89454513692417]
Tabular data is often hidden in text, particularly in medical diagnostic reports.
We propose a novel, simple, and effective methodology for extracting structured tabular data from textual medical reports, called TEMED-LLM.
We demonstrate that our approach significantly outperforms state-of-the-art text classification models in medical diagnostics.
arXiv Detail & Related papers (2023-06-08T09:12:28Z) - An Iterative Optimizing Framework for Radiology Report Summarization with ChatGPT [80.33783969507458]
The 'Impression' section of a radiology report is a critical basis for communication between radiologists and other physicians.
Recent studies have achieved promising results in automatic impression generation using large-scale medical text data.
These models often require substantial amounts of medical text data and have poor generalization performance.
arXiv Detail & Related papers (2023-04-17T17:13:42Z) - Towards Understanding the Generalization of Medical Text-to-SQL Models and Datasets [46.12592636378064]
We show that there is still a long way to go before text-to-SQL generation is solved in the medical domain.
We evaluate state-of-the-art language models, showing substantial drops in performance, with accuracy falling from as high as 92% to 28%.
We introduce a novel data augmentation approach to improve the generalizability of relational language models.
arXiv Detail & Related papers (2023-03-22T20:26:30Z) - RuMedBench: A Russian Medical Language Understanding Benchmark [58.99199480170909]
The paper describes the open Russian medical language understanding benchmark covering several task types.
We prepare the unified format labeling, data split, and evaluation metrics for new tasks.
A single-number metric expresses a model's ability to cope with the benchmark.
arXiv Detail & Related papers (2022-01-17T16:23:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented (including all listed content) and is not responsible for any consequences arising from its use.