Latent Feature Mining for Predictive Model Enhancement with Large Language Models
- URL: http://arxiv.org/abs/2410.04347v1
- Date: Sun, 6 Oct 2024 03:51:32 GMT
- Title: Latent Feature Mining for Predictive Model Enhancement with Large Language Models
- Authors: Bingxuan Li, Pengyi Shi, Amy Ward
- Abstract summary: We introduce an effective approach to formulate latent feature mining as text-to-text propositional logical reasoning.
We propose FLAME, a framework that leverages large language models (LLMs) to augment observed features with latent features.
We validate our framework with two case studies: the criminal justice system and the healthcare domain.
- Score: 2.6334346517416876
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Predictive modeling often faces challenges due to limited data availability and quality, especially in domains where collected features are weakly correlated with outcomes and where additional feature collection is constrained by ethical or practical difficulties. Traditional machine learning (ML) models struggle to incorporate unobserved yet critical factors. In this work, we introduce an effective approach to formulate latent feature mining as text-to-text propositional logical reasoning. We propose FLAME (Faithful Latent Feature Mining for Predictive Model Enhancement), a framework that leverages large language models (LLMs) to augment observed features with latent features and enhance the predictive power of ML models in downstream tasks. Our framework is generalizable across various domains with necessary domain-specific adaptation, as it is designed to incorporate contextual information unique to each area, ensuring effective transfer to different areas facing similar data availability challenges. We validate our framework with two case studies: (1) the criminal justice system, a domain characterized by limited and ethically challenging data collection; (2) the healthcare domain, where patient privacy concerns and the complexity of medical data limit comprehensive feature collection. Our results show that inferred latent features align well with ground truth labels and significantly enhance the downstream classifier.
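The abstract describes augmenting observed features with LLM-inferred latent features before training a downstream classifier. The following is a minimal illustrative sketch of that idea, not the authors' implementation: the prompt template, feature names, and `infer_latent_feature` stub (standing in for a real LLM call) are all hypothetical.

```python
# Sketch of the FLAME idea: serialize observed features into a
# text-to-text reasoning prompt, ask an LLM to infer a latent feature,
# and append the answer to the feature set for a downstream classifier.

def build_prompt(observed: dict) -> str:
    """Serialize observed features into a reasoning prompt (illustrative template)."""
    facts = "; ".join(f"{k} = {v}" for k, v in observed.items())
    return (f"Given the observed facts: {facts}. "
            "Infer the latent risk level (low/high) via step-by-step "
            "propositional reasoning, then answer with one word.")

def infer_latent_feature(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; a trivial rule keeps the
    sketch runnable. In practice this would query a real model."""
    return "high" if "prior_offenses = 3" in prompt else "low"

def augment(observed: dict) -> dict:
    """Return the observed features plus the inferred latent feature."""
    latent = infer_latent_feature(build_prompt(observed))
    return {**observed, "latent_risk": latent}

row = {"age": 29, "prior_offenses": 3}
augmented = augment(row)
# `augmented` now carries the extra latent feature; a downstream ML
# classifier would train on it instead of the raw observed features.
```

In the paper's setting the inferred features are validated against ground-truth labels; the stub above only illustrates the data flow.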
Related papers
- Health AI Developer Foundations [18.690656891269686]
Health AI Developer Foundations (HAI-DEF) is a suite of pre-trained, domain-specific foundation models, tools, and recipes to accelerate building Machine Learning for health applications.
Models cover various modalities and domains, including radiology (X-rays and computed tomography), histopathology, dermatological imaging, and audio.
These models provide domain specific embeddings that facilitate AI development with less labeled data, shorter training times, and reduced computational costs.
arXiv Detail & Related papers (2024-11-22T18:51:51Z)
- Unsupervised Model Diagnosis [49.36194740479798]
This paper proposes Unsupervised Model Diagnosis (UMO) to produce semantic counterfactual explanations without any user guidance.
Our approach identifies and visualizes changes in semantics, and then matches these changes to attributes from wide-ranging text sources.
arXiv Detail & Related papers (2024-10-08T17:59:03Z)
- KIPPS: Knowledge infusion in Privacy Preserving Synthetic Data Generation [0.0]
Generative Deep Learning models struggle to model discrete and non-Gaussian features that have domain constraints.
Generative models create synthetic data that repeats sensitive features, which is a privacy risk.
This paper proposes a novel model, KIPPS, that infuses Domain and Regulatory Knowledge from Knowledge Graphs into Generative Deep Learning models for enhanced Privacy Preserving Synthetic data generation.
arXiv Detail & Related papers (2024-09-25T19:50:03Z)
- Multi-OCT-SelfNet: Integrating Self-Supervised Learning with Multi-Source Data Fusion for Enhanced Multi-Class Retinal Disease Classification [2.5091334993691206]
Development of a robust deep-learning model for retinal disease diagnosis requires a substantial dataset for training.
The capacity to generalize effectively on smaller datasets remains a persistent challenge.
We've combined a wide range of data sources to improve performance and generalization to new data.
arXiv Detail & Related papers (2024-09-17T17:22:35Z)
- GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models [56.63218531256961]
We introduce GenBench, a benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models.
GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies.
We provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance.
arXiv Detail & Related papers (2024-06-01T08:01:05Z)
- Towards Precision Healthcare: Robust Fusion of Time Series and Image Data [8.579651833717763]
We introduce a new method that uses two separate encoders, one for each type of data, allowing the model to understand complex patterns in both visual and time-based information.
We also deal with imbalanced datasets and use an uncertainty loss function, yielding improved results.
Our experiments show that our method is effective in improving multimodal deep learning for clinical applications.
arXiv Detail & Related papers (2024-05-24T11:18:13Z)
- Dial-insight: Fine-tuning Large Language Models with High-Quality Domain-Specific Data Preventing Capability Collapse [4.98050508891467]
We propose a two-stage approach for the construction of production prompts designed to yield high-quality data.
This method involves the generation of a diverse array of prompts that encompass a broad spectrum of tasks and exhibit a rich variety of expressions.
We introduce a cost-effective, multi-dimensional quality assessment framework to ensure the integrity of the generated labeling data.
arXiv Detail & Related papers (2024-03-14T08:27:32Z)
- Prospector Heads: Generalized Feature Attribution for Large Models & Data [82.02696069543454]
We introduce prospector heads, an efficient and interpretable alternative to explanation-based attribution methods.
We demonstrate how prospector heads enable improved interpretation and discovery of class-specific patterns in input data.
arXiv Detail & Related papers (2024-02-18T23:01:28Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP).
What further sets these models apart is the massive amounts of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z)
- Generative Counterfactuals for Neural Networks via Attribute-Informed Perturbation [51.29486247405601]
We design a framework to generate counterfactuals for raw data instances with the proposed Attribute-Informed Perturbation (AIP).
By utilizing generative models conditioned with different attributes, counterfactuals with desired labels can be obtained effectively and efficiently.
Experimental results on real-world texts and images demonstrate the effectiveness, sample quality as well as efficiency of our designed framework.
arXiv Detail & Related papers (2021-01-18T08:37:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.