Towards scientific discovery with dictionary learning: Extracting biological concepts from microscopy foundation models
- URL: http://arxiv.org/abs/2412.16247v2
- Date: Tue, 11 Feb 2025 16:54:45 GMT
- Title: Towards scientific discovery with dictionary learning: Extracting biological concepts from microscopy foundation models
- Authors: Konstantin Donhauser, Kristina Ulicna, Gemma Elyse Moran, Aditya Ravuri, Kian Kenyon-Dean, Cian Eastwood, Jason Hartford
- Abstract summary: We propose Iterative Codebook Feature Learning (ICFL) and combine it with a pre-processing step that uses PCA whitening from a control dataset.
We show that both ICFL and PCA improve the selectivity of extracted features compared to TopK sparse autoencoders.
- Score: 6.136186137141521
- License:
- Abstract: Dictionary learning (DL) has emerged as a powerful interpretability tool for large language models. By extracting known concepts (e.g., Golden-Gate Bridge) from human-interpretable data (e.g., text), sparse DL can elucidate a model's inner workings. In this work, we ask if DL can also be used to discover unknown concepts from less human-interpretable scientific data (e.g., cell images), ultimately enabling modern approaches to scientific discovery. As a first step, we use DL algorithms to study microscopy foundation models trained on multi-cell image data, where little prior knowledge exists regarding which high-level concepts should arise. We show that sparse dictionaries indeed extract biologically meaningful concepts such as cell type and genetic perturbation type. We also propose Iterative Codebook Feature Learning (ICFL) and combine it with a pre-processing step which uses PCA whitening from a control dataset. In our experiments, we demonstrate that both ICFL and PCA improve the selectivity of extracted features compared to TopK sparse autoencoders.
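The two baseline ingredients named in the abstract, PCA whitening from a control dataset and a TopK sparse code, can be illustrated with a minimal NumPy sketch. This is not the authors' code and does not implement ICFL; the function names, the `eps` regularizer, the random encoder matrix, and the choice to fit the whitening statistics on control embeddings only are all assumptions made for illustration.

```python
import numpy as np


def fit_pca_whitening(control, eps=1e-6):
    """Fit a PCA whitening transform on control embeddings of shape (n, dim).

    Hypothetical helper: returns the control mean and a (dim, dim) matrix W
    such that (control - mean) @ W has approximately identity covariance.
    """
    mean = control.mean(axis=0)
    centered = control - mean
    cov = centered.T @ centered / (control.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Clip tiny negative eigenvalues caused by round-off; eps avoids division by zero.
    eigvals = np.clip(eigvals, 0.0, None)
    W = eigvecs / np.sqrt(eigvals + eps)
    return mean, W


def whiten(x, mean, W):
    """Apply the control-derived whitening to any batch of embeddings."""
    return (x - mean) @ W


def topk_codes(x, encoder, k):
    """TopK sparse encoding: keep the k largest pre-activations per row.

    The kept values pass through a ReLU, so each row has at most k non-zeros.
    """
    pre = x @ encoder
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    rows = np.arange(pre.shape[0])[:, None]
    codes = np.zeros_like(pre)
    codes[rows, idx] = np.maximum(pre[rows, idx], 0.0)
    return codes


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins for model embeddings (assumption: 8-dimensional).
    control = rng.normal(size=(500, 8)) * np.arange(1, 9)
    perturbed = rng.normal(size=(100, 8)) * np.arange(1, 9) + 0.5

    mean, W = fit_pca_whitening(control)
    z = whiten(perturbed, mean, W)

    encoder = rng.normal(size=(8, 32))  # untrained, illustrative dictionary encoder
    codes = topk_codes(z, encoder, k=4)
    print(codes.shape, np.count_nonzero(codes, axis=1).max())
```

In a real pipeline the encoder would be learned jointly with a decoder by minimizing reconstruction error; the random matrix here only demonstrates the sparsity pattern.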
Related papers
- Building, Reusing, and Generalizing Abstract Representations from Concrete Sequences [51.965994405124455]
Humans excel at learning abstract patterns across different sequences, filtering out irrelevant details.
Many sequence learning models lack the ability to abstract, which leads to memory inefficiency and poor transfer.
We introduce a non-parametric hierarchical variable learning model (HVM) that learns chunks from sequences and abstracts contextually similar chunks as variables.
arXiv Detail & Related papers (2024-10-27T18:13:07Z) - Diversifying Knowledge Enhancement of Biomedical Language Models using Adapter Modules and Knowledge Graphs [54.223394825528665]
We develop an approach that uses lightweight adapter modules to inject structured biomedical knowledge into pre-trained language models.
We use two large KGs, the biomedical knowledge system UMLS and the novel biochemical OntoChem, with two prominent biomedical PLMs, PubMedBERT and BioLinkBERT.
We show that our methodology leads to performance improvements in several instances while keeping requirements in computing power low.
arXiv Detail & Related papers (2023-12-21T14:26:57Z) - Transparency at the Source: Evaluating and Interpreting Language Models With Access to the True Distribution [4.01799362940916]
We present a setup for training, evaluating and interpreting neural language models that uses artificial, language-like data.
The data is generated using a massive probabilistic grammar that is itself derived from a large natural language corpus.
With access to the underlying true source, our results show striking differences and outcomes in learning dynamics between different classes of words.
arXiv Detail & Related papers (2023-10-23T12:03:01Z) - Incorporating Dictionaries into a Neural Network Architecture to Extract COVID-19 Medical Concepts From Social Media [0.2302001830524133]
We investigate the potential benefit of incorporating dictionary information into a neural network architecture for natural language processing.
In particular, we make use of this architecture to extract several concepts related to COVID-19 from an online medical forum.
Our results show that incorporating small domain dictionaries into deep learning models can improve concept extraction tasks.
arXiv Detail & Related papers (2023-09-05T12:47:44Z) - Lattice-preserving $\mathcal{ALC}$ ontology embeddings with saturation [50.05281461410368]
An order-preserving embedding method is proposed to generate embeddings of OWL representations.
We show that our method outperforms state-of-the-art embedding methods in several knowledge base completion tasks.
arXiv Detail & Related papers (2023-05-11T22:27:51Z) - Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z) - Open Vocabulary Electroencephalography-To-Text Decoding and Zero-shot Sentiment Classification [78.120927891455]
State-of-the-art brain-to-text systems have achieved great success in decoding language directly from brain signals using neural networks.
In this paper, we extend the problem to open vocabulary Electroencephalography (EEG)-To-Text Sequence-To-Sequence decoding and zero-shot sentence sentiment classification on natural reading tasks.
Our model achieves a 40.1% BLEU-1 score on EEG-To-Text decoding and a 55.6% F1 score on zero-shot EEG-based ternary sentiment classification, which significantly outperforms supervised baselines.
arXiv Detail & Related papers (2021-12-05T21:57:22Z) - Improving Biomedical Pretrained Language Models with Knowledge [22.61591249168801]
We propose KeBioLM, a biomedical pretrained language model that explicitly leverages knowledge from the UMLS knowledge bases.
Specifically, we extract entities from PubMed abstracts and link them to UMLS.
We then train a knowledge-aware language model that first applies a text-only encoding layer to learn entity representations and then applies a text-entity fusion encoding to aggregate them.
arXiv Detail & Related papers (2021-04-21T03:57:26Z) - Latent Feature Representation via Unsupervised Learning for Pattern Discovery in Massive Electron Microscopy Image Volumes [4.278591555984395]
In particular, we give an unsupervised deep learning approach to learning a latent representation that captures semantic similarity in the data set.
We demonstrate the utility of our method applied to nano-scale electron microscopy data, where even relatively small portions of animal brains can require terabytes of image data.
arXiv Detail & Related papers (2020-12-22T17:14:19Z) - Reprogramming Language Models for Molecular Representation Learning [65.00999660425731]
We propose Representation Reprogramming via Dictionary Learning (R2DL) for adversarially reprogramming pretrained language models for molecular learning tasks.
The adversarial program learns a linear transformation between a dense source model input space (language data) and a sparse target model input space (e.g., chemical and biological molecule data) using a k-SVD solver.
R2DL achieves the baseline established by state-of-the-art toxicity prediction models trained on domain-specific data and outperforms the baseline in a limited training-data setting.
arXiv Detail & Related papers (2020-12-07T05:50:27Z) - The Interpretable Dictionary in Sparse Coding [4.205692673448206]
In our work, we illustrate that an ANN, trained using sparse coding under specific sparsity constraints, yields a more interpretable model than the standard deep learning model.
The dictionary learned by sparse coding can be more easily understood, and the activations of its elements create a selective feature output.
arXiv Detail & Related papers (2020-11-24T00:26:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.