Towards scientific discovery with dictionary learning: Extracting biological concepts from microscopy foundation models
- URL: http://arxiv.org/abs/2412.16247v2
- Date: Tue, 11 Feb 2025 16:54:45 GMT
- Title: Towards scientific discovery with dictionary learning: Extracting biological concepts from microscopy foundation models
- Authors: Konstantin Donhauser, Kristina Ulicna, Gemma Elyse Moran, Aditya Ravuri, Kian Kenyon-Dean, Cian Eastwood, Jason Hartford,
- Abstract summary: We propose Iterative Codebook Feature Learning(ICFL) and combine it with a pre-processing step which uses PCA whitening from a control dataset.<n>We show that both ICFL and PCA improve the selectivity of extracted features compared to TopK sparse autoencoders.
- Score: 6.136186137141521
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dictionary learning (DL) has emerged as a powerful interpretability tool for large language models. By extracting known concepts (e.g., Golden-Gate Bridge) from human-interpretable data (e.g., text), sparse DL can elucidate a model's inner workings. In this work, we ask if DL can also be used to discover unknown concepts from less human-interpretable scientific data (e.g., cell images), ultimately enabling modern approaches to scientific discovery. As a first step, we use DL algorithms to study microscopy foundation models trained on multi-cell image data, where little prior knowledge exists regarding which high-level concepts should arise. We show that sparse dictionaries indeed extract biologically-meaningful concepts such as cell type and genetic perturbation type. We also propose Iterative Codebook Feature Learning~(ICFL) and combine it with a pre-processing step which uses PCA whitening from a control dataset. In our experiments, we demonstrate that both ICFL and PCA improve the selectivity of extracted features compared to TopK sparse autoencoders.
Related papers
- Concept-Guided Interpretability via Neural Chunking [54.73787666584143]
We show that neural networks exhibit patterns in their raw population activity that mirror regularities in the training data.<n>We propose three methods to extract these emerging entities, complementing each other based on label availability and dimensionality.<n>Our work points to a new direction for interpretability, one that harnesses both cognitive principles and the structure of naturalistic data.
arXiv Detail & Related papers (2025-05-16T13:49:43Z) - Discovering Chunks in Neural Embeddings for Interpretability [53.80157905839065]
We propose leveraging the principle of chunking to interpret artificial neural population activities.<n>We first demonstrate this concept in recurrent neural networks (RNNs) trained on artificial sequences with imposed regularities.<n>We identify similar recurring embedding states corresponding to concepts in the input, with perturbations to these states activating or inhibiting the associated concepts.
arXiv Detail & Related papers (2025-02-03T20:30:46Z) - Human-like conceptual representations emerge from language prediction [72.5875173689788]
Large language models (LLMs) trained exclusively through next-token prediction over language data exhibit remarkably human-like behaviors.<n>Are these models developing concepts akin to humans, and if so, how are such concepts represented and organized?<n>Our results demonstrate that LLMs can flexibly derive concepts from linguistic descriptions in relation to contextual cues about other concepts.<n>These findings establish that structured, human-like conceptual representations can naturally emerge from language prediction without real-world grounding.
arXiv Detail & Related papers (2025-01-21T23:54:17Z) - Towards Unified Molecule-Enhanced Pathology Image Representation Learning via Integrating Spatial Transcriptomics [10.962389869127875]
We introduce a Unified Molecule-enhanced Pathology Image REpresentationn Learning framework (UMPIRE)<n>UMPIRE aims to leverage complementary information from gene expression profiles to guide the multimodal pre-training.<n>We demonstrate that this molecular perspective provides a robust, task-agnostic training signal for learning pathology image embeddings.
arXiv Detail & Related papers (2024-12-01T03:09:52Z) - Building, Reusing, and Generalizing Abstract Representations from Concrete Sequences [51.965994405124455]
Humans excel at learning abstract patterns across different sequences, filtering out irrelevant details.
Many sequence learning models lack the ability to abstract, which leads to memory inefficiency and poor transfer.
We introduce a non-parametric hierarchical variable learning model (HVM) that learns chunks from sequences and abstracts contextually similar chunks as variables.
arXiv Detail & Related papers (2024-10-27T18:13:07Z) - Towards Ontology-Enhanced Representation Learning for Large Language Models [0.18416014644193066]
We propose a novel approach to improve an embedding-Large Language Model (embedding-LLM) of interest by infusing knowledge by a reference ontology.
The linguistic information (i.e. concept synonyms and descriptions) and structural information (i.e. is-a relations) are utilized to compile a comprehensive set of concept definitions.
These concept definitions are then employed to fine-tune the target embedding-LLM using a contrastive learning framework.
arXiv Detail & Related papers (2024-05-30T23:01:10Z) - A quantitative analysis of knowledge-learning preferences in large language models in molecular science [24.80165173525286]
Large language models (LLMs) introduce a fresh research paradigm to tackle scientific problems from a natural language processing (NLP) perspective.<n>LLMs significantly enhance our understanding and generation of molecules, often surpassing existing methods with their capabilities to decode and synthesize complex molecular patterns.<n>We propose a multi-modal benchmark, named ChEBI-20-MM, and perform 1263 experiments to assess the model's compatibility with data modalities and knowledge acquisition.
arXiv Detail & Related papers (2024-02-06T16:12:36Z) - MLIP: Enhancing Medical Visual Representation with Divergence Encoder
and Knowledge-guided Contrastive Learning [48.97640824497327]
We propose a novel framework leveraging domain-specific medical knowledge as guiding signals to integrate language information into the visual domain through image-text contrastive learning.
Our model includes global contrastive learning with our designed divergence encoder, local token-knowledge-patch alignment contrastive learning, and knowledge-guided category-level contrastive learning with expert knowledge.
Notably, MLIP surpasses state-of-the-art methods even with limited annotated data, highlighting the potential of multimodal pre-training in advancing medical representation learning.
arXiv Detail & Related papers (2024-02-03T05:48:50Z) - Diversifying Knowledge Enhancement of Biomedical Language Models using
Adapter Modules and Knowledge Graphs [54.223394825528665]
We develop an approach that uses lightweight adapter modules to inject structured biomedical knowledge into pre-trained language models.
We use two large KGs, the biomedical knowledge system UMLS and the novel biochemical OntoChem, with two prominent biomedical PLMs, PubMedBERT and BioLinkBERT.
We show that our methodology leads to performance improvements in several instances while keeping requirements in computing power low.
arXiv Detail & Related papers (2023-12-21T14:26:57Z) - Interpreting Pretrained Language Models via Concept Bottlenecks [55.47515772358389]
Pretrained language models (PLMs) have made significant strides in various natural language processing tasks.
The lack of interpretability due to their black-box'' nature poses challenges for responsible implementation.
We propose a novel approach to interpreting PLMs by employing high-level, meaningful concepts that are easily understandable for humans.
arXiv Detail & Related papers (2023-11-08T20:41:18Z) - Transparency at the Source: Evaluating and Interpreting Language Models
With Access to the True Distribution [4.01799362940916]
We present a setup for training, evaluating and interpreting neural language models, that uses artificial, language-like data.
The data is generated using a massive probabilistic grammar, that is itself derived from a large natural language corpus.
With access to the underlying true source, our results show striking differences and outcomes in learning dynamics between different classes of words.
arXiv Detail & Related papers (2023-10-23T12:03:01Z) - Incorporating Dictionaries into a Neural Network Architecture to Extract
COVID-19 Medical Concepts From Social Media [0.2302001830524133]
We investigate the potential benefit of incorporating dictionary information into a neural network architecture for natural language processing.
In particular, we make use of this architecture to extract several concepts related to COVID-19 from an on-line medical forum.
Our results show that incorporating small domain dictionaries to deep learning models can improve concept extraction tasks.
arXiv Detail & Related papers (2023-09-05T12:47:44Z) - Substance or Style: What Does Your Image Embedding Know? [55.676463077772866]
Image foundation models have primarily been evaluated for semantic content.
We measure the visual content of embeddings along many axes, including image style, quality, and a range of natural and artificial transformations.
We find that image-text models (CLIP and ALIGN) are better at recognizing new examples of style transfer than masking-based models (CAN and MAE)
arXiv Detail & Related papers (2023-07-10T22:40:10Z) - Lattice-preserving $\mathcal{ALC}$ ontology embeddings with saturation [50.05281461410368]
An order-preserving embedding method is proposed to generate embeddings of OWL representations.
We show that our method outperforms state-the-art theory-of-the-art embedding methods in several knowledge base completion tasks.
arXiv Detail & Related papers (2023-05-11T22:27:51Z) - Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z) - Open Vocabulary Electroencephalography-To-Text Decoding and Zero-shot
Sentiment Classification [78.120927891455]
State-of-the-art brain-to-text systems have achieved great success in decoding language directly from brain signals using neural networks.
In this paper, we extend the problem to open vocabulary Electroencephalography(EEG)-To-Text Sequence-To-Sequence decoding and zero-shot sentence sentiment classification on natural reading tasks.
Our model achieves a 40.1% BLEU-1 score on EEG-To-Text decoding and a 55.6% F1 score on zero-shot EEG-based ternary sentiment classification, which significantly outperforms supervised baselines.
arXiv Detail & Related papers (2021-12-05T21:57:22Z) - Improving Biomedical Pretrained Language Models with Knowledge [22.61591249168801]
We propose KeBioLM, a biomedical pretrained language model that explicitly leverages knowledge from the UMLS knowledge bases.
Specifically, we extract entities from PubMed abstracts and link them to UMLS.
We then train a knowledge-aware language model that firstly applies a text-only encoding layer to learn entity representation and applies a text-entity fusion encoding to aggregate entity representation.
arXiv Detail & Related papers (2021-04-21T03:57:26Z) - Evolution Is All You Need: Phylogenetic Augmentation for Contrastive
Learning [1.7188280334580197]
Self-supervised representation learning of biological sequence embeddings alleviates computational resource constraints on downstream tasks.
We show that contrastive learning using evolutionary phylogenetic augmentation can be used as a representation learning objective.
arXiv Detail & Related papers (2020-12-25T01:35:06Z) - Latent Feature Representation via Unsupervised Learning for Pattern
Discovery in Massive Electron Microscopy Image Volumes [4.278591555984395]
In particular, we give an unsupervised deep learning approach to learning a latent representation that captures semantic similarity in the data set.
We demonstrate the utility of our method applied to nano-scale electron microscopy data, where even relatively small portions of animal brains can require terabytes of image data.
arXiv Detail & Related papers (2020-12-22T17:14:19Z) - Reprogramming Language Models for Molecular Representation Learning [65.00999660425731]
We propose Representation Reprogramming via Dictionary Learning (R2DL) for adversarially reprogramming pretrained language models for molecular learning tasks.
The adversarial program learns a linear transformation between a dense source model input space (language data) and a sparse target model input space (e.g., chemical and biological molecule data) using a k-SVD solver.
R2DL achieves the baseline established by state of the art toxicity prediction models trained on domain-specific data and outperforms the baseline in a limited training-data setting.
arXiv Detail & Related papers (2020-12-07T05:50:27Z) - The Interpretable Dictionary in Sparse Coding [4.205692673448206]
In our work, we illustrate that an ANN, trained using sparse coding under specific sparsity constraints, yields a more interpretable model than the standard deep learning model.
The dictionary learned by sparse coding can be more easily understood and the activations of these elements creates a selective feature output.
arXiv Detail & Related papers (2020-11-24T00:26:40Z) - Concept Learners for Few-Shot Learning [76.08585517480807]
We propose COMET, a meta-learning method that improves generalization ability by learning to learn along human-interpretable concept dimensions.
We evaluate our model on few-shot tasks from diverse domains, including fine-grained image classification, document categorization and cell type annotation.
arXiv Detail & Related papers (2020-07-14T22:04:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.