Sparse Autoencoders for Hypothesis Generation
- URL: http://arxiv.org/abs/2502.04382v1
- Date: Wed, 05 Feb 2025 18:58:02 GMT
- Title: Sparse Autoencoders for Hypothesis Generation
- Authors: Rajiv Movva, Kenny Peng, Nikhil Garg, Jon Kleinberg, Emma Pierson
- Abstract summary: HypotheSAEs is a method to hypothesize relationships between text data (e.g., headlines) and a target variable (e.g., clicks).
It has three steps: (1) train a sparse autoencoder on text embeddings to produce interpretable features describing the data distribution, (2) select features that predict the target variable, and (3) generate a natural language interpretation of each feature.
- Score: 1.5450225594635711
- License:
- Abstract: We describe HypotheSAEs, a general method to hypothesize interpretable relationships between text data (e.g., headlines) and a target variable (e.g., clicks). HypotheSAEs has three steps: (1) train a sparse autoencoder on text embeddings to produce interpretable features describing the data distribution, (2) select features that predict the target variable, and (3) generate a natural language interpretation of each feature (e.g., "mentions being surprised or shocked") using an LLM. Each interpretation serves as a hypothesis about what predicts the target variable. Compared to baselines, our method better identifies reference hypotheses on synthetic datasets (at least +0.06 in F1) and produces more predictive hypotheses on real datasets (~twice as many significant findings), despite requiring 1-2 orders of magnitude less compute than recent LLM-based methods. HypotheSAEs also produces novel discoveries on two well-studied tasks: explaining partisan differences in Congressional speeches and identifying drivers of engagement with online headlines.
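The three steps map naturally onto standard tooling. The sketch below is a minimal illustration, not the released HypotheSAEs implementation: it assumes pre-computed text embeddings, uses a top-k sparse autoencoder in PyTorch, selects features with an L1-regularized linear model, and reduces the LLM interpretation step to building a prompt. The helper names, dimensions, sparsity level, selection method, and prompt wording are all assumptions made for the example.

```python
# Minimal sketch of the three HypotheSAEs steps (illustrative, not the released code).
# Assumes `embeddings` is an (n_texts, d) float array of text embeddings and
# `y` is an (n_texts,) array holding the target variable (e.g., click counts).
import numpy as np
import torch
import torch.nn as nn
from sklearn.linear_model import Lasso

class TopKSAE(nn.Module):
    """Step 1: sparse autoencoder with a top-k activation, trained on embeddings."""
    def __init__(self, d_in: int, n_features: int = 4096, k: int = 32):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_in, n_features)
        self.decoder = nn.Linear(n_features, d_in)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        acts = torch.relu(self.encoder(x))
        # Keep only the k largest activations per example; zero out the rest.
        topk = torch.topk(acts, self.k, dim=-1)
        return torch.zeros_like(acts).scatter_(-1, topk.indices, topk.values)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encode(x))

def train_sae(embeddings: np.ndarray, epochs: int = 10, lr: float = 1e-3) -> TopKSAE:
    x = torch.tensor(embeddings, dtype=torch.float32)
    sae = TopKSAE(d_in=x.shape[1])
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(sae(x), x)  # plain reconstruction loss
        loss.backward()
        opt.step()
    return sae

def select_features(sae: TopKSAE, embeddings: np.ndarray, y: np.ndarray, top_m: int = 20):
    """Step 2: pick SAE features whose activations predict the target variable."""
    with torch.no_grad():
        acts = sae.encode(torch.tensor(embeddings, dtype=torch.float32)).numpy()
    model = Lasso(alpha=0.01).fit(acts, y)
    ranked = np.argsort(-np.abs(model.coef_))[:top_m]
    return ranked, acts

def interpretation_prompt(texts, acts, feature_idx: int, n_examples: int = 10) -> str:
    """Step 3: build an LLM prompt from the texts that most activate one feature.
    The returned string would be sent to an LLM; the wording is an assumption."""
    top_texts = [texts[i] for i in np.argsort(-acts[:, feature_idx])[:n_examples]]
    listing = "\n".join(f"- {t}" for t in top_texts)
    return ("The following texts share a common property. "
            "Describe that property in one short phrase:\n" + listing)
```

In the method described by the abstract, each selected feature's LLM-generated interpretation serves as a candidate hypothesis about what predicts the target; this sketch stops at constructing the interpretation prompt.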
Related papers
- Surprise! Uniform Information Density Isn't the Whole Story: Predicting Surprisal Contours in Long-form Discourse [54.08750245737734]
We propose that speakers modulate information rate based on location within a hierarchically-structured model of discourse.
We find that hierarchical predictors are significant predictors of a discourse's information contour and that deeply nested hierarchical predictors are more predictive than shallow ones.
arXiv Detail & Related papers (2024-10-21T14:42:37Z)
- Using LLMs for Explaining Sets of Counterfactual Examples to Final Users [0.0]
In automated decision-making scenarios, causal inference methods can analyze the underlying data-generation process.
Counterfactual examples explore hypothetical scenarios where a minimal number of factors are altered.
We propose a novel multi-step pipeline that uses counterfactuals to generate natural language explanations of actions that will lead to a change in outcome.
arXiv Detail & Related papers (2024-08-27T15:13:06Z)
- Explaining Text Similarity in Transformer Models [52.571158418102584]
Recent advances in explainable AI make it possible to mitigate the limitations of opaque similarity scores by leveraging improved explanations for Transformers.
We use BiLRP, an extension developed for computing second-order explanations in bilinear similarity models, to investigate which feature interactions drive similarity in NLP models.
Our findings contribute to a deeper understanding of different semantic similarity tasks and models, highlighting how novel explainable AI methods enable in-depth analyses and corpus-level insights.
arXiv Detail & Related papers (2024-05-10T17:11:31Z)
- Hypothesis Generation with Large Language Models [28.73562677221476]
We focus on hypothesis generation based on data (i.e., labeled examples).
Inspired by multi-armed bandits, we design a reward function to inform the exploitation-exploration tradeoff in the update process (a generic exploit/explore scoring sketch appears after this list).
Our algorithm is able to generate hypotheses that enable much better predictive performance than few-shot prompting in classification tasks.
arXiv Detail & Related papers (2024-04-05T18:00:07Z)
- A Hypothesis-Driven Framework for the Analysis of Self-Rationalising Models [0.8702432681310401]
We use a Bayesian network to implement a hypothesis about how a task is solved.
The resulting models do not exhibit a strong similarity to GPT-3.5.
We discuss the implications of this as well as the framework's potential to approximate LLM decisions better in future work.
arXiv Detail & Related papers (2024-02-07T12:26:12Z)
- Estimation of embedding vectors in high dimensions [10.55292041492388]
We consider a simple probability model for discrete data where there is some "true" but unknown embedding.
Under this model, it is shown that the embeddings can be learned by a variant of the low-rank approximate message passing (AMP) method.
Our theoretical findings are validated by simulations on both synthetic data and real text data.
arXiv Detail & Related papers (2023-12-12T23:41:59Z)
- Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval [139.21955930418815]
Cross-modal Retrieval methods build similarity relations between vision and language modalities by jointly learning a common representation space.
However, the predictions are often unreliable due to aleatoric uncertainty, which is induced by low-quality data, e.g., corrupt images, fast-paced videos, and non-detailed texts.
We propose a novel Prototype-based Aleatoric Uncertainty Quantification (PAU) framework to provide trustworthy predictions by quantifying the uncertainty arising from inherent data ambiguity.
arXiv Detail & Related papers (2023-09-29T09:41:19Z)
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
With a reasonable prompt, LLMs can use their generative capability to correct even tokens that are missing from the N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- Large Language Models for Automated Open-domain Scientific Hypotheses Discovery [50.40483334131271]
This work proposes the first dataset for social science academic hypotheses discovery.
Unlike previous settings, the new dataset requires (1) using open-domain data (raw web corpus) as observations, and (2) proposing hypotheses that are new even to humanity.
A multi-module framework is developed for the task, including three different feedback mechanisms to boost performance.
arXiv Detail & Related papers (2023-09-06T05:19:41Z)
- Diversify and Disambiguate: Learning From Underspecified Data [76.67228314592904]
DivDis is a framework that learns a diverse collection of hypotheses for a task by leveraging unlabeled data from the test distribution.
We demonstrate the ability of DivDis to find hypotheses that use robust features in image classification and natural language processing problems with underspecification.
arXiv Detail & Related papers (2022-02-07T18:59:06Z)
- Perturbing Inputs for Fragile Interpretations in Deep Natural Language Processing [18.91129968022831]
Interpretability methods need to be robust for trustworthy NLP applications in high-stakes areas like medicine or finance.
Our paper demonstrates how interpretations can be manipulated by making simple word perturbations on an input text.
arXiv Detail & Related papers (2021-08-11T02:07:21Z)
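The "Hypothesis Generation with Large Language Models" entry above mentions a bandit-inspired reward for trading off exploitation and exploration when updating hypotheses. The snippet below is a generic UCB-style scoring rule sketched purely for illustration, not that paper's reward function; the hypothesis pool, accuracy estimates, and constant c are made-up placeholders.

```python
# Toy UCB-style score for choosing which candidate hypothesis to evaluate next.
# This is a generic illustration of the exploitation-exploration tradeoff,
# not the reward function from the cited paper.
import math

def ucb_score(mean_accuracy: float, times_used: int, total_rounds: int, c: float = 1.0) -> float:
    # Exploitation term: how well the hypothesis has predicted labels so far.
    # Exploration bonus: grows for hypotheses that have been evaluated rarely.
    if times_used == 0:
        return float("inf")  # always try an unevaluated hypothesis first
    return mean_accuracy + c * math.sqrt(math.log(total_rounds) / times_used)

# Example: pick the next hypothesis from a small (fictional) pool of
# (mean accuracy so far, number of evaluations) pairs.
pool = {"mentions a celebrity": (0.62, 40), "uses a question headline": (0.58, 5)}
total = sum(n for _, n in pool.values())
best = max(pool, key=lambda h: ucb_score(pool[h][0], pool[h][1], total))
print(best)  # the rarely tried hypothesis wins via the exploration bonus
```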
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.