Making Interpretable Discoveries from Unstructured Data: A High-Dimensional Multiple Hypothesis Testing Approach
- URL: http://arxiv.org/abs/2511.01680v1
- Date: Mon, 03 Nov 2025 15:42:32 GMT
- Title: Making Interpretable Discoveries from Unstructured Data: A High-Dimensional Multiple Hypothesis Testing Approach
- Authors: Jacob Carlson
- Abstract summary: Social scientists are increasingly turning to unstructured datasets to unlock new empirical insights. This paper proposes a general and flexible framework for pursuing discovery from unstructured data in a statistically principled way. An open source Jupyter notebook is provided for researchers to implement the framework in their own projects.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Social scientists are increasingly turning to unstructured datasets to unlock new empirical insights, e.g., estimating causal effects on text outcomes, measuring beliefs from open-ended survey responses. In such settings, unsupervised analysis is often of interest, in that the researcher does not want to pre-specify the objects of measurement or otherwise artificially delimit the space of measurable concepts; they are interested in discovery. This paper proposes a general and flexible framework for pursuing discovery from unstructured data in a statistically principled way. The framework leverages recent methods from the literature on machine learning interpretability to map unstructured data points to high-dimensional, sparse, and interpretable dictionaries of concepts; computes (test) statistics of these dictionary entries; and then performs selective inference on them using newly developed statistical procedures for high-dimensional exceedance control of the $k$-FWER under arbitrary dependence. The proposed framework has few researcher degrees of freedom, is fully replicable, and is cheap to implement -- both in terms of financial cost and researcher time. Applications to recent descriptive and causal analyses of unstructured data in empirical economics are explored. An open source Jupyter notebook is provided for researchers to implement the framework in their own projects.
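The abstract's pipeline has three steps: map each document to a sparse, interpretable dictionary of concepts; compute a per-concept test statistic; and apply selective inference with $k$-FWER exceedance control under arbitrary dependence. The sketch below illustrates the shape of that last step with a generic single-step rule (the Lehmann-Romano bound: reject $H_i$ when $p_i \le k\alpha/m$, which controls the $k$-FWER under arbitrary dependence). This is a stand-in for illustration only, not the paper's newly developed procedure, and the concept p-values are hypothetical.

```python
import numpy as np

def k_fwer_rejections(pvals, k=1, alpha=0.05):
    """Single-step k-FWER control under arbitrary dependence
    (Lehmann-Romano): reject H_i when p_i <= k * alpha / m.
    Generic stand-in, not the paper's exact procedure."""
    pvals = np.asarray(pvals, dtype=float)
    m = len(pvals)
    return pvals <= k * alpha / m

# Hypothetical per-concept p-values from test statistics on a
# sparse concept dictionary (one entry per interpretable concept).
pvals = [0.001, 0.02, 0.2, 0.6]

# k = 1 recovers ordinary FWER control via Bonferroni.
print(k_fwer_rejections(pvals, k=1, alpha=0.05))
# k = 2 tolerates up to one false rejection, loosening the threshold.
print(k_fwer_rejections(pvals, k=2, alpha=0.05))
```

Raising $k$ trades a small, controlled number of false discoveries for substantially more power, which matters when the concept dictionary is high-dimensional.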
Related papers
- Leveraging LLM Parametric Knowledge for Fact Checking without Retrieval [60.25608870901428]
Trustworthiness is a core research challenge for agentic AI systems built on Large Language Models (LLMs). We propose the task of fact-checking without retrieval, focusing on the verification of arbitrary natural language claims, independent of their source robustness.
arXiv Detail & Related papers (2026-03-05T18:42:51Z) - Measurement for Opaque Systems: Multi-source Triangulation with Interpretable Machine Learning [0.0]
We propose a measurement framework that uses indirect data traces, interpretable machine-learning models, and theory-guided triangulation to fill inaccessible measurement spaces. Our framework provides an analytical workflow tailored to quantitative characterization in the absence of data sufficient for conventional statistical or causal inference.
arXiv Detail & Related papers (2026-01-16T20:09:53Z) - Who Gets Cited Most? Benchmarking Long-Context Language Models on Scientific Articles [81.89404347890662]
SciTrek is a novel question-answering benchmark designed to evaluate the long-context reasoning capabilities of large language models (LLMs) using scientific articles. Our analysis reveals systematic shortcomings in models' abilities to perform basic numerical operations and accurately locate specific information in long contexts.
arXiv Detail & Related papers (2025-09-25T11:36:09Z) - Valid Inference with Imperfect Synthetic Data [39.10587411316875]
We introduce a new estimator based on the generalized method of moments. We find that interactions between the moment residuals of synthetic data and those of real data can greatly improve estimates of the target parameter.
arXiv Detail & Related papers (2025-08-08T18:32:52Z) - A Novel, Human-in-the-Loop Computational Grounded Theory Framework for Big Social Data [8.695136686770772]
We argue that confidence in the credibility and robustness of results depends on adopting a 'human-in-the-loop' methodology. We propose a novel methodological framework for Computational Grounded Theory (CGT) that supports the analysis of large qualitative datasets.
arXiv Detail & Related papers (2025-06-06T13:43:12Z) - A Unifying Framework for Robust and Efficient Inference with Unstructured Data [2.07180164747172]
This paper presents a general framework for conducting efficient inference on parameters derived from unstructured data. We formalize this approach with MAR-S, a framework that unifies and extends existing methods for debiased inference. Within this framework, we develop robust and efficient estimators for both descriptive and causal estimands.
arXiv Detail & Related papers (2025-05-01T04:11:25Z) - A Comprehensive Survey on Imbalanced Data Learning [56.65067795190842]
Imbalanced data is prevalent in various types of raw data and hinders the performance of machine learning. This survey systematically analyzes various real-world data formats and categorizes existing research for each format into four categories: data re-balancing, feature representation, training strategy, and ensemble learning.
arXiv Detail & Related papers (2025-02-13T04:53:17Z) - A Survey on Data Selection for Language Models [148.300726396877]
Data selection methods aim to determine which data points to include in a training dataset.
Deep learning is mostly driven by empirical evidence, and experimentation on large-scale data is expensive.
Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z) - Large Language Models for Automated Open-domain Scientific Hypotheses Discovery [50.40483334131271]
This work proposes the first dataset for social science academic hypotheses discovery.
Unlike previous settings, the new dataset requires (1) using open-domain data (raw web corpus) as observations; and (2) proposing hypotheses that may be new to humanity.
A multi-module framework is developed for the task, including three different feedback mechanisms to boost performance.
arXiv Detail & Related papers (2023-09-06T05:19:41Z) - Rethinking Complex Queries on Knowledge Graphs with Neural Link Predictors [58.340159346749964]
We propose a new neural-symbolic method to support end-to-end learning using complex queries with provable reasoning capability.
We develop a new dataset containing ten new types of queries with features that have never been considered.
Our method significantly outperforms previous methods on the new dataset while also surpassing them on the existing dataset.
arXiv Detail & Related papers (2023-04-14T11:35:35Z) - Investigating Fairness Disparities in Peer Review: A Language Model Enhanced Approach [77.61131357420201]
We conduct a thorough and rigorous study on fairness disparities in peer review with the help of large language models (LMs).
We collect, assemble, and maintain a comprehensive relational database for the International Conference on Learning Representations (ICLR) conference from 2017 to date.
We postulate and study fairness disparities on multiple protective attributes of interest, including author gender, geography, and author and institutional prestige.
arXiv Detail & Related papers (2022-11-07T16:19:42Z) - A Pipeline for Analysing Grant Applications [0.0]
This paper investigates whether grant schemes successfully identify innovative project proposals, as intended.
Grant applications are peer-reviewed research proposals that include specific "innovation and creativity" (IC) scores assigned by reviewers.
The best-performing model we propose is a Random Forest (RF) classifier over feature-encoded documents.
arXiv Detail & Related papers (2022-10-30T13:43:53Z) - Epistemic Parity: Reproducibility as an Evaluation Metric for Differential Privacy [9.755020926517291]
We propose an evaluation methodology for synthetic data that avoids assumptions about the representativeness of proxy tasks.
We measure the likelihood that published conclusions would change had the authors used synthetic data.
We advocate for a new class of mechanisms that favor stronger utility guarantees and offer privacy protection.
arXiv Detail & Related papers (2022-08-26T14:57:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.