Dataset Featurization: Uncovering Natural Language Features through Unsupervised Data Reconstruction
- URL: http://arxiv.org/abs/2502.17541v1
- Date: Mon, 24 Feb 2025 18:42:33 GMT
- Title: Dataset Featurization: Uncovering Natural Language Features through Unsupervised Data Reconstruction
- Authors: Michal Bravansky, Vaclav Kubon, Suhas Hariharan, Robert Kirk
- Abstract summary: Large language models (LLMs) show promise in providing natural language interpretations of data. We propose a domain-agnostic method for dataset featurization that provides precise control over the number of features extracted.
- Score: 1.0784083404427411
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Interpreting data is central to modern research. Large language models (LLMs) show promise in providing such natural language interpretations of data, yet simple feature extraction methods such as prompting often fail to produce accurate and versatile descriptions for diverse datasets and lack control over granularity and scale. To address these limitations, we propose a domain-agnostic method for dataset featurization that provides precise control over the number of features extracted while maintaining compact and descriptive representations comparable to human expert labeling. Our method optimizes the selection of informative binary features by evaluating the ability of an LLM to reconstruct the original data using those features. We demonstrate its effectiveness in dataset modeling tasks and through two case studies: (1) constructing a feature representation of jailbreak tactics that compactly captures both the effectiveness and diversity of a larger set of human-crafted attacks; and (2) automating the discovery of features that align with human preferences, achieving accuracy and robustness comparable to expert-crafted features. Moreover, we show that the pipeline scales effectively, improving as additional features are sampled, making it suitable for large and diverse datasets.
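As a rough illustration of the reconstruction-driven selection the abstract describes, the sketch below greedily adds the binary feature whose inclusion most improves an LLM's ability to reconstruct the data. The `reconstruction_score` callable is a hypothetical stand-in for the paper's LLM-based evaluation, not its actual interface.

```python
from typing import Callable, List

def featurize(
    texts: List[str],
    candidates: List[str],            # candidate binary features in natural language
    reconstruction_score: Callable[[List[str], List[str]], float],
    k: int,                           # number of features to extract
) -> List[str]:
    """Greedy sketch: repeatedly add the candidate feature that most
    improves the (LLM-judged) reconstruction of the original texts."""
    selected: List[str] = []
    for _ in range(k):
        best, best_score = None, float("-inf")
        for feat in candidates:
            if feat in selected:
                continue
            score = reconstruction_score(texts, selected + [feat])
            if score > best_score:
                best, best_score = feat, score
        if best is None:
            break
        selected.append(best)
    return selected
```

Fixing k gives the precise control over feature count mentioned in the abstract; the expensive part is the scorer, which would prompt an LLM to regenerate each text from its feature assignments and compare the result to the original.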
Related papers
- A Framework for Fine-Tuning LLMs using Heterogeneous Feedback [69.51729152929413]
We present a framework for fine-tuning large language models (LLMs) using heterogeneous feedback.
First, we combine the heterogeneous feedback data into a single supervision format, compatible with methods like SFT and RLHF.
Next, given this unified feedback dataset, we extract a high-quality and diverse subset to obtain performance increases.
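A minimal sketch of what such unification might look like, assuming three common feedback shapes (pairwise comparisons, scalar ratings, and demonstrations); the record schema here is invented for illustration and is not the paper's actual format:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UnifiedExample:
    prompt: str
    chosen: str
    rejected: Optional[str] = None   # present only for preference pairs

def unify(record: dict) -> UnifiedExample:
    """Map heterogeneous feedback records onto one supervision schema
    usable by both SFT (prompt/chosen) and RLHF-style preference
    training (prompt/chosen/rejected). Record shapes are assumed."""
    kind = record["type"]
    if kind == "comparison":     # pairwise preference feedback
        return UnifiedExample(record["prompt"], record["better"], record["worse"])
    if kind == "rating":         # scalar score: keep the response as an SFT target
        return UnifiedExample(record["prompt"], record["response"])
    if kind == "demonstration":  # human-written reference answer
        return UnifiedExample(record["prompt"], record["answer"])
    raise ValueError(f"unknown feedback type: {kind}")
```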
arXiv Detail & Related papers (2024-08-05T23:20:32Z)
- Dynamic In-context Learning with Conversational Models for Data Extraction and Materials Property Prediction [0.0]
PropertyExtractor is an open-source tool that blends zero-shot with few-shot in-context learning.
Our tests on materials data demonstrate precision and recall that exceed 95% with an error rate of approximately 9%.
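The blend of zero- and few-shot prompting could be realized roughly as below: demonstrations are included only when they are close enough to the query, so the prompt degrades gracefully to zero-shot when nothing relevant exists. The `embed` function (assumed to return unit-norm vectors) and the threshold are illustrative, not the tool's actual interface.

```python
import numpy as np

def build_prompt(query, pool, embed, k=3, tau=0.7):
    """Dynamic in-context learning (sketch): include up to k
    demonstrations whose cosine similarity to the query exceeds tau.
    `pool` holds (input, output) pairs; `embed` returns unit vectors."""
    q = embed(query)
    sims = [(float(np.dot(embed(x), q)), x, y) for x, y in pool]
    sims.sort(reverse=True)
    shots = [(x, y) for s, x, y in sims[:k] if s >= tau]
    demos = "".join(f"Input: {x}\nOutput: {y}\n\n" for x, y in shots)
    return f"{demos}Input: {query}\nOutput:"
```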
arXiv Detail & Related papers (2024-05-16T21:15:51Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
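Conceptually, the selection step reduces to a similarity search over gradient features, sketched below with precomputed low-rank gradient projections; this conveys the flavor of the method, not its exact formulation.

```python
import numpy as np

def less_style_select(train_grads: np.ndarray, target_grad: np.ndarray,
                      frac: float = 0.05) -> np.ndarray:
    """Rank training examples by cosine similarity between their
    (randomly projected, low-rank) gradient features and the mean
    gradient of a few target-task examples; keep the top `frac`."""
    t = target_grad / np.linalg.norm(target_grad)
    g = train_grads / np.linalg.norm(train_grads, axis=1, keepdims=True)
    scores = g @ t
    k = max(1, int(frac * len(scores)))
    return np.argsort(-scores)[:k]   # indices of the selected examples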
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
- One-Shot Learning as Instruction Data Prospector for Large Language Models [108.81681547472138]
Nuggets uses one-shot learning to select high-quality instruction data from extensive datasets.
We show that instruction tuning with the top 1% of examples curated by Nuggets substantially outperforms conventional methods employing the entire dataset.
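The underlying score can be pictured as the log-likelihood gain an example yields when used as a one-shot demonstration on a fixed anchor set; `logprob` below is a hypothetical scoring call into the model.

```python
from typing import Callable, List, Tuple

def one_shot_gain(
    candidate: str,
    anchors: List[Tuple[str, str]],          # (question, answer) anchor pairs
    logprob: Callable[[str, str], float],    # assumed: log p(answer | context)
) -> float:
    """Nuggets-style score (sketch): how much does prepending the
    candidate as a one-shot demo raise likelihood on the anchors?"""
    zero_shot = sum(logprob(ans, q) for q, ans in anchors)
    one_shot = sum(logprob(ans, f"{candidate}\n\n{q}") for q, ans in anchors)
    return one_shot - zero_shot   # rank candidates by this gain, keep the top 1%
```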
arXiv Detail & Related papers (2023-12-16T03:33:12Z)
- Multi-label and Multi-target Sampling of Machine Annotation for Computational Stance Detection [44.90471123149513]
We introduce a multi-label and multi-target sampling strategy to optimize the annotation quality.
Experimental results on the benchmark stance detection corpora show that our method can significantly improve performance and learning efficacy.
arXiv Detail & Related papers (2023-11-08T06:54:34Z)
- UP-DP: Unsupervised Prompt Learning for Data Pre-Selection with Vision-Language Models [24.50445616970387]
We introduce UP-DP, a simple yet effective unsupervised prompt learning approach that adapts vision-language models for data pre-selection.
Specifically, with the BLIP-2 parameters frozen, we train text prompts to extract the joint features with improved representation.
We extensively compare our method with the state-of-the-art using seven benchmark datasets in different settings, achieving a performance gain of up to 20%.
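In PyTorch terms, the setup amounts to optimizing a small bank of prompt embeddings in front of a frozen encoder; the generic `backbone` below stands in for frozen BLIP-2, whose real interface differs.

```python
import torch
import torch.nn as nn

class PromptLearner(nn.Module):
    """Sketch of unsupervised prompt learning: only the prompt
    embeddings receive gradients; the backbone stays frozen."""
    def __init__(self, backbone: nn.Module, n_tokens: int = 8, dim: int = 768):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False                 # freeze the VLM
        self.prompt = nn.Parameter(0.02 * torch.randn(n_tokens, dim))

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # Prepend the learned prompt tokens to every input sequence.
        batch = token_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return self.backbone(torch.cat([prompt, token_embeds], dim=1))
```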
arXiv Detail & Related papers (2023-07-20T20:45:13Z)
- An Empirical Investigation of Commonsense Self-Supervision with Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z)
- Towards Open-World Feature Extrapolation: An Inductive Graph Learning Approach [80.8446673089281]
We propose a new learning paradigm with graph representation and learning.
Our framework contains two modules: 1) a backbone network (e.g., a feedforward neural net) serves as the lower model, taking features as input and outputting predicted labels; 2) a graph neural network serves as the upper model, learning to extrapolate embeddings for new features via message passing over a feature-data graph built from observed data.
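The upper module's core move can be pictured as one mean-aggregation message-passing step on the feature-data bipartite graph: an unseen feature inherits the averaged embeddings of the data points in which it occurs. A loose illustration of that step, not the paper's full GNN:

```python
import torch

def extrapolate_feature(column: torch.Tensor, data_embeds: torch.Tensor) -> torch.Tensor:
    """One message-passing step on the feature-data graph (sketch):
    column[i] > 0 marks that data point i exhibits the new feature;
    its embedding is the mean of those data-point embeddings."""
    mask = column > 0
    if not mask.any():
        return torch.zeros(data_embeds.size(1))
    return data_embeds[mask].mean(dim=0)
```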
arXiv Detail & Related papers (2021-10-09T09:02:45Z)
- Causal Feature Selection for Algorithmic Fairness [61.767399505764736]
We consider fairness in the integration component of data management.
We propose an approach to identify a sub-collection of features that ensure the fairness of the dataset.
arXiv Detail & Related papers (2020-06-10T20:20:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.