Automatic Histograms: Leveraging Language Models for Text Dataset
Exploration
- URL: http://arxiv.org/abs/2402.14880v1
- Date: Wed, 21 Feb 2024 22:29:16 GMT
- Title: Automatic Histograms: Leveraging Language Models for Text Dataset
Exploration
- Authors: Emily Reif, Crystal Qian, James Wexler, Minsuk Kahng
- Abstract summary: We present AutoHistograms, a visualization tool leveraging Large Language Models.
AutoHistograms automatically identifies relevant features, visualizes them with histograms, and allows the user to interactively query the dataset for categories of entities.
In a user study with 10 data workers, we observe that participants can quickly identify insights and explore the data using AutoHistograms.
- Score: 6.273685997216551
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Making sense of unstructured text datasets is perennially difficult, yet
increasingly relevant with Large Language Models. Data workers often rely on
dataset summaries, especially distributions of various derived features. Some
features, like toxicity or topics, are relevant to many datasets, but many
interesting features are domain specific: instruments and genres for a music
dataset, or diseases and symptoms for a medical dataset. Accordingly, data
workers often run custom analyses for each dataset, which is cumbersome and
difficult. We present AutoHistograms, a visualization tool leveragingLLMs.
AutoHistograms automatically identifies relevant features, visualizes them with
histograms, and allows the user to interactively query the dataset for
categories of entities and create new histograms. In a user study with 10 data
workers (n=10), we observe that participants can quickly identify insights and
explore the data using AutoHistograms, and conceptualize a broad range of
applicable use cases. Together, this tool and user study contributeto the
growing field of LLM-assisted sensemaking tools.
Related papers
- infoVerse: A Universal Framework for Dataset Characterization with
Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z) - XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented
Languages [105.54207724678767]
Data scarcity is a crucial issue for the development of highly multilingual NLP systems.
We propose XTREME-UP, a benchmark defined by its focus on the scarce-data scenario rather than zero-shot.
XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies.
arXiv Detail & Related papers (2023-05-19T18:00:03Z) - Visualizing Linguistic Diversity of Text Datasets Synthesized by Large
Language Models [9.808214545408541]
LinguisticLens is a novel inter-active visualization tool for making sense of and analyzing syntactic diversity of datasets.
It supports hierarchical visualization of a text dataset, allowing users to quickly scan for an overview and inspect individual examples.
arXiv Detail & Related papers (2023-05-19T00:53:45Z) - ChartSumm: A Comprehensive Benchmark for Automatic Chart Summarization
of Long and Short Summaries [0.26097841018267615]
Automatic chart to text summarization is an effective tool for the visually impaired people.
In this paper, we propose ChartSumm: a large-scale benchmark dataset consisting of a total of 84,363 charts.
arXiv Detail & Related papers (2023-04-26T15:25:24Z) - Demonstration of InsightPilot: An LLM-Empowered Automated Data
Exploration System [48.62158108517576]
We introduce InsightPilot, an automated data exploration system designed to simplify the data exploration process.
InsightPilot automatically selects appropriate analysis intents, such as understanding, summarizing, and explaining.
In brief, an IQuery is an abstraction and automation of data analysis operations, which mimics the approach of data analysts.
arXiv Detail & Related papers (2023-04-02T07:27:49Z) - Decoding Attention from Gaze: A Benchmark Dataset and End-to-End Models [6.642042615005632]
Eye-tracking has potential to provide rich behavioral data about human cognition in ecologically valid environments.
This paper studies using computer vision tools for "attention decoding", the task of assessing the locus of a participant's overt visual attention over time.
arXiv Detail & Related papers (2022-11-20T12:24:57Z) - Chart-to-Text: A Large-Scale Benchmark for Chart Summarization [9.647079534077472]
We present Chart-to-text, a large-scale benchmark with two datasets and a total of 44,096 charts.
We explain the dataset construction process and analyze the datasets.
arXiv Detail & Related papers (2022-03-12T17:01:38Z) - Data Collection and Labeling of Real-Time IoT-Enabled Bio-Signals in
Everyday Settings for Mental Health Improvement [6.7377504888630675]
Real-time physiological data collection and analysis play a central role in modern well-being applications.
This paper builds a system for the real-time collection and analysis of photoplethysmogram, acceleration, gyroscope, and gravity data from a wearable sensor.
arXiv Detail & Related papers (2021-08-02T20:56:48Z) - REGRAD: A Large-Scale Relational Grasp Dataset for Safe and
Object-Specific Robotic Grasping in Clutter [52.117388513480435]
We present a new dataset named regrad to sustain the modeling of relationships among objects and grasps.
Our dataset is collected in both forms of 2D images and 3D point clouds.
Users are free to import their own object models for the generation of as many data as they want.
arXiv Detail & Related papers (2021-04-29T05:31:21Z) - Simple multi-dataset detection [83.9604523643406]
We present a simple method for training a unified detector on multiple large-scale datasets.
We show how to automatically integrate dataset-specific outputs into a common semantic taxonomy.
Our approach does not require manual taxonomy reconciliation.
arXiv Detail & Related papers (2021-02-25T18:55:58Z) - Open Graph Benchmark: Datasets for Machine Learning on Graphs [86.96887552203479]
We present the Open Graph Benchmark (OGB) to facilitate scalable, robust, and reproducible graph machine learning (ML) research.
OGB datasets are large-scale, encompass multiple important graph ML tasks, and cover a diverse range of domains.
For each dataset, we provide a unified evaluation protocol using meaningful application-specific data splits and evaluation metrics.
arXiv Detail & Related papers (2020-05-02T03:09:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.