Automatic Histograms: Leveraging Language Models for Text Dataset
Exploration
- URL: http://arxiv.org/abs/2402.14880v1
- Date: Wed, 21 Feb 2024 22:29:16 GMT
- Title: Automatic Histograms: Leveraging Language Models for Text Dataset
Exploration
- Authors: Emily Reif, Crystal Qian, James Wexler, Minsuk Kahng
- Abstract summary: We present AutoHistograms, a visualization tool leveraging Large Language Models.
AutoHistograms automatically identifies relevant features, visualizes them with histograms, and allows the user to interactively query the dataset for categories of entities.
In a user study with 10 data workers, we observe that participants can quickly identify insights and explore the data using AutoHistograms.
- Score: 6.273685997216551
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Making sense of unstructured text datasets is perennially difficult, yet
increasingly relevant with Large Language Models. Data workers often rely on
dataset summaries, especially distributions of various derived features. Some
features, like toxicity or topics, are relevant to many datasets, but many
interesting features are domain specific: instruments and genres for a music
dataset, or diseases and symptoms for a medical dataset. Accordingly, data
workers often run custom analyses for each dataset, which is cumbersome and
difficult. We present AutoHistograms, a visualization tool leveraging LLMs.
AutoHistograms automatically identifies relevant features, visualizes them with
histograms, and allows the user to interactively query the dataset for
categories of entities and create new histograms. In a user study with 10 data
workers, we observe that participants can quickly identify insights and
explore the data using AutoHistograms, and conceptualize a broad range of
applicable use cases. Together, this tool and user study contribute to the
growing field of LLM-assisted sensemaking tools.
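The core idea in the abstract (derive a categorical feature per document, then show its distribution) can be illustrated with a minimal sketch. This is not the AutoHistograms implementation: the keyword lexicon `INSTRUMENT_KEYWORDS` and the functions `label_document`, `feature_histogram`, and `render` are hypothetical stand-ins, and a simple keyword lookup replaces the LLM-based labeling the paper describes.

```python
from collections import Counter

# Hypothetical lexicon standing in for an LLM-based labeler (assumption,
# not part of the paper): maps a keyword to a feature category.
INSTRUMENT_KEYWORDS = {
    "guitar": "strings",
    "violin": "strings",
    "piano": "keyboard",
    "drums": "percussion",
}


def label_document(text):
    """Return the set of feature categories mentioned in one document."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    return {cat for kw, cat in INSTRUMENT_KEYWORDS.items() if kw in words}


def feature_histogram(documents):
    """Count how many documents mention each feature category."""
    counts = Counter()
    for doc in documents:
        # A set is used per document, so each category counts once per doc.
        counts.update(label_document(doc))
    return counts


def render(counts, width=20):
    """Render counts as text bars, most frequent category first."""
    if not counts:
        return ""
    peak = max(counts.values())
    lines = []
    for value, n in counts.most_common():
        bar = "#" * max(1, round(width * n / peak))
        lines.append(f"{value:<12} {bar} {n}")
    return "\n".join(lines)


docs = [
    "A slow piano ballad.",
    "Guitar and violin duet",
    "Heavy drums, loud guitar!",
]
counts = feature_histogram(docs)
print(render(counts))
```

Swapping `label_document` for a call to a language model would be the natural extension for domain-specific features, at the cost of latency and the need to validate the labels.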
Related papers
- Capturing and Anticipating User Intents in Data Analytics via Knowledge Graphs [0.061446808540639365]
This work explores the use of Knowledge Graphs (KG) as a basic framework for capturing complex analytics in a human-centered manner.
The data stored in the generated KG can then be exploited to provide assistance (e.g., recommendations) to the users interacting with these systems.
arXiv Detail & Related papers (2024-11-01T20:45:23Z)
- "Show Me What's Wrong!": Combining Charts and Text to Guide Data Analysis [4.016592757754338]
In the context of financial fraud detection, analysts must quickly identify suspicious activity among transactional data.
This is an iterative process made of complex exploratory tasks such as recognizing patterns, grouping, and comparing.
To mitigate the information overload inherent to these steps, we present a tool combining automated information highlights, Large Language Model generated textual insights, and visual analytics.
arXiv Detail & Related papers (2024-10-01T14:16:10Z)
- PUB: Plot Understanding Benchmark and Dataset for Evaluating Large Language Models on Synthetic Visual Data Interpretation [2.1184929769291294]
This paper presents a novel synthetic dataset designed to evaluate the proficiency of large language models in interpreting data visualizations.
Our dataset is generated using controlled parameters to ensure comprehensive coverage of potential real-world scenarios.
We employ multimodal text prompts with questions related to visual data in images to benchmark several state-of-the-art models.
arXiv Detail & Related papers (2024-09-04T11:19:17Z)
- Knowledge-Aware Reasoning over Multimodal Semi-structured Tables [85.24395216111462]
This study investigates whether current AI models can perform knowledge-aware reasoning on multimodal structured data.
We introduce MMTabQA, a new dataset designed for this purpose.
Our experiments highlight substantial challenges for current AI models in effectively integrating and interpreting multiple text and image inputs.
arXiv Detail & Related papers (2024-08-25T15:17:43Z)
- Diffusion Models as Data Mining Tools [87.77999285241219]
This paper demonstrates how to use generative models trained for image synthesis as tools for visual data mining.
We show that after finetuning conditional diffusion models to synthesize images from a specific dataset, we can use these models to define a typicality measure.
This measure assesses how typical visual elements are for different data labels, such as geographic location, time stamps, semantic labels, or even the presence of a disease.
arXiv Detail & Related papers (2024-07-20T17:14:31Z)
- XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages [105.54207724678767]
Data scarcity is a crucial issue for the development of highly multilingual NLP systems.
We propose XTREME-UP, a benchmark defined by its focus on the scarce-data scenario rather than zero-shot.
XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies.
arXiv Detail & Related papers (2023-05-19T18:00:03Z)
- ChartSumm: A Comprehensive Benchmark for Automatic Chart Summarization of Long and Short Summaries [0.26097841018267615]
Automatic chart-to-text summarization is an effective tool for visually impaired people.
In this paper, we propose ChartSumm: a large-scale benchmark dataset consisting of a total of 84,363 charts.
arXiv Detail & Related papers (2023-04-26T15:25:24Z)
- Demonstration of InsightPilot: An LLM-Empowered Automated Data Exploration System [48.62158108517576]
We introduce InsightPilot, an automated data exploration system designed to simplify the data exploration process.
InsightPilot automatically selects appropriate analysis intents, such as understanding, summarizing, and explaining.
In brief, an IQuery is an abstraction and automation of data analysis operations, which mimics the approach of data analysts.
arXiv Detail & Related papers (2023-04-02T07:27:49Z)
- Decoding Attention from Gaze: A Benchmark Dataset and End-to-End Models [6.642042615005632]
Eye-tracking has the potential to provide rich behavioral data about human cognition in ecologically valid environments.
This paper studies using computer vision tools for "attention decoding", the task of assessing the locus of a participant's overt visual attention over time.
arXiv Detail & Related papers (2022-11-20T12:24:57Z)
- Simple multi-dataset detection [83.9604523643406]
We present a simple method for training a unified detector on multiple large-scale datasets.
We show how to automatically integrate dataset-specific outputs into a common semantic taxonomy.
Our approach does not require manual taxonomy reconciliation.
arXiv Detail & Related papers (2021-02-25T18:55:58Z)
- Open Graph Benchmark: Datasets for Machine Learning on Graphs [86.96887552203479]
We present the Open Graph Benchmark (OGB) to facilitate scalable, robust, and reproducible graph machine learning (ML) research.
OGB datasets are large-scale, encompass multiple important graph ML tasks, and cover a diverse range of domains.
For each dataset, we provide a unified evaluation protocol using meaningful application-specific data splits and evaluation metrics.
arXiv Detail & Related papers (2020-05-02T03:09:50Z)