Related papers: Automated Question Generation on Tabular Data for Conversational Data Exploration

Automated Question Generation on Tabular Data for Conversational Data Exploration

URL: http://arxiv.org/abs/2407.12859v1
Date: Wed, 10 Jul 2024 08:07:05 GMT
Title: Automated Question Generation on Tabular Data for Conversational Data Exploration
Authors: Ritwik Chaudhuri, Rajmohan C, Kirushikesh DB, Arvind Agarwal,
Abstract summary: We propose a system that recommends interesting questions in natural language based on relevant slices of a dataset in a conversational setting. We use our own fine-tuned variation of a pre-trained language model(T5) to generate natural language questions in a specific manner.
Score: 1.2574534342156884
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Exploratory data analysis (EDA) is an essential step for analyzing a dataset to derive insights. Several EDA techniques have been explored in the literature. Many of them leverage visualizations through various plots. But it is not easy to interpret them for a non-technical user, and producing appropriate visualizations is also tough when there are a large number of columns. Few other works provide a view of some interesting slices of data but it is still difficult for the user to draw relevant insights from them. Of late, conversational data exploration is gaining a lot of traction among non-technical users. It helps the user to explore the dataset without having deep technical knowledge about the data. Towards this, we propose a system that recommends interesting questions in natural language based on relevant slices of a dataset in a conversational setting. Specifically, given a dataset, we pick a select set of interesting columns and identify interesting slices of such columns and column combinations based on few interestingness measures. We use our own fine-tuned variation of a pre-trained language model(T5) to generate natural language questions in a specific manner. We then slot-fill values in the generated questions and rank them for recommendations. We show the utility of our proposed system in a coversational setting with a collection of real datasets.

Related papers

DSAI: Unbiased and Interpretable Latent Feature Extraction for Data-Centric AI [24.349800949355465]
Large language models (LLMs) often struggle to objectively identify latent characteristics in large datasets. We propose Data Scientist AI (DSAI), a framework that enables unbiased and interpretable feature extraction.
arXiv Detail & Related papers (2024-12-09T08:47:05Z)
DataAgent: Evaluating Large Language Models' Ability to Answer Zero-Shot, Natural Language Queries [0.0]
We evaluate OpenAI's GPT-3.5 as a "Language Data Scientist" (LDS) The model was tested on a diverse set of benchmark datasets to evaluate its performance across multiple standards.
arXiv Detail & Related papers (2024-03-29T22:59:34Z)
Automatic Histograms: Leveraging Language Models for Text Dataset Exploration [6.273685997216551]
We present AutoHistograms, a visualization tool leveraging Large Language Models. AutoHistograms automatically identifies relevant features, visualizes them with histograms, and allows the user to interactively query the dataset for categories of entities. In a user study with 10 data workers, we observe that participants can quickly identify insights and explore the data using AutoHistograms.
arXiv Detail & Related papers (2024-02-21T22:29:16Z)
NNOSE: Nearest Neighbor Occupational Skill Extraction [55.22292957778972]
We tackle the complexity in occupational skill datasets. We employ an external datastore for retrieving similar skills in a dataset-unifying manner. We observe a performance gain in predicting infrequent patterns, with substantial gains of up to 30% span-F1 in cross-dataset settings.
arXiv Detail & Related papers (2024-01-30T15:18:29Z)
Open-Vocabulary Camouflaged Object Segmentation [66.94945066779988]
We introduce a new task, open-vocabulary camouflaged object segmentation (OVCOS) We construct a large-scale complex scene dataset (textbfOVCamo) containing 11,483 hand-selected images with fine annotations and corresponding object classes. By integrating the guidance of class semantic knowledge and the supplement of visual structure cues from the edge and depth information, the proposed method can efficiently capture camouflaged objects.
arXiv Detail & Related papers (2023-11-19T06:00:39Z)
DataFinder: Scientific Dataset Recommendation from Natural Language Descriptions [100.52917027038369]
We operationalize the task of recommending datasets given a short natural language description. To facilitate this task, we build the DataFinder dataset which consists of a larger automatically-constructed training set and a smaller expert-annotated evaluation set. This system, trained on the DataFinder dataset, finds more relevant search results than existing third-party dataset search engines.
arXiv Detail & Related papers (2023-05-26T05:22:36Z)
Demonstration of InsightPilot: An LLM-Empowered Automated Data Exploration System [48.62158108517576]
We introduce InsightPilot, an automated data exploration system designed to simplify the data exploration process. InsightPilot automatically selects appropriate analysis intents, such as understanding, summarizing, and explaining. In brief, an IQuery is an abstraction and automation of data analysis operations, which mimics the approach of data analysts.
arXiv Detail & Related papers (2023-04-02T07:27:49Z)
Aspect-Oriented Summarization through Query-Focused Extraction [23.62412515574206]
Real users' needs often fall more closely into aspects, broad topics in a dataset the user is interested in rather than specific queries. We benchmark extractive query-focused training schemes, and propose a contrastive augmentation approach to train the model. We evaluate on two aspect-oriented datasets and find this approach yields focused summaries, better than those from a generic summarization system.
arXiv Detail & Related papers (2021-10-15T18:06:21Z)
PeCoQ: A Dataset for Persian Complex Question Answering over Knowledge Graph [0.0]
This paper introduces textitPeCoQ, a dataset for Persian question answering. This dataset contains 10,000 complex questions and answers extracted from the Persian knowledge graph, FarsBase. There are different types of complexities in the dataset, such as multi-relation, multi-entity, ordinal, and temporal constraints.
arXiv Detail & Related papers (2021-06-27T08:21:23Z)
REGRAD: A Large-Scale Relational Grasp Dataset for Safe and Object-Specific Robotic Grasping in Clutter [52.117388513480435]
We present a new dataset named regrad to sustain the modeling of relationships among objects and grasps. Our dataset is collected in both forms of 2D images and 3D point clouds. Users are free to import their own object models for the generation of as many data as they want.
arXiv Detail & Related papers (2021-04-29T05:31:21Z)
Inquisitive Question Generation for High Level Text Comprehension [60.21497846332531]
We introduce INQUISITIVE, a dataset of 19K questions that are elicited while a person is reading through a document. We show that readers engage in a series of pragmatic strategies to seek information. We evaluate question generation models based on GPT-2 and show that our model is able to generate reasonable questions.
arXiv Detail & Related papers (2020-10-04T19:03:39Z)
Quda: Natural Language Queries for Visual Data Analytics [33.983060903399554]
We present a new dataset, called Quda, that aims to help V-NLIs recognize analytic tasks from free-form natural language. Our dataset contains $14,035$ diverse user queries, and each is annotated with one or multiple analytic tasks. This work is the first attempt to construct a large-scale corpus for recognizing analytic tasks.
arXiv Detail & Related papers (2020-05-07T05:35:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.