Quality Assessment of Tabular Data using Large Language Models and Code Generation
- URL: http://arxiv.org/abs/2509.10572v2
- Date: Sun, 21 Sep 2025 02:54:05 GMT
- Title: Quality Assessment of Tabular Data using Large Language Models and Code Generation
- Authors: Ashlesha Akella, Akshar Kaul, Krishnasuri Narayanam, Sameep Mehta
- Abstract summary: We present a three-stage framework that combines statistical inlier detection with LLM-driven rule and code generation.
After filtering data samples through traditional clustering, we iteratively prompt LLMs to produce semantically valid quality rules.
To generate reliable quality rules, we aid LLMs with retrieval-augmented generation (RAG) by leveraging external knowledge sources and domain-specific few-shot examples.
- Score: 11.92289180699673
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reliable data quality is crucial for downstream analysis of tabular datasets, yet rule-based validation often struggles with inefficiency, human intervention, and high computational costs. We present a three-stage framework that combines statistical inlier detection with LLM-driven rule and code generation. After filtering data samples through traditional clustering, we iteratively prompt LLMs to produce semantically valid quality rules and synthesize their executable validators through code-generating LLMs. To generate reliable quality rules, we aid LLMs with retrieval-augmented generation (RAG) by leveraging external knowledge sources and domain-specific few-shot examples. Robust guardrails ensure the accuracy and consistency of both rules and code snippets. Extensive evaluations on benchmark datasets confirm the effectiveness of our approach.
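The abstract describes the pipeline but includes no code, so the following Python sketch is only one way such a three-stage flow could look: DBSCAN, the call_llm() helper, and the prompts are hypothetical stand-ins, not the authors' implementation.

```python
# Illustrative sketch only: DBSCAN, call_llm(), and the prompts below are
# hypothetical stand-ins for the three stages described in the abstract.
import ast

import pandas as pd
from sklearn.cluster import DBSCAN


def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around any chat-completion API."""
    raise NotImplementedError("plug in an LLM client here")


# Stage 1: statistical inlier detection via traditional clustering.
# Rows absorbed by a dense cluster are treated as inliers; the rest go on
# to the LLM-driven stages.
def suspicious_rows(df: pd.DataFrame, numeric_cols: list[str]) -> pd.DataFrame:
    labels = DBSCAN().fit_predict(df[numeric_cols])
    return df[labels == -1]


# Stage 2: prompt an LLM for semantically valid quality rules, grounded with
# retrieved domain knowledge and few-shot examples (RAG).
def generate_rules(rows: pd.DataFrame, retrieved_context: str) -> list[str]:
    prompt = (
        "Using the reference material below, propose data-quality rules for "
        "these rows, one rule per line.\n\n"
        f"Reference material:\n{retrieved_context}\n\n"
        f"Rows:\n{rows.to_csv(index=False)}"
    )
    return [line for line in call_llm(prompt).splitlines() if line.strip()]


# Stage 3: have a code-generating LLM synthesize an executable validator per
# rule, guarded by a syntax check and a smoke test before it is accepted.
def synthesize_validator(rule: str, sample: pd.DataFrame):
    code = call_llm(
        "Write a Python function `validate(df)` that returns a boolean Series, "
        f"True where the following rule holds: {rule}"
    )
    ast.parse(code)            # guardrail: generated code must parse
    namespace: dict = {}
    exec(code, namespace)      # guardrail: must define validate()
    validator = namespace["validate"]
    validator(sample)          # guardrail: must run on a small sample
    return validator
```

The guardrails in this sketch are deliberately cheap (a syntax check and a smoke test on a few rows), so a faulty generated validator fails fast before it ever touches the full dataset.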
Related papers
- Curate-Train-Refine: A Closed-Loop Agentic Framework for Zero Shot Classification [2.1937565888932653]
Large language models (LLMs) and high-capacity encoders have advanced zero and few-shot classification, but their inference cost and latency limit practical deployment.
We propose training lightweight text classifiers using dynamically generated supervision from an LLM.
Our method employs an iterative, agentic loop in which the LLM curates training data, analyzes model successes and failures, and synthesizes targeted examples to address observed errors.
arXiv Detail & Related papers (2026-01-23T08:04:09Z)
- Knowledge-to-Data: LLM-Driven Synthesis of Structured Network Traffic for Testbed-Free IDS Evaluation [0.4893345190925178]
This paper investigates whether Large Language Models (LLMs) can operate as controlled knowledge-to-data engines for generating structured synthetic network traffic datasets.
We propose a methodology that combines protocol documentation, attack semantics, and explicit statistical rules to condition LLMs without fine-tuning or access to raw samples.
Results show that, under explicit constraints, LLM-generated datasets can closely approximate the statistical and structural characteristics of real network traffic.
arXiv Detail & Related papers (2026-01-08T15:31:33Z)
- IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis [60.32962597618861]
IDA-Bench is a novel benchmark evaluating large language models in multi-round interactive scenarios.
Agent performance is judged by comparing its final numerical output to the human-derived baseline.
Even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on only 50% of the tasks, highlighting limitations not evident in single-turn tests.
arXiv Detail & Related papers (2025-05-23T09:37:52Z)
- Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination [18.006532081289627]
We propose a novel benchmarking suite for evaluating Code LLMs under potential data contamination.
The suite employs multiple agents to extract and modify the context without altering the core logic, generating semantically equivalent variations.
Results show that it effectively benchmarks reasoning capabilities under contamination risks while generating diverse problem sets to ensure consistent and reliable evaluations.
arXiv Detail & Related papers (2025-03-06T06:56:59Z)
- A Systematic Approach for Assessing Large Language Models' Test Case Generation Capability [0.8287206589886879]
We propose the Generated Benchmark from Control-Flow Structure and Variable Usage Composition (GBCV) approach to evaluate large language models (LLMs).
By leveraging basic control-flow structures and variable usage, GBCV provides a flexible framework to create a spectrum of programs ranging from simple to complex.
Our findings indicate that GPT-4o performs better on complex program structures, while all models effectively detect boundary values in simple conditions but face challenges with arithmetic computations.
arXiv Detail & Related papers (2025-02-05T03:51:44Z)
- Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets.
The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method.
The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
arXiv Detail & Related papers (2024-11-21T02:30:53Z)
- Rule-based Data Selection for Large Language Models [9.886837013587124]
The quality of training data significantly impacts the performance of large language models (LLMs).
There are increasing studies using LLMs to rate and select data based on several human-crafted metrics (rules).
These conventional rule-based approaches often depend too heavily on human heuristics, lack effective metrics for assessing rules, and exhibit limited adaptability to new tasks.
arXiv Detail & Related papers (2024-10-07T03:13:06Z)
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
- DCA-Bench: A Benchmark for Dataset Curation Agents [9.60250892491588]
Data quality issues, such as incomplete documentation, inaccurate labels, ethical concerns, and outdated information, remain common in widely used datasets.
With the surging ability of large language models (LLMs), it is promising to streamline the discovery of hidden dataset issues with LLM agents.
In this work, we establish a benchmark to measure LLM agents' ability to tackle this challenge.
arXiv Detail & Related papers (2024-06-11T14:02:23Z)
- CodecLM: Aligning Language Models with Tailored Synthetic Data [51.59223474427153]
We introduce CodecLM, a framework for adaptively generating high-quality synthetic data for instruction-following abilities.
We first encode seed instructions into metadata, which are concise keywords generated on-the-fly to capture the target instruction distribution.
We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples.
arXiv Detail & Related papers (2024-04-08T21:15:36Z)
- Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
- From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning [52.257422715393574]
We introduce a self-guided methodology for Large Language Models (LLMs) to autonomously discern and select cherry samples from open-source datasets.
Our key innovation, the Instruction-Following Difficulty (IFD) metric, identifies discrepancies between a model's expected responses and its intrinsic generation capability; a minimal scoring sketch follows this list.
arXiv Detail & Related papers (2023-08-23T09:45:29Z)
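As a rough illustration of the IFD idea referenced above, the sketch below scores a sample as the ratio of the model's loss on the answer with the instruction in context to its loss on the answer alone; this specific formulation and the answer_loss helper are assumptions for illustration, not a statement of the paper's exact implementation.

```python
# Hedged sketch of an IFD-style score, assuming the metric is the ratio of the
# answer loss with the instruction in context to the answer loss without it.
# `answer_loss` is a hypothetical placeholder, not a real library API.
def answer_loss(model, context: str, answer: str) -> float:
    """Placeholder: mean per-token cross-entropy of `answer` given `context` under `model`."""
    raise NotImplementedError("plug in your causal-LM scoring code here")


def ifd_score(model, instruction: str, answer: str) -> float:
    conditioned = answer_loss(model, instruction, answer)  # loss of A given Q
    unconditioned = answer_loss(model, "", answer)         # loss of A alone
    return conditioned / unconditioned  # higher ratio -> instruction helps less
```

Under this reading, a higher ratio suggests the instruction contributes little to producing the answer, marking a harder sample.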
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.