Limited Reference, Reliable Generation: A Two-Component Framework for Tabular Data Generation in Low-Data Regimes
- URL: http://arxiv.org/abs/2509.09960v1
- Date: Fri, 12 Sep 2025 04:34:46 GMT
- Title: Limited Reference, Reliable Generation: A Two-Component Framework for Tabular Data Generation in Low-Data Regimes
- Authors: Mingxuan Jiang, Yongxin Wang, Ziyue Dai, Yicun Liu, Hongyi Nie, Sen Liu, Hongfeng Chai
- Abstract summary: ReFine is a framework that guides generation toward domain-specific feature distribution. Experiments on various regression and classification benchmarks demonstrate that ReFine consistently outperforms state-of-the-art methods.
- Score: 7.036974567001374
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Synthetic tabular data generation is increasingly essential in data management, supporting downstream applications when real-world and high-quality tabular data is insufficient. Existing tabular generation approaches, such as generative adversarial networks (GANs), diffusion models, and fine-tuned Large Language Models (LLMs), typically require sufficient reference data, limiting their effectiveness in domain-specific databases with scarce records. While prompt-based LLMs offer flexibility without parameter tuning, they often fail to capture dataset-specific feature-label dependencies and generate redundant data, leading to degradation in downstream task performance. To overcome these issues, we propose ReFine, a framework that (i) derives symbolic "if-then" rules from interpretable models and embeds them into prompts to explicitly guide generation toward domain-specific feature distribution, and (ii) applies a dual-granularity filtering strategy that suppresses over-sampling patterns and selectively refines rare but informative samples to reduce distributional imbalance. Extensive experiments on various regression and classification benchmarks demonstrate that ReFine consistently outperforms state-of-the-art methods, achieving up to 0.44 absolute improvement in R-squared for regression and 10.0 percent relative improvement in F1 score for classification tasks.
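The first component described in the abstract, turning an interpretable model into "if-then" rules and embedding them in a prompt, can be illustrated with a minimal sketch. This is not the authors' implementation: the toy tree, the feature names (`age`, `bmi`), the labels, the helper names `extract_rules` and `build_prompt`, and the prompt wording are all hypothetical, and the hand-written dictionary merely stands in for a fitted decision tree.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical hand-written tree standing in for a fitted interpretable model.
# Internal nodes hold a feature, a threshold, and two children
# ("left" for <= threshold, "right" for > threshold); leaves hold a label.
Node = Dict[str, Any]

TOY_TREE: Node = {
    "feature": "age",
    "threshold": 40.0,
    "left": {"label": "low_risk"},
    "right": {
        "feature": "bmi",
        "threshold": 30.0,
        "left": {"label": "low_risk"},
        "right": {"label": "high_risk"},
    },
}


def extract_rules(node: Node, conditions: Tuple[str, ...] = ()) -> List[str]:
    """Walk every root-to-leaf path and render it as one if-then rule."""
    if "label" in node:
        premise = " and ".join(conditions) if conditions else "always"
        return [f"if {premise} then label = {node['label']}"]
    feat, thr = node["feature"], node["threshold"]
    left = extract_rules(node["left"], conditions + (f"{feat} <= {thr:g}",))
    right = extract_rules(node["right"], conditions + (f"{feat} > {thr:g}",))
    return left + right


def build_prompt(rules: List[str], n_rows: int) -> str:
    """Embed the rules in a generation prompt (the wording is an assumption)."""
    lines = [f"Generate {n_rows} synthetic table rows that respect these rules:"]
    lines += [f"- {rule}" for rule in rules]
    return "\n".join(lines)


rules = extract_rules(TOY_TREE)
prompt = build_prompt(rules, n_rows=5)
print(prompt)
```

Each root-to-leaf path becomes one rule, so the prompt states the feature-label dependencies explicitly rather than leaving the LLM to infer them from a handful of example rows. The dual-granularity filtering step is not covered by this sketch.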
Related papers
- Generative Data Transformation: From Mixed to Unified Data [57.84692191369066]
Taesar is a data-centric framework for target-al regeneration. It encodes cross-domain context into target sequences, enabling standard models to learn intricate dependencies without complex fusion architectures.
arXiv Detail & Related papers (2026-02-26T08:30:09Z)
- TabINR: An Implicit Neural Representation Framework for Tabular Data Imputation [0.6407815281667869]
We introduce TabINR, an auto-decoder based Implicit Neural Representation framework that models tables as neural functions. We evaluate our framework across a diverse range of twelve real-world datasets and multiple missingness mechanisms.
arXiv Detail & Related papers (2025-10-01T17:24:35Z)
- Towards Universal Debiasing for Language Models-based Tabular Data Generation [16.31419748401203]
We introduce a universal debiasing framework that minimizes group-level dependencies by simultaneously reducing the mutual information between advantaged and protected attributes. Our framework effectively balances fairness and utility, offering a scalable and practical solution for debiasing in high-stakes applications.
arXiv Detail & Related papers (2025-09-20T00:06:53Z)
- SPaRFT: Self-Paced Reinforcement Fine-Tuning for Large Language Models [51.74498855100541]
Large language models (LLMs) have shown strong reasoning capabilities when fine-tuned with reinforcement learning (RL). We propose SPaRFT, a self-paced learning framework that enables efficient learning based on the capability of the model being trained.
arXiv Detail & Related papers (2025-08-07T03:50:48Z)
- CART-based Synthetic Tabular Data Generation for Imbalanced Regression [1.342834401139078]
We propose adapting an existing CART-based synthetic data generation method, tailoring it for imbalanced regression. The new method integrates relevance and density-based mechanisms to guide sampling in sparse regions of the target space. Our experimental study focuses on the prediction of extreme target values across benchmark datasets.
arXiv Detail & Related papers (2025-06-03T12:42:20Z)
- Lightweight and Direct Document Relevance Optimization for Generative Information Retrieval [49.669503570350166]
Generative information retrieval (GenIR) is a promising neural retrieval paradigm that formulates document retrieval as a document identifier (docid) generation task. Existing GenIR models suffer from token-level misalignment, where models trained to predict the next token often fail to capture document-level relevance effectively. We propose direct document relevance optimization (DDRO), which aligns token-level docid generation with document-level relevance estimation through direct optimization via pairwise ranking.
arXiv Detail & Related papers (2025-04-07T15:27:37Z)
- Towards Robust Universal Information Extraction: Benchmark, Evaluation, and Solution [66.11004226578771]
Existing robust benchmark datasets have two key limitations. They generate only a limited range of perturbations for a single Information Extraction (IE) task. Considering the powerful generation capabilities of Large Language Models (LLMs), we introduce a new benchmark dataset for Robust UIE, called RUIE-Bench. We show that training with only 15% of the data leads to an average 7.5% relative performance improvement across three IE tasks.
arXiv Detail & Related papers (2025-03-05T05:39:29Z)
- SampleLLM: Optimizing Tabular Data Synthesis in Recommendations [46.689486044254544]
Tabular data synthesis is crucial in machine learning, yet existing general methods are highly data-dependent and often fall short in recommender systems. This limitation arises from their difficulty in capturing complex distributions and understanding feature relationships from sparse and limited data. We propose a novel two-stage framework named SampleLLM to improve the quality of LLM-based data synthesis for recommendation tasks.
arXiv Detail & Related papers (2025-01-27T15:12:27Z)
- Towards Generalizable Trajectory Prediction Using Dual-Level Representation Learning And Adaptive Prompting [107.4034346788744]
Existing vehicle trajectory prediction models struggle with generalizability, prediction uncertainties, and handling complex interactions. We propose Perceiver with Register queries (PerReg+), a novel trajectory prediction framework that introduces: (1) Dual-Level Representation Learning via Self-Distillation (SD) and Masked Reconstruction (MR), capturing global context and fine-grained details; (2) Enhanced Multimodality using register-based queries and pretraining, eliminating the need for clustering and suppression; and (3) Adaptive Prompt Tuning during fine-tuning, freezing the main architecture and optimizing a small number of prompts for efficient adaptation.
arXiv Detail & Related papers (2025-01-08T20:11:09Z)
- A Correlation- and Mean-Aware Loss Function and Benchmarking Framework to Improve GAN-based Tabular Data Synthesis [2.2451409468083114]
We propose a novel correlation- and mean-aware loss function for generative adversarial networks (GANs).
The proposed loss function demonstrates statistically significant improvements over existing methods in capturing the true data distribution.
The benchmarking framework shows that the enhanced synthetic data quality leads to improved performance in downstream machine learning tasks.
arXiv Detail & Related papers (2024-05-27T09:08:08Z)
- Fake It Till Make It: Federated Learning with Consensus-Oriented Generation [52.82176415223988]
We propose federated learning with consensus-oriented generation (FedCOG).
FedCOG consists of two key components at the client side: complementary data generation and knowledge-distillation-based model training.
Experiments on classical and real-world FL datasets show that FedCOG consistently outperforms state-of-the-art methods.
arXiv Detail & Related papers (2023-12-10T18:49:59Z)
- RoPDA: Robust Prompt-based Data Augmentation for Low-Resource Named Entity Recognition [10.03246698225533]
Robust Prompt-based Data Augmentation (RoPDA) for low-resource NER
Based on pre-trained language models (PLMs) with continuous prompt, RoPDA performs entity augmentation and context augmentation.
Experiments on three benchmarks from different domains demonstrate that RoPDA significantly improves upon strong baselines.
arXiv Detail & Related papers (2023-07-11T14:44:14Z)
- Boosting Differentiable Causal Discovery via Adaptive Sample Reweighting [62.23057729112182]
Differentiable score-based causal discovery methods learn a directed acyclic graph from observational data.
We propose a model-agnostic framework to boost causal discovery performance by dynamically learning the adaptive weights for the Reweighted Score function, ReScore.
arXiv Detail & Related papers (2023-03-06T14:49:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.