Beyond Objects: Contextual Synthetic Data Generation for Fine-Grained Classification
- URL: http://arxiv.org/abs/2510.24078v1
- Date: Tue, 28 Oct 2025 05:40:14 GMT
- Title: Beyond Objects: Contextual Synthetic Data Generation for Fine-Grained Classification
- Authors: William Yang, Xindi Wu, Zhiwei Deng, Esin Tureci, Olga Russakovsky
- Abstract summary: Text-to-image (T2I) models are increasingly used for synthetic dataset generation. Fine-tuning a T2I model with a few real examples can help improve the quality of synthetic training data. We propose a fine-tuning strategy BOB (BeyondOBjects) to mitigate these concerns for fine-grained classification.
- Score: 31.116511358786084
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-image (T2I) models are increasingly used for synthetic dataset generation, but generating effective synthetic training data for classification remains challenging. Fine-tuning a T2I model with a few real examples can help improve the quality of synthetic training data; however, it may also cause overfitting and reduce diversity in the generated samples. We propose a fine-tuning strategy BOB (BeyondOBjects) to mitigate these concerns for fine-grained classification. Given a small set of real examples, we first extract class-agnostic attributes such as scene background and object pose. We then explicitly condition on these attributes during fine-tuning of the T2I model and marginalize them out during generation. This design mitigates overfitting, preserves the T2I model's generative prior, reduces estimation errors, and further minimizes unintended inter-class associations. Extensive experiments across multiple T2I models, backbones, and datasets show that our method achieves state-of-the-art performance in low-shot fine-grained classification when augmented with synthetic data. Concretely, BOB outperforms DataDream by 7.4% on the Aircraft dataset (from 50.0% to 57.4% when fine-tuning a CLIP classifier with five real images augmented with 100 synthetic images). In three of the four benchmarks, fine-tuning downstream models with 5 real images augmented with BOB achieves better performance than fine-tuning with 10 real images. Collectively, BOB outperforms prior art in 18 of 24 experimental settings, with 2+% accuracy improvements in 14 of these settings.
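The attribute handling described in the abstract (condition on class-agnostic attributes during fine-tuning, marginalize them out during generation) can be illustrated at the prompt level with a minimal sketch. Everything here is an assumption for illustration, not the authors' implementation: the `Example` record, the caption template, the helper names, and the choice to approximate marginalization by sampling background and pose independently from the attributes pooled over all classes.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One real image, described by hypothetical extracted attributes."""
    label: str       # fine-grained class name
    background: str  # class-agnostic attribute (assumed to come from a captioner)
    pose: str        # class-agnostic attribute (assumed to come from a captioner)


def training_caption(ex: Example) -> str:
    # Fine-tuning caption: the attributes are stated explicitly, so the
    # class token does not have to absorb background or pose.
    return f"a photo of a {ex.label}, {ex.pose}, {ex.background}"


def sample_generation_prompts(examples, label, n, rng):
    # Generation prompts: attributes are drawn from the pool over ALL
    # classes, independently of `label` (an assumed approximation of
    # marginalizing the attributes out).
    backgrounds = [ex.background for ex in examples]
    poses = [ex.pose for ex in examples]
    return [
        f"a photo of a {label}, {rng.choice(poses)}, {rng.choice(backgrounds)}"
        for _ in range(n)
    ]
```

Under these assumptions, a class seen only on a runway in the real images can still be prompted with a background that was observed for a different class, which is one way to read the abstract's claim about reducing unintended inter-class associations.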
Related papers
- Fine-T2I: An Open, Large-Scale, and Diverse Dataset for High-Quality T2I Fine-Tuning [39.35923155873977]
Fine-T2I is a large-scale, high-quality, and fully open dataset for text-to-image fine-tuning. All samples are rigorously filtered for text-image alignment, visual fidelity, and prompt quality. The final dataset contains over 6 million text-image pairs, around 2 TB on disk.
arXiv Detail & Related papers (2026-02-10T06:06:54Z)
- Scaling Transformer-Based Novel View Synthesis Models with Token Disentanglement and Synthetic Data [53.040873127309766]
We propose a token disentanglement process within the transformer architecture, enhancing feature separation and ensuring more effective learning. Our method outperforms existing models on both in-dataset and cross-dataset evaluations.
arXiv Detail & Related papers (2025-09-08T17:58:06Z)
- Stylized Structural Patterns for Improved Neural Network Pre-training [1.8641315013048299]
Deep learning models in computer vision require large datasets of real images, which are difficult to curate and pose privacy and legal concerns. Recent works suggest synthetic data as an alternative, yet models trained with it often underperform. We propose an improved neural fractal formulation through which we introduce a new class of synthetic data. Second, we propose reverse stylization, a technique that transfers visual features from a small, license-free set of real images onto synthetic datasets.
arXiv Detail & Related papers (2025-06-24T09:47:31Z)
- CO-SPY: Combining Semantic and Pixel Features to Detect Synthetic Images by AI [58.35348718345307]
Current efforts to distinguish between real and AI-generated images may lack generalization. We propose a novel framework, Co-Spy, that first enhances existing semantic features. We also create Co-Spy-Bench, a comprehensive dataset comprising 5 real image datasets and 22 state-of-the-art generative models.
arXiv Detail & Related papers (2025-03-24T01:59:29Z)
- Ultra-Resolution Adaptation with Ease [62.56434979517156]
We propose a set of key guidelines for ultra-resolution adaptation termed URAE. We show that tuning minor components of the weight matrices outperforms widely-used low-rank adapters when synthetic data are unavailable. Experiments validate that URAE achieves comparable 2K-generation performance to state-of-the-art closed-source models like FLUX1.1 [Pro] Ultra with only 3K samples and 2K iterations.
arXiv Detail & Related papers (2025-03-20T16:44:43Z)
- Diffusion Curriculum: Synthetic-to-Real Data Curriculum via Image-Guided Diffusion [16.356794123589246]
Low-quality or scarce data has posed significant challenges for training deep neural networks in practice. Diffusion Curriculum (DisCL) adjusts the image guidance level of image synthesis for each training stage. DisCL focuses on high-quality, lower-guidance images to learn features as a warm-up before learning from higher-guidance images that might be weak on diversity or quality.
arXiv Detail & Related papers (2024-10-17T15:33:35Z)
- Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data [69.7174072745851]
We present Synthio, a novel approach for augmenting small-scale audio classification datasets with synthetic data. To overcome the first challenge, we align the generations of the T2A model with the small-scale dataset using preference optimization. To address the second challenge, we propose a novel caption generation technique that leverages the reasoning capabilities of Large Language Models.
arXiv Detail & Related papers (2024-10-02T22:05:36Z)
- DataDream: Few-shot Guided Dataset Generation [90.09164461462365]
We propose a framework for synthesizing classification datasets that more faithfully represents the real data distribution.
DataDream fine-tunes LoRA weights for the image generation model on the few real images before generating the training data using the adapted model.
We then fine-tune LoRA weights for CLIP using the synthetic data to improve downstream image classification over previous approaches on a large variety of datasets.
arXiv Detail & Related papers (2024-07-15T17:10:31Z)
- Zero-Shot Distillation for Image Encoders: How to Make Effective Use of Synthetic Data [40.37396692278567]
We focus on training smaller variants of the image encoder, which suffices for efficient zero-shot classification.
The use of synthetic data has shown promise in distilling representations from larger teachers, resulting in strong few-shot and linear probe performance.
We find that this approach surprisingly fails in true zero-shot settings when using contrastive losses.
arXiv Detail & Related papers (2024-04-25T14:24:41Z)
- TarGEN: Targeted Data Generation with Large Language Models [51.87504111286201]
TarGEN is a multi-step prompting strategy for generating high-quality synthetic datasets.
We augment TarGEN with a method known as self-correction empowering LLMs to rectify inaccurately labeled instances.
A comprehensive analysis of the synthetic dataset compared to the original dataset reveals similar or higher levels of dataset complexity and diversity.
arXiv Detail & Related papers (2023-10-27T03:32:17Z)
- Feedback-guided Data Synthesis for Imbalanced Classification [10.836265321046561]
We introduce a framework for augmenting static datasets with useful synthetic samples.
We find that the samples must be close to the support of the real data of the task at hand, and be sufficiently diverse.
On ImageNet-LT, we achieve state-of-the-art results, with over 4 percent improvement on underrepresented classes.
arXiv Detail & Related papers (2023-09-29T21:47:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.