EDITS: Enhancing Dataset Distillation with Implicit Textual Semantics
- URL: http://arxiv.org/abs/2509.13858v1
- Date: Wed, 17 Sep 2025 09:48:39 GMT
- Title: EDITS: Enhancing Dataset Distillation with Implicit Textual Semantics
- Authors: Qianxin Xia, Jiawei Du, Guoming Lu, Zhiyong Shu, Jielei Wang
- Abstract summary: EDITS is a novel framework that exploits the implicit textual semantics within the image data to achieve enhanced distillation.
- Score: 12.818622596576775
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dataset distillation aims to synthesize a compact dataset from the original large-scale one, enabling highly efficient learning while preserving competitive model performance. However, traditional techniques primarily capture low-level visual features, neglecting the high-level semantic and structural information inherent in images. In this paper, we propose EDITS, a novel framework that exploits the implicit textual semantics within the image data to achieve enhanced distillation. First, external texts generated by a Vision Language Model (VLM) are fused with image features through a Global Semantic Query module, forming the prior clustered buffer. Local Semantic Awareness then selects representative samples from the buffer to construct image and text prototypes, with the latter produced by guiding a Large Language Model (LLM) with a meticulously crafted prompt. Finally, a Dual Prototype Guidance strategy generates the final synthetic dataset through a diffusion model. Extensive experiments confirm the effectiveness of our method. Source code is available at: https://github.com/einsteinxia/EDITS.
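The abstract mentions selecting representative samples from a clustered buffer. The sketch below illustrates one generic way such a selection could work: pick the sample nearest to each cluster centroid. It is not the authors' Local Semantic Awareness module; the function name `select_representatives`, the toy 2-D embeddings, and the nearest-to-centroid criterion are all assumptions made for illustration. The linked repository is the reference for the actual method.

```python
import math
from collections import defaultdict


def select_representatives(embeddings, cluster_ids, per_cluster=1):
    """Return, for each cluster, the indices of the samples closest to its centroid.

    Hypothetical helper: nearest-to-centroid selection is an assumed stand-in,
    not the selection rule described in the paper.
    """
    # Group sample indices by their cluster label.
    groups = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        groups[cid].append(idx)

    selected = {}
    for cid, members in groups.items():
        dim = len(embeddings[members[0]])
        # Centroid = coordinate-wise mean of the cluster's embeddings.
        centroid = [
            sum(embeddings[i][d] for i in members) / len(members)
            for d in range(dim)
        ]
        # Rank members by Euclidean distance to the centroid (closest first).
        ranked = sorted(members, key=lambda i: math.dist(embeddings[i], centroid))
        selected[cid] = ranked[:per_cluster]
    return selected


# Toy 2-D vectors standing in for fused image-text features (made-up values).
embeddings = [(0.0, 0.0), (1.0, 0.0), (0.4, 0.1), (5.0, 5.0), (5.2, 5.1), (9.0, 9.0)]
cluster_ids = [0, 0, 0, 1, 1, 1]

representatives = select_representatives(embeddings, cluster_ids)
print(representatives)
```

In a real pipeline the embeddings would come from an image or vision-language encoder, and the chosen samples would then feed whatever prototype construction step the method defines.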
Related papers
- Can Synthetic Images Serve as Effective and Efficient Class Prototypes? [4.813908624670794]
Contrastive Language-Image Pre-training (CLIP) relies on annotated text-to-image pairs for aligning visual and textual modalities.
This dependency introduces substantial cost and accuracy requirements in preparing high-quality datasets.
We introduce a Contrastive Language-Image Pre-training via Large-Language-Model-based Generation (LGCLIP) framework.
arXiv Detail & Related papers (2025-12-19T01:39:43Z)
- Dataset Distillation via Vision-Language Category Prototype [14.526547847730548]
We introduce vision-language methods to distill language information and collaboratively synthesize data with image prototypes.
This framework demonstrates broad applicability across datasets without pre-existing text descriptions.
The proposed approach generates logically coherent images containing target objects, achieving state-of-the-art validation performance and demonstrating robust generalization.
arXiv Detail & Related papers (2025-06-30T07:34:33Z)
- RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm [34.02250139766494]
Contrastive Language-Image Pre-training (CLIP) demonstrates promising performance on a variety of benchmarks.
A substantial volume of multimodal interleaved documents remains underutilized for contrastive vision-language representation learning.
We establish a Real-World Data Extraction pipeline to extract high-quality images and texts.
Then we design a hierarchical retrieval method to efficiently associate each image with multiple semantically relevant realistic texts.
We construct RealSyn, a dataset combining realistic and synthetic texts, available in three scales: 15M, 30M, and 100M.
arXiv Detail & Related papers (2025-02-18T03:58:38Z)
- ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification [52.405499816861635]
Multiple instance learning (MIL)-based frameworks have become the mainstream for processing whole slide images (WSIs).
We propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification.
arXiv Detail & Related papers (2025-02-12T13:28:46Z)
- Synth$^2$: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings [16.28853186016663]
We create synthetic image-text pairs for efficient and effective Visual-Language Models (VLMs) training.
Our method employs a pretrained text-to-image model to synthesize image embeddings from captions generated by an LLM.
Our VLM, finetuned on synthetic data, achieves comparable performance to models trained solely on human-annotated data.
arXiv Detail & Related papers (2024-03-12T15:36:42Z)
- Paragraph-to-Image Generation with Information-Enriched Diffusion Model [62.81033771780328]
ParaDiffusion is an information-enriched diffusion model for the paragraph-to-image generation task.
It delves into the transference of the extensive semantic comprehension capabilities of large language models to the task of image generation.
The code and dataset will be released to foster community research on long-text alignment.
arXiv Detail & Related papers (2023-11-24T05:17:01Z)
- Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to the significant modality gap, fine-grained differences, and insufficient annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
- DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models [61.906934570771256]
We present a generic dataset generation model that can produce diverse synthetic images and perception annotations.
Our method builds upon the pre-trained diffusion model and extends text-guided image synthesis to perception data generation.
We show that the rich latent code of the diffusion model can be effectively decoded as accurate perception annotations using a decoder module.
arXiv Detail & Related papers (2023-08-11T14:38:11Z)
- Semantic Image Synthesis via Diffusion Models [174.24523061460704]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks.
Recent work on semantic image synthesis mainly follows the de facto GAN-based approaches.
We propose a novel framework based on DDPM for semantic image synthesis.
arXiv Detail & Related papers (2022-06-30T18:31:51Z)
- Towards Open-World Text-Guided Face Image Generation and Manipulation [52.83401421019309]
We propose a unified framework for both face image generation and manipulation.
Our method supports open-world scenarios, including both image and text, without any re-training, fine-tuning, or post-processing.
arXiv Detail & Related papers (2021-04-18T16:56:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.