Leveraging Multi-Modal Information to Enhance Dataset Distillation
- URL: http://arxiv.org/abs/2505.08605v2
- Date: Thu, 15 May 2025 08:19:15 GMT
- Title: Leveraging Multi-Modal Information to Enhance Dataset Distillation
- Authors: Zhe Li, Hadrien Reynaud, Bernhard Kainz
- Abstract summary: We introduce two key enhancements to dataset distillation: caption-guided supervision and object-centric masking. To integrate textual information, we propose two strategies for leveraging caption features: feature concatenation and caption matching. Comprehensive evaluations demonstrate that integrating caption-based guidance and object-centric masking enhances dataset distillation.
- Score: 9.251951276795255
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Dataset distillation aims to create a compact and highly representative synthetic dataset that preserves the knowledge of a larger real dataset. While existing methods primarily focus on optimizing visual representations, incorporating additional modalities and refining object-level information can significantly improve the quality of distilled datasets. In this work, we introduce two key enhancements to dataset distillation: caption-guided supervision and object-centric masking. To integrate textual information, we propose two strategies for leveraging caption features: feature concatenation, where caption embeddings are fused with visual features at the classification stage, and caption matching, which introduces a caption-based alignment loss during training to ensure semantic coherence between real and synthetic data. Additionally, we apply segmentation masks to isolate target objects and remove background distractions, introducing two loss functions designed for object-centric learning: a masked feature alignment loss and a masked gradient matching loss. Comprehensive evaluations demonstrate that integrating caption-based guidance and object-centric masking enhances dataset distillation, leading to synthetic datasets that achieve superior performance on downstream tasks.
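The abstract does not include an implementation, but the four components it names can be sketched in minimal NumPy form. Everything below is illustrative: the function names, the squared-distance form of the alignment losses, the L2 normalization, and the cosine form of the gradient-matching term are assumptions, not the paper's exact formulation.

```python
import numpy as np


def concat_caption_features(visual_feats, caption_embeds):
    """Strategy 1 (feature concatenation): fuse caption embeddings with
    visual features before the classification head."""
    return np.concatenate([visual_feats, caption_embeds], axis=1)


def caption_matching_loss(real_caption_feats, syn_feats):
    """Strategy 2 (caption matching): an alignment loss pulling
    synthetic-image features toward real caption features, sketched as
    mean squared distance between L2-normalized feature vectors."""
    r = real_caption_feats / np.linalg.norm(real_caption_feats, axis=1, keepdims=True)
    s = syn_feats / np.linalg.norm(syn_feats, axis=1, keepdims=True)
    return float(np.mean(np.sum((r - s) ** 2, axis=1)))


def masked_feature_alignment_loss(real_feats, syn_feats, mask_real, mask_syn):
    """Masked feature alignment: compare mean feature statistics computed
    only over foreground positions (segmentation mask == 1), so background
    pixels do not contribute. feats: (N, C, H, W); masks: (N, 1, H, W)."""
    def masked_mean(f, m):
        return (f * m).sum(axis=(0, 2, 3)) / np.maximum(m.sum(), 1e-8)
    diff = masked_mean(real_feats, mask_real) - masked_mean(syn_feats, mask_syn)
    return float(np.sum(diff ** 2))


def masked_gradient_matching_loss(grads_real, grads_syn):
    """Masked gradient matching: match per-layer network gradients computed
    from object-only (masked) images, sketched as summed cosine distance."""
    total = 0.0
    for gr, gs in zip(grads_real, grads_syn):
        gr, gs = gr.ravel(), gs.ravel()
        total += 1.0 - gr @ gs / (np.linalg.norm(gr) * np.linalg.norm(gs) + 1e-8)
    return float(total)
```

In an actual distillation loop these losses would be computed on features and gradients from a trainable network and backpropagated into the synthetic images; the sketch only fixes the shape of each term.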
Related papers
- "Principal Components" Enable A New Language of Images [79.45806370905775]
We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space. Our approach achieves state-of-the-art reconstruction performance and enables interpretability better aligned with the human visual system.
arXiv Detail & Related papers (2025-03-11T17:59:41Z) - Enhancing Image Matting in Real-World Scenes with Mask-Guided Iterative Refinement [4.006320049969407]
Mask2Alpha is an iterative refinement framework designed to enhance semantic comprehension, instance awareness, and fine-detail recovery in image matting. Our framework leverages self-supervised Vision Transformer features as semantic priors, strengthening contextual understanding in complex scenarios. Mask2Alpha consistently achieves state-of-the-art results, showcasing its effectiveness in accurate and efficient image matting.
arXiv Detail & Related papers (2025-02-24T12:16:28Z) - Enhancing Generalization via Sharpness-Aware Trajectory Matching for Dataset Condensation [37.77454972709646]
We introduce Sharpness-Aware Trajectory Matching (SATM), which enhances the generalization capability of learned synthetic datasets. Our approach is mathematically well-founded, straightforward to implement, and has controllable computational overhead.
arXiv Detail & Related papers (2025-02-03T22:30:06Z) - Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z) - Prioritize Alignment in Dataset Distillation [27.71563788300818]
Existing methods use the agent model to extract information from the target dataset and embed it into the distilled dataset.
We find that existing methods introduce misaligned information in both information extraction and embedding stages.
We propose Prioritize Alignment in dataset Distillation (PAD), which aligns information from the following two perspectives.
arXiv Detail & Related papers (2024-08-06T17:07:28Z) - Hierarchical Features Matter: A Deep Exploration of Progressive Parameterization Method for Dataset Distillation [44.03611131165989]
We propose a novel generative parameterization method dubbed Hierarchical generative Distillation (H-PD). The proposed H-PD achieves a significant performance improvement under various settings with equivalent time consumption. It even surpasses current generative distillation using diffusion models under extreme compression ratios IPC=1 and IPC=10.
arXiv Detail & Related papers (2024-06-09T09:15:54Z) - ATOM: Attention Mixer for Efficient Dataset Distillation [17.370852204228253]
We propose a module to efficiently distill large datasets using a mixture of channel and spatial-wise attention. By integrating both types of attention, our ATOM module demonstrates superior performance across various computer vision datasets.
arXiv Detail & Related papers (2024-05-02T15:15:01Z) - Semi-Supervised Image Captioning by Adversarially Propagating Labeled Data [95.0476489266988]
We present a novel data-efficient semi-supervised framework to improve the generalization of image captioning models.
Our proposed method trains a captioner to learn from a paired data and to progressively associate unpaired data.
We present extensive empirical results on both (1) image-based and (2) dense region-based captioning datasets, followed by comprehensive analysis on the scarcely-paired dataset.
arXiv Detail & Related papers (2023-01-26T15:25:43Z) - Dataset Distillation via Factorization [58.8114016318593]
We introduce a *dataset factorization* approach, termed *HaBa*, which is a plug-and-play strategy portable to any existing dataset distillation (DD) baseline.
*HaBa* explores decomposing a dataset into two components: data *Hallucination* networks and *Bases*.
Our method can yield significant improvement on downstream classification tasks compared with previous state-of-the-art methods, while reducing the total number of compressed parameters by up to 65%.
arXiv Detail & Related papers (2022-10-30T08:36:19Z) - Self-Supervised Visual Representation Learning with Semantic Grouping [50.14703605659837]
We tackle the problem of learning visual representations from unlabeled scene-centric data.
We propose contrastive learning from data-driven semantic slots, namely SlotCon, for joint semantic grouping and representation learning.
arXiv Detail & Related papers (2022-05-30T17:50:59Z) - Attribute Descent: Simulating Object-Centric Datasets on the Content Level and Beyond [17.949962340691673]
Between synthetic and real, a two-level domain gap exists, involving content level and appearance level.
We propose an attribute descent approach that automatically optimizes engine attributes to enable synthetic data to approximate real-world data.
Experiments on image classification and object re-identification confirm that adapted synthetic data can be effectively used in three scenarios.
arXiv Detail & Related papers (2022-02-28T18:58:05Z) - Synthetic Convolutional Features for Improved Semantic Segmentation [139.5772851285601]
We suggest to generate intermediate convolutional features and propose the first synthesis approach that is catered to such intermediate convolutional features.
This allows us to generate new features from label masks and include them successfully into the training procedure.
Experimental results and analysis on two challenging datasets, Cityscapes and ADE20K, show that our generated features improve performance on segmentation tasks.
arXiv Detail & Related papers (2020-09-18T14:12:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.