Exploring Multilingual Text Data Distillation
- URL: http://arxiv.org/abs/2308.04982v1
- Date: Wed, 9 Aug 2023 14:31:57 GMT
- Title: Exploring Multilingual Text Data Distillation
- Authors: Shivam Sahni, Harsh Patel
- Abstract summary: We propose several data distillation techniques for multilingual text classification datasets using language-model-based learning methods.
We conduct experiments to analyze their performance in terms of classification strength and cross-architecture generalization.
Our approach builds upon existing techniques, enhancing cross-architecture generalization in the text data distillation domain.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the rise of deep learning, large datasets and complex models have become
common, requiring significant computing power. To address this, data
distillation has emerged as a technique to quickly train models with lower
memory and time requirements. However, data distillation on text-based datasets
has not been explored much because of the challenges arising from the discrete
nature of text. Additionally, existing dataset distillation methods often struggle to
generalize to new architectures. In this paper, we propose several data
distillation techniques for multilingual text classification datasets using
language-model-based learning methods. We conduct experiments to analyze their
performance in terms of classification strength and cross-architecture
generalization. Furthermore, we investigate the language-specific fairness of
the data summaries generated by these methods. Our approach builds upon
existing techniques, enhancing cross-architecture generalization in the text
data distillation domain.
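To make the general idea of data distillation concrete, the following is a minimal sketch of one of the crudest possible data summaries: replacing each class with the mean of its feature vectors and classifying by the nearest such prototype. This is not the method proposed in the paper. The random Gaussian features are a toy stand-in for text embeddings, and the function names `distill_class_means` and `predict_nearest` are hypothetical.

```python
import numpy as np


def distill_class_means(features, labels):
    """Return one synthetic point per class: the mean of that class's features."""
    classes = np.unique(labels)
    prototypes = np.stack([features[labels == c].mean(axis=0) for c in classes])
    return prototypes, classes


def predict_nearest(prototypes, proto_labels, features):
    """Label each row of `features` with the label of its closest prototype."""
    # Pairwise squared Euclidean distances, shape (n_samples, n_prototypes).
    dists = ((features[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    return proto_labels[dists.argmin(axis=1)]


# Assumption: toy data standing in for sentence embeddings of two classes.
rng = np.random.default_rng(0)
n_per_class, dim = 200, 16
x = np.concatenate([
    rng.normal(-2.0, 1.0, (n_per_class, dim)),
    rng.normal(+2.0, 1.0, (n_per_class, dim)),
])
y = np.repeat([0, 1], n_per_class)

# 400 examples are summarized by 2 synthetic points.
prototypes, proto_labels = distill_class_means(x, y)

# Accuracy is measured on the same toy data the prototypes came from,
# so it only shows that the summary retains the class structure.
accuracy = (predict_nearest(prototypes, proto_labels, x) == y).mean()
print(prototypes.shape, accuracy)
```

Learned distillation methods optimize the synthetic points rather than averaging them, but the evaluation pattern is the same: train on the small summary, then compare against a model trained on the full data.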
Related papers
- Language Modeling on Tabular Data: A Survey of Foundations, Techniques and Evolution [7.681258910515419]
Tabular data presents unique challenges due to its heterogeneous nature and complex structural relationships.
High predictive performance and robustness in tabular data analysis hold significant promise for numerous applications.
The recent advent of large language models, such as GPT and LLaMA, has further revolutionized the field, facilitating more advanced and diverse applications with minimal fine-tuning.
arXiv Detail & Related papers (2024-08-20T04:59:19Z)
- Generative Dataset Distillation: Balancing Global Structure and Local Details [49.20086587208214]
We propose a new dataset distillation method that considers balancing global structure and local details.
Our method involves using a conditional generative adversarial network to generate the distilled dataset.
arXiv Detail & Related papers (2024-04-26T23:46:10Z)
- One Category One Prompt: Dataset Distillation using Diffusion Models [22.512552596310176]
We introduce Dataset Distillation using Diffusion Models (D3M) as a novel paradigm for dataset distillation, leveraging recent advancements in generative text-to-image foundation models.
Our approach utilizes textual inversion, a technique for fine-tuning text-to-image generative models, to create concise and informative representations for large datasets.
arXiv Detail & Related papers (2024-03-11T20:23:59Z)
- Text2Data: Low-Resource Data Generation with Textual Control [104.38011760992637]
Natural language serves as a common and straightforward control signal for humans to interact seamlessly with machines.
We propose Text2Data, a novel approach that utilizes unlabeled data to understand the underlying data distribution through an unsupervised diffusion model.
It undergoes controllable finetuning via a novel constraint optimization-based learning objective that ensures controllability and effectively counteracts catastrophic forgetting.
arXiv Detail & Related papers (2024-02-08T03:41:39Z)
- Generalizing Dataset Distillation via Deep Generative Prior [75.9031209877651]
We propose to distill an entire dataset's knowledge into a few synthetic images.
The idea is to synthesize a small number of synthetic data points that, when given to a learning algorithm as training data, result in a model approximating one trained on the original data.
We present a new optimization algorithm that distills a large number of images into a few intermediate feature vectors in the generative model's latent space.
arXiv Detail & Related papers (2023-05-02T17:59:31Z)
- A Comprehensive Survey of Dataset Distillation [73.15482472726555]
It has become challenging to handle the unlimited growth of data with limited computing power.
Deep learning technology has developed unprecedentedly in the last decade.
This paper provides a holistic understanding of dataset distillation from multiple aspects.
arXiv Detail & Related papers (2023-01-13T15:11:38Z)
- DC-BENCH: Dataset Condensation Benchmark [79.18718490863908]
This work provides the first large-scale standardized benchmark on dataset condensation.
It consists of a suite of evaluations to comprehensively reflect the generalizability and effectiveness of condensation methods.
The benchmark library is open-sourced to facilitate future research and application.
arXiv Detail & Related papers (2022-07-20T03:54:05Z)
- Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG).
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models and verifies the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)