MNIST-Gen: A Modular MNIST-Style Dataset Generation Using Hierarchical Semantics, Reinforcement Learning, and Category Theory
- URL: http://arxiv.org/abs/2507.11821v1
- Date: Wed, 16 Jul 2025 00:50:09 GMT
- Title: MNIST-Gen: A Modular MNIST-Style Dataset Generation Using Hierarchical Semantics, Reinforcement Learning, and Category Theory
- Authors: Pouya Shaeri, Arash Karimi, Ariane Middel,
- Abstract summary: We present MNIST-Gen, an automated, modular, and adaptive framework for generating MNIST-style image datasets tailored to user-specified categories.<n>The system combines CLIP-based semantic understanding with reinforcement learning and human feedback to achieve intelligent categorization with minimal manual intervention.<n>We benchmark two novel datasets-textitTree-MNIST and textitFood-MNIST-demonstrating MNIST-Gen's utility for producing task-specific evaluation data.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural networks are often benchmarked using standard datasets such as MNIST, FashionMNIST, or other variants of MNIST, which, while accessible, are limited to generic classes such as digits or clothing items. For researchers working on domain-specific tasks, such as classifying trees, food items, or other real-world objects, these data sets are insufficient and irrelevant. Additionally, creating and publishing a custom dataset can be time consuming, legally constrained, or beyond the scope of individual projects. We present MNIST-Gen, an automated, modular, and adaptive framework for generating MNIST-style image datasets tailored to user-specified categories using hierarchical semantic categorization. The system combines CLIP-based semantic understanding with reinforcement learning and human feedback to achieve intelligent categorization with minimal manual intervention. Our hierarchical approach supports complex category structures with semantic characteristics, enabling fine-grained subcategorization and multiple processing modes: individual review for maximum control, smart batch processing for large datasets, and fast batch processing for rapid creation. Inspired by category theory, MNIST-Gen models each data transformation stage as a composable morphism, enhancing clarity, modularity, and extensibility. As proof of concept, we generate and benchmark two novel datasets-\textit{Tree-MNIST} and \textit{Food-MNIST}-demonstrating MNIST-Gen's utility for producing task-specific evaluation data while achieving 85\% automatic categorization accuracy and 80\% time savings compared to manual approaches.
Related papers
- HERCULES: Hierarchical Embedding-based Recursive Clustering Using LLMs for Efficient Summarization [0.0]
HERCULES is an algorithm and Python package designed for hierarchical k-means clustering of diverse data types.<n>It generates semantically rich titles and descriptions for clusters at each level of the hierarchy.<n>An interactive visualization tool facilitates thorough analysis and understanding of the clustering results.
arXiv Detail & Related papers (2025-06-24T20:22:00Z) - AutoSchemaKG: Autonomous Knowledge Graph Construction through Dynamic Schema Induction from Web-Scale Corpora [51.77079220622184]
We present AutoKG, a framework for fully autonomous knowledge graph construction.<n>We leverage large language models to simultaneously extract knowledge triples and induce comprehensive schemas directly from text.<n>We construct ATLAS (Automated Triple Linking And Induction), a family of knowledge graphs with 900+ million nodes and 5.9 billion edges.
arXiv Detail & Related papers (2025-05-29T16:34:58Z) - ORIGAMI: A generative transformer architecture for predictions from semi-structured data [3.5639148953570836]
ORIGAMI is a transformer-based architecture that processes nested key/value pairs.<n>By reformulating classification as next-token prediction, ORIGAMI naturally handles both single-label and multi-label tasks.
arXiv Detail & Related papers (2024-12-23T07:21:17Z) - Matchmaker: Self-Improving Large Language Model Programs for Schema Matching [60.23571456538149]
We propose a compositional language model program for schema matching, comprised of candidate generation, refinement and confidence scoring.
Matchmaker self-improves in a zero-shot manner without the need for labeled demonstrations.
Empirically, we demonstrate on real-world medical schema matching benchmarks that Matchmaker outperforms previous ML-based approaches.
arXiv Detail & Related papers (2024-10-31T16:34:03Z) - How to Make LLMs Strong Node Classifiers? [70.14063765424012]
Language Models (LMs) are challenging the dominance of domain-specific models, such as Graph Neural Networks (GNNs) and Graph Transformers (GTs)<n>We propose a novel approach that empowers off-the-shelf LMs to achieve performance comparable to state-of-the-art (SOTA) GNNs on node classification tasks.
arXiv Detail & Related papers (2024-10-03T08:27:54Z) - Exploiting Contextual Target Attributes for Target Sentiment
Classification [53.30511968323911]
Existing PTLM-based models for TSC can be categorized into two groups: 1) fine-tuning-based models that adopt PTLM as the context encoder; 2) prompting-based models that transfer the classification task to the text/word generation task.
We present a new perspective of leveraging PTLM for TSC: simultaneously leveraging the merits of both language modeling and explicit target-context interactions via contextual target attributes.
arXiv Detail & Related papers (2023-12-21T11:45:28Z) - How Deep Neural Networks Learn Compositional Data: The Random Hierarchy Model [47.617093812158366]
We introduce the Random Hierarchy Model: a family of synthetic tasks inspired by the hierarchical structure of language and images.
We find that deep networks learn the task by developing internal representations invariant to exchanging equivalent groups.
Our results indicate how deep networks overcome the curse of dimensionality by building invariant representations.
arXiv Detail & Related papers (2023-07-05T09:11:09Z) - Chain-of-Skills: A Configurable Model for Open-domain Question Answering [79.8644260578301]
The retrieval model is an indispensable component for real-world knowledge-intensive tasks.
Recent work focuses on customized methods, limiting the model transferability and scalability.
We propose a modular retriever where individual modules correspond to key skills that can be reused across datasets.
arXiv Detail & Related papers (2023-05-04T20:19:39Z) - SetVAE: Learning Hierarchical Composition for Generative Modeling of
Set-Structured Data [27.274328701618]
We propose SetVAE, a hierarchical variational autoencoder for sets.
Motivated by recent progress in set encoding, we build SetVAE upon attentive modules that first partition the set and project the partition back to the original cardinality.
We demonstrate that our model generalizes to unseen set sizes and learns interesting subset relations without supervision.
arXiv Detail & Related papers (2021-03-29T14:01:18Z) - Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We create new state-of-the-art results on both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z) - Fuzzy Simplicial Networks: A Topology-Inspired Model to Improve Task
Generalization in Few-shot Learning [1.0062040918634414]
Few-shot learning algorithms are designed to generalize well to new tasks with limited data.
We introduce a new few-shot model called Fuzzy Simplicial Networks (FSN) which leverages a construction from topology to more flexibly represent each class from limited data.
arXiv Detail & Related papers (2020-09-23T17:01:09Z) - High-order structure preserving graph neural network for few-shot
learning [10.296473510866228]
Few-shot learning can find the latent structure information between the prior knowledge and the queried data by the similarity metric of meta-learning.
Most existing methods try to model the similarity relationship of the samples in the intra tasks, and generalize the model to identify the new categories.
The proposed high-order structure preserving graph neural network(HOSP-GNN) can explore the rich structure of the samples to predict the label of the queried data on graph.
arXiv Detail & Related papers (2020-05-29T06:38:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.