DINO-MX: A Modular & Flexible Framework for Self-Supervised Learning
- URL: http://arxiv.org/abs/2511.01610v1
- Date: Mon, 03 Nov 2025 14:10:43 GMT
- Title: DINO-MX: A Modular & Flexible Framework for Self-Supervised Learning
- Authors: Mahmut Selman Gokmen, Cody Bumgardner,
- Abstract summary: DINO-MX is a modular and training framework that combines the core principles of DINO, DINOv2 and DINOv3.<n>It supports a variety of transformer-based architectures and is fully compatible with the Hugging Face ecosystem.<n>It is designed to work with both natural and specialized data types, including single- and multi-channel images.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Foundation Models (VFMs) have advanced representation learning through self-supervised methods. However, existing training pipelines are often inflexible, domain-specific, or computationally expensive, which limits their usability across different domains and resource settings. DINO-MX is a modular and extensible training framework that combines the core principles of DINO, DINOv2 and DINOv3 within a unified configuration-driven system. It supports a variety of transformer-based architectures and is fully compatible with the Hugging Face ecosystem. The framework includes multiple training strategies such as low-rank adaptation (LoRA), layer freezing, and knowledge distillation, along with support for distributed training through both Distributed Data Parallel (DDP) and Fully Sharded Data Parallel (FSDP). DINO-MX is designed to work with both natural and specialized data types, including single- and multi-channel images. Experimental results on diverse datasets show that DINO-MX achieves competitive performance while significantly reducing computational costs. Additionally, it offers interpretability tools and a label-guided data augmentation method that improves attention-based localization without the need for extra detection or segmentation heads. DINO-MX provides a reproducible and scalable foundation for developing, adapting, and benchmarking self-supervised vision models across a range of research and real-world applications.
Related papers
- Multi-modal Co-learning for Earth Observation: Enhancing single-modality models via modality collaboration [9.66105329596482]
We propose a novel multi-modal co-learning framework capable of generalizing across various tasks without targeting a specific modality for inference.<n>Our approach combines contrastive and modality discriminative learning together to guide single-modality models to structure the internal model manifold into modality-shared and modality-specific information.
arXiv Detail & Related papers (2025-10-22T13:29:32Z) - AXLearn: Modular Large Model Training on Heterogeneous Infrastructure [64.33868455931301]
AXLearn is a production deep learning system that facilitates scalable and high-performance training of large deep learning models.<n>Compared to other state-of-the-art deep learning systems, AXLearn has a unique focus on modularity and support for heterogeneous hardware infrastructure.
arXiv Detail & Related papers (2025-07-07T18:50:58Z) - XxaCT-NN: Structure Agnostic Multimodal Learning for Materials Science [0.27185251060695437]
We propose a scalable framework that learns directly from elemental composition and X-ray diffraction (XRD)<n>Our architecture integrates modality-specific encoders with a cross-attention fusion module and is trained on the 5-million-sample Alexandria dataset.<n>Our results establish a path toward structure-free, experimentally grounded foundation models for materials science.
arXiv Detail & Related papers (2025-06-27T21:45:56Z) - NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z) - Multi-Modality Co-Learning for Efficient Skeleton-based Action Recognition [12.382193259575805]
We propose a novel multi-modality co-learning (MMCL) framework for efficient skeleton-based action recognition.
Our MMCL framework engages in multi-modality co-learning during the training stage and keeps efficiency by employing only concise skeletons in inference.
arXiv Detail & Related papers (2024-07-22T15:16:47Z) - Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts [54.529880848937104]
We develop a unified MLLM with the MoE architecture, named Uni-MoE, that can handle a wide array of modalities.
Specifically, it features modality-specific encoders with connectors for a unified multimodal representation.
We evaluate the instruction-tuned Uni-MoE on a comprehensive set of multimodal datasets.
arXiv Detail & Related papers (2024-05-18T12:16:01Z) - Noise-powered Multi-modal Knowledge Graph Representation Framework [52.95468915728721]
The rise of Multi-modal Pre-training highlights the necessity for a unified Multi-Modal Knowledge Graph representation learning framework.<n>We propose a novel SNAG method that utilizes a Transformer-based architecture equipped with modality-level noise masking.<n>Our approach achieves SOTA performance across a total of ten datasets, demonstrating its versatility.
arXiv Detail & Related papers (2024-03-11T15:48:43Z) - Domain-Agnostic Neural Architecture for Class Incremental Continual
Learning in Document Processing Platform [3.630365560970225]
Recent methods with learning gradient have been shown to struggle in such setups or have limitations like memory buffers.
We present a fully differentiable architecture based on the Mixture of Experts model, that enables the training of high-performance classifiers when examples from each class are presented separately.
We conducted exhaustive experiments that proved its applicability in various domains and ability to learn online in production environments.
arXiv Detail & Related papers (2023-07-11T16:01:44Z) - i-Code: An Integrative and Composable Multimodal Learning Framework [99.56065789066027]
i-Code is a self-supervised pretraining framework where users may flexibly combine the modalities of vision, speech, and language into unified and general-purpose vector representations.
The entire system is pretrained end-to-end with new objectives including masked modality unit modeling and cross-modality contrastive learning.
Experimental results demonstrate how i-Code can outperform state-of-the-art techniques on five video understanding tasks and the GLUE NLP benchmark, improving by as much as 11%.
arXiv Detail & Related papers (2022-05-03T23:38:50Z) - SINGA-Easy: An Easy-to-Use Framework for MultiModal Analysis [18.084628500554462]
We introduce SINGA-Easy, a new deep learning framework that provides distributed hyper- parameter tuning at the training stage, dynamic computational cost control at the inference stage, and intuitive user interactions with multimedia contents facilitated by model explanation.
Our experiments on the training and deployment of multi-modality data analysis applications show that the framework is both usable and adaptable to dynamic inference loads.
arXiv Detail & Related papers (2021-08-03T08:39:54Z) - Edge-assisted Democratized Learning Towards Federated Analytics [67.44078999945722]
We show the hierarchical learning structure of the proposed edge-assisted democratized learning mechanism, namely Edge-DemLearn.
We also validate Edge-DemLearn as a flexible model training mechanism to build a distributed control and aggregation methodology in regions.
arXiv Detail & Related papers (2020-12-01T11:46:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.