ML-driven detection and reduction of ballast information in multi-modal datasets
- URL: http://arxiv.org/abs/2602.16876v1
- Date: Wed, 18 Feb 2026 21:01:05 GMT
- Title: ML-driven detection and reduction of ballast information in multi-modal datasets
- Authors: Yaroslav Solovko
- Abstract summary: Ballast is redundant or low-utility information that increases dimensionality, storage requirements, and computational cost. This study introduces a generalized, multimodal framework for ballast detection and reduction across structured, semi-structured, unstructured, and sparse data types. A novel Ballast Score is proposed to integrate these signals into a unified, cross-modal pruning strategy.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern datasets often contain ballast: redundant or low-utility information that increases dimensionality, storage requirements, and computational cost without contributing meaningful analytical value. This study introduces a generalized, multimodal framework for ballast detection and reduction across structured, semi-structured, unstructured, and sparse data types. Using diverse datasets, entropy, mutual information, Lasso, SHAP, PCA, topic modelling, and embedding analysis are applied to identify and eliminate ballast features. A novel Ballast Score is proposed to integrate these signals into a unified, cross-modal pruning strategy. Experimental results demonstrate that significant portions of the feature space, often exceeding 70% in sparse or semi-structured data, can be pruned with minimal loss of, or even an improvement in, classification performance, along with substantial reductions in training time and memory footprint. The framework reveals distinct ballast typologies (e.g., statistical, semantic, infrastructural) and offers practical guidance for leaner, more efficient machine learning pipelines.
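As a rough illustration of how two of the signals named in the abstract (entropy and mutual information) could be folded into a single pruning score, here is a minimal Python sketch. It is not the paper's Ballast Score, whose definition is not given on this page. The function `illustrative_ballast_score`, the toy columns, and the 0.7 threshold are all assumptions made for the example.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """Empirical I(X;Y) = H(X) + H(Y) - H(X,Y) for discrete sequences."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def illustrative_ballast_score(column, labels):
    """Hypothetical score in [0, 1]; higher means more likely ballast.

    Assumption for this sketch only: the score is 1 minus the share of
    label entropy that the feature explains. It is NOT the paper's formula.
    """
    h_y = entropy(labels)
    if h_y == 0:
        return 1.0  # labels carry no information, so nothing can be explained
    relevance = mutual_information(column, labels) / h_y
    return 1.0 - min(1.0, max(0.0, relevance))


# Toy data (made up for illustration).
labels = [0, 0, 1, 1, 0, 1, 0, 1]
features = {
    "constant": [1] * 8,                    # zero entropy
    "informative": [0, 0, 1, 1, 0, 1, 0, 1],  # identical to the labels
    "noise": [0, 1, 0, 1, 0, 0, 1, 1],      # independent of the labels
}

THRESHOLD = 0.7  # hypothetical cut-off, not taken from the paper
scores = {name: illustrative_ballast_score(col, labels) for name, col in features.items()}
kept = [name for name, score in scores.items() if score < THRESHOLD]
print(scores, kept)
```

On this toy data only `informative` survives the cut. A unique identifier column would look perfectly informative under empirical mutual information, which is one reason a single signal is unlikely to be sufficient and why combining several, as the abstract describes, is plausible.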
Related papers
- Localized Kernel Projection Outlyingness: A Two-Stage Approach for Multi-Modal Outlier Detection [0.0]
Two-Stage LKPLO is a novel multi-stage outlier detection framework. It overcomes the coexisting limitations of conventional projection-based methods. It achieves state-of-the-art performance on challenging datasets.
arXiv Detail & Related papers (2025-10-28T03:53:46Z)
- Analytical Survey of Learning with Low-Resource Data: From Analysis to Investigation [192.53529928861818]
Learning with high-resource data has demonstrated substantial success in artificial intelligence (AI). However, the costs associated with data annotation and model training remain significant. This survey employs active sampling theory to analyze the generalization error and label complexity associated with learning from low-resource data.
arXiv Detail & Related papers (2025-10-10T03:15:42Z)
- TabINR: An Implicit Neural Representation Framework for Tabular Data Imputation [0.6407815281667869]
We introduce TabINR, an auto-decoder based Implicit Neural Representation framework that models tables as neural functions. We evaluate our framework across a diverse range of twelve real-world datasets and multiple missingness mechanisms.
arXiv Detail & Related papers (2025-10-01T17:24:35Z)
- SPaRFT: Self-Paced Reinforcement Fine-Tuning for Large Language Models [51.74498855100541]
Large language models (LLMs) have shown strong reasoning capabilities when fine-tuned with reinforcement learning (RL). We propose SPaRFT, a self-paced learning framework that enables efficient learning based on the capability of the model being trained.
arXiv Detail & Related papers (2025-08-07T03:50:48Z)
- Efficient Quantification of Multimodal Interaction at Sample Level [12.373485315058513]
We introduce the Lightweight Sample-wise Multimodal Interaction (LSMI) estimator, rigorously grounded in pointwise information theory. We first develop a redundancy estimation framework, employing an appropriate pointwise information measure to quantify this most decomposable interaction. Building upon this, we propose a general interaction estimation method that employs efficient entropy estimation.
arXiv Detail & Related papers (2025-06-08T02:39:25Z)
- AdvKT: An Adversarial Multi-Step Training Framework for Knowledge Tracing [64.79967583649407]
Knowledge Tracing (KT) monitors students' knowledge states and simulates their responses to question sequences. Existing KT models typically follow a single-step training paradigm, which leads to significant error accumulation. We propose a novel Adversarial Multi-Step Training Framework for Knowledge Tracing (AdvKT) which focuses on the multi-step KT task.
arXiv Detail & Related papers (2025-04-07T03:31:57Z)
- Meta-Statistical Learning: Supervised Learning of Statistical Inference [59.463430294611626]
This work demonstrates that the tools and principles driving the success of large language models (LLMs) can be repurposed to tackle distribution-level tasks. We propose meta-statistical learning, a framework inspired by multi-instance learning that reformulates statistical inference tasks as supervised learning problems.
arXiv Detail & Related papers (2025-02-17T18:04:39Z)
- Going Beyond Feature Similarity: Effective Dataset Distillation based on Class-Aware Conditional Mutual Information [43.44508080585033]
We introduce conditional mutual information (CMI) to assess the class-aware complexity of a dataset. We minimize the distillation loss while constraining the class-aware complexity of the synthetic dataset.
arXiv Detail & Related papers (2024-12-13T08:10:47Z)
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, imposing a negative impact on training efficiency and model performance. Data selection has shown promise in identifying the most representative samples from the entire dataset. We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
- Minimally Supervised Learning using Topological Projections in Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs).
Our proposed method first trains SOMs on unlabeled data; a minimal number of available labeled data points are then assigned to key best matching units (BMUs).
Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques.
arXiv Detail & Related papers (2024-01-12T22:51:48Z)
- Attribute-based Explanations of Non-Linear Embeddings of High-Dimensional Data [2.397739143553337]
Non-linear Embeddings Surveyor (NoLiES) combines a novel augmentation strategy for projected data (rangesets) with interactive analysis in a small multiples setting.
Rangesets use a set-based visualization approach for binned attribute values that enables the user to quickly observe structure and detect outliers.
arXiv Detail & Related papers (2021-07-28T12:09:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.