Related papers: Towards High-Performance Exploratory Data Analysis (EDA) Via Stable Equilibrium Point

Related papers

Extending Dataset Pruning to Object Detection: A Variance-based Approach [0.0]
We present the first extension of classification pruning techniques to the object detection domain. We propose tailored solutions, including a novel scoring method called Variance-based Prediction Score (VPS). Our work bridges dataset pruning and object detection, paving the way for dataset pruning in complex vision tasks.
arXiv Detail & Related papers (2025-05-22T19:46:51Z)
Quality over Quantity: Boosting Data Efficiency Through Ensembled Multimodal Data Curation [4.030723722142048]
This paper tackles the challenges associated with the unstructured and heterogeneous nature of webcrawl datasets. We introduce an advanced, learning-driven approach, Ensemble Curation Of DAta ThroUgh Multimodal Operators (EcoDatum). EcoDatum strategically integrates various unimodal and multimodal data curation operators within a weak supervision ensemble framework. It ranked 1st on the DataComp leaderboard, with an average performance score of 0.182 across 38 diverse evaluation datasets.
arXiv Detail & Related papers (2025-02-12T08:40:57Z)
Efficient Multi-Agent System Training with Data Influence-Oriented Tree Search [59.75749613951193]
We propose Data Influence-oriented Tree Search (DITS) to guide both tree search and data selection. By leveraging influence scores, we effectively identify the most impactful data for system improvement. We derive influence score estimation methods tailored for non-differentiable metrics.
arXiv Detail & Related papers (2025-02-02T23:20:16Z)
Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models [79.65071553905021]
We propose Data Advisor, a method for generating data that takes into account the characteristics of the desired dataset. Data Advisor monitors the status of the generated data, identifies weaknesses in the current dataset, and advises the next iteration of data generation.
arXiv Detail & Related papers (2024-10-07T17:59:58Z)
Targeted synthetic data generation for tabular data via hardness characterization [0.0]
We introduce a simple augmentation pipeline that generates only high-value training points based on hardness characterization. Our approach improves the quality of out-of-sample predictions and it is computationally more efficient compared to non-targeted methods.
arXiv Detail & Related papers (2024-10-01T14:54:26Z)
Importance-Aware Adaptive Dataset Distillation [53.79746115426363]
Development of deep learning models is enabled by the availability of large-scale datasets. Dataset distillation aims to synthesize a compact dataset that retains the essential information from the large original dataset. We propose an importance-aware adaptive dataset distillation (IADD) method that can improve distillation performance.
arXiv Detail & Related papers (2024-01-29T03:29:39Z)
A Weighted K-Center Algorithm for Data Subset Selection [70.49696246526199]
Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data. We develop a novel factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions.
arXiv Detail & Related papers (2023-12-17T04:41:07Z)
Dataset Distillation via the Wasserstein Metric [34.06251608504682]
We introduce WMDD (Wasserstein Metric-based dataset Distillation), a straightforward yet powerful method that employs the Wasserstein metric to enhance distribution matching. Our experiments demonstrate WMDD's effectiveness and adaptability, highlighting its potential for advancing machine learning applications at scale.
arXiv Detail & Related papers (2023-11-30T13:15:28Z)
Data-Centric Long-Tailed Image Recognition [49.90107582624604]
Long-tail models exhibit a strong demand for high-quality data. Data-centric approaches aim to enhance both the quantity and quality of data to improve model performance. There is currently a lack of research into the underlying mechanisms explaining the effectiveness of information augmentation.
arXiv Detail & Related papers (2023-11-03T06:34:37Z)
A Comparative Evaluation of FedAvg and Per-FedAvg Algorithms for Dirichlet Distributed Heterogeneous Data [2.5507252967536522]
We investigate Federated Learning (FL), a paradigm of machine learning that allows for decentralized model training on devices without sharing raw data. We compare two strategies within this paradigm: Federated Averaging (FedAvg) and Personalized Federated Averaging (Per-FedAvg). Our results provide insights into the development of more effective and efficient machine learning strategies in a decentralized setting.
arXiv Detail & Related papers (2023-09-03T21:33:15Z)
DBGSA: A Novel Data Adaptive Bregman Clustering Algorithm [2.0232038310495435]
We present a clustering algorithm that is highly sensitive to the initial selection and robustness of datasets. Extensive experiments are conducted on four simulated datasets and six real datasets. Results demonstrate that our algorithm improves the accuracy of various algorithms by an average of 63.8%.
arXiv Detail & Related papers (2023-07-25T16:37:09Z)
Towards Efficient Deep Hashing Retrieval: Condensing Your Data via Feature-Embedding Matching [7.908244841289913]
The cost of training state-of-the-art deep hashing retrieval models has increased. State-of-the-art dataset distillation methods cannot be extended to all deep hashing retrieval methods. We propose an efficient condensation framework that addresses these limitations by matching the feature embedding between the synthetic set and the real set.
arXiv Detail & Related papers (2023-05-29T13:23:55Z)
Adaptive Weighted Multiview Kernel Matrix Factorization with its application in Alzheimer's Disease Analysis -- A clustering Perspective [3.3843930118195407]
We propose a novel model to leverage data from all different modalities/views, which can learn the weights of each view adaptively. Experimental results on the ADNI dataset demonstrate the effectiveness of our proposed method.
arXiv Detail & Related papers (2023-03-07T16:05:24Z)
Cluster-level pseudo-labelling for source-free cross-domain facial expression recognition [94.56304526014875]
We propose the first Source-Free Unsupervised Domain Adaptation (SFUDA) method for Facial Expression Recognition (FER). Our method exploits self-supervised pretraining to learn good feature representations from the target data. We validate the effectiveness of our method in four adaptation setups, proving that it consistently outperforms existing SFUDA methods when applied to FER.
arXiv Detail & Related papers (2022-10-11T08:24:50Z)
Another Use of SMOTE for Interpretable Data Collaboration Analysis [8.143750358586072]
Data collaboration (DC) analysis has been developed for privacy-preserving integrated analysis across multiple institutions. This study proposes an anchor data construction technique to improve the recognition performance without increasing the risk of data leakage.
arXiv Detail & Related papers (2022-08-26T06:39:13Z)
Causal Feature Selection for Algorithmic Fairness [61.767399505764736]
We consider fairness in the integration component of data management. We propose an approach to identify a sub-collection of features that ensure the fairness of the dataset.
arXiv Detail & Related papers (2020-06-10T20:20:10Z)

This list is automatically generated from the titles and abstracts of the papers on this site.