What is different between these datasets?
- URL: http://arxiv.org/abs/2403.05652v1
- Date: Fri, 8 Mar 2024 19:52:39 GMT
- Title: What is different between these datasets?
- Authors: Varun Babbar, Zhicheng Guo, Cynthia Rudin
- Abstract summary: Two comparable datasets in the same domain may have different distributions.
We propose a suite of interpretable methods (toolbox) for comparing two datasets.
Our methods not only outperform comparable and related approaches in terms of explanation quality and correctness, but also provide actionable, complementary insights to understand and mitigate dataset differences effectively.
- Score: 23.271594219577185
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The performance of machine learning models heavily depends on the quality of
input data, yet real-world applications often encounter various data-related
challenges. One such challenge could arise when curating training data or
deploying the model in the real world - two comparable datasets in the same
domain may have different distributions. While numerous techniques exist for
detecting distribution shifts, the literature lacks comprehensive approaches
for explaining dataset differences in a human-understandable manner. To address
this gap, we propose a suite of interpretable methods (toolbox) for comparing
two datasets. We demonstrate the versatility of our approach across diverse
data modalities, including tabular data, language, images, and signals in both
low and high-dimensional settings. Our methods not only outperform comparable
and related approaches in terms of explanation quality and correctness, but
also provide actionable, complementary insights to understand and mitigate
dataset differences effectively.
Related papers
- Flexible inference in heterogeneous and attributed multilayer networks [21.349513661012498]
We develop a probabilistic generative model to perform inference in multilayer networks with arbitrary types of information.
We demonstrate its ability to unveil a variety of patterns in a social support network among villagers in rural India.
arXiv Detail & Related papers (2024-05-31T15:21:59Z) - Interpretable Tensor Fusion [26.314148163750257]
We introduce interpretable tensor fusion (InTense), a method for training neural networks to simultaneously learn multimodal data representations.
InTense provides interpretability out of the box by assigning relevance scores to modalities and their associations.
Experiments on six real-world datasets show that InTense outperforms existing state-of-the-art multimodal interpretable approaches in terms of accuracy and interpretability.
arXiv Detail & Related papers (2024-05-07T21:05:50Z) - UniTraj: A Unified Framework for Scalable Vehicle Trajectory Prediction [93.77809355002591]
We introduce UniTraj, a comprehensive framework that unifies various datasets, models, and evaluation criteria.
We conduct extensive experiments and find that model performance significantly drops when transferred to other datasets.
We provide insights into dataset characteristics to explain these findings.
arXiv Detail & Related papers (2024-03-22T10:36:50Z) - SALUDA: Surface-based Automotive Lidar Unsupervised Domain Adaptation [62.889835139583965]
We introduce an unsupervised auxiliary task of learning an implicit underlying surface representation simultaneously on source and target data.
As both domains share the same latent representation, the model is forced to accommodate discrepancies between the two sources of data.
Our experiments demonstrate that our method achieves a better performance than the current state of the art, both in real-to-real and synthetic-to-real scenarios.
arXiv Detail & Related papers (2023-04-06T17:36:23Z) - Metadata Archaeology: Unearthing Data Subsets by Leveraging Training
Dynamics [3.9627732117855414]
We focus on providing a unified and efficient framework for Metadata Archaeology.
We curate different subsets of data that might exist in a dataset.
We leverage differences in learning dynamics between these probe suites to infer metadata of interest.
arXiv Detail & Related papers (2022-09-20T21:52:39Z) - On Modality Bias Recognition and Reduction [70.69194431713825]
We study the modality bias problem in the context of multi-modal classification.
We propose a plug-and-play loss function method, whereby the feature space for each label is adaptively learned.
Our method yields remarkable performance improvements compared with the baselines.
arXiv Detail & Related papers (2022-02-25T13:47:09Z) - Enhancing ensemble learning and transfer learning in multimodal data
analysis by adaptive dimensionality reduction [10.646114896709717]
In multimodal data analysis, not all observations would show the same level of reliability or information quality.
We propose an adaptive approach for dimensionality reduction to overcome this issue.
We test our approach on multimodal datasets acquired in diverse research fields.
arXiv Detail & Related papers (2021-05-08T11:53:12Z) - DAIL: Dataset-Aware and Invariant Learning for Face Recognition [67.4903809903022]
To achieve good performance in face recognition, a large scale training dataset is usually required.
It is problematic and troublesome to naively combine different datasets due to two major issues.
Naively treating the same person as different classes in different datasets during training will affect back-propagation.
manually cleaning labels may take formidable human efforts, especially when there are millions of images and thousands of identities.
arXiv Detail & Related papers (2021-01-14T01:59:52Z) - Interpretable Multi-dataset Evaluation for Named Entity Recognition [110.64368106131062]
We present a general methodology for interpretable evaluation for the named entity recognition (NER) task.
The proposed evaluation method enables us to interpret the differences in models and datasets, as well as the interplay between them.
By making our analysis tool available, we make it easy for future researchers to run similar analyses and drive progress in this area.
arXiv Detail & Related papers (2020-11-13T10:53:27Z) - Learning while Respecting Privacy and Robustness to Distributional
Uncertainties and Adversarial Data [66.78671826743884]
The distributionally robust optimization framework is considered for training a parametric model.
The objective is to endow the trained model with robustness against adversarially manipulated input data.
Proposed algorithms offer robustness with little overhead.
arXiv Detail & Related papers (2020-07-07T18:25:25Z) - Meta Learning for Causal Direction [29.00522306460408]
We introduce a novel generative model that allows distinguishing cause and effect in the small data setting.
We demonstrate our method on various synthetic as well as real-world data and show that it is able to maintain high accuracy in detecting directions across varying dataset sizes.
arXiv Detail & Related papers (2020-07-06T15:12:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.