RepMatch: Quantifying Cross-Instance Similarities in Representation Space
- URL: http://arxiv.org/abs/2410.09642v1
- Date: Sat, 12 Oct 2024 20:42:28 GMT
- Title: RepMatch: Quantifying Cross-Instance Similarities in Representation Space
- Authors: Mohammad Reza Modarres, Sina Abbasi, Mohammad Taher Pilehvar
- Abstract summary: We introduce RepMatch, a novel method that characterizes data through the lens of similarity.
RepMatch quantifies the similarity between subsets of training instances by comparing the knowledge encoded in models trained on them.
We validate the effectiveness of RepMatch across multiple NLP tasks, datasets, and models.
- Score: 15.215985417763472
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Advances in dataset analysis techniques have enabled more sophisticated approaches to analyzing and characterizing training data instances, often categorizing data based on attributes such as "difficulty". In this work, we introduce RepMatch, a novel method that characterizes data through the lens of similarity. RepMatch quantifies the similarity between subsets of training instances by comparing the knowledge encoded in models trained on them, overcoming the limitations of existing analysis methods that focus solely on individual instances and are restricted to within-dataset analysis. Our framework allows for a broader evaluation, enabling similarity comparisons across arbitrary subsets of instances, supporting both dataset-to-dataset and instance-to-dataset analyses. We validate the effectiveness of RepMatch across multiple NLP tasks, datasets, and models. Through extensive experimentation, we demonstrate that RepMatch can effectively compare datasets, identify more representative subsets of a dataset (that lead to better performance than randomly selected subsets of equivalent size), and uncover heuristics underlying the construction of some challenge datasets.
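The abstract does not spell out how the knowledge encoded in two models is compared, so the following is only a minimal numpy sketch of the idea under one plausible reading: train a separate model per subset (here a trivial ridge-regression stand-in for an NLP model) and score similarity by the overlap of the learned weight subspaces. The names `fit_linear` and `subspace_similarity` are illustrative, not from the paper.

```python
# Hypothetical RepMatch-style comparison: one model per subset, then a
# similarity score from the overlap of the models' weight subspaces.
import numpy as np

def fit_linear(X, y, l2=1e-2):
    """Ridge-regression 'model' trained on one subset (stand-in for an NLP model)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + l2 * np.eye(d), X.T @ y)  # (d, k) weights

def subspace_similarity(W1, W2, rank=4):
    """Mean squared cosine of the principal angles between the top-`rank`
    left singular subspaces of two weight matrices (1.0 = identical)."""
    U1 = np.linalg.svd(W1, full_matrices=False)[0][:, :rank]
    U2 = np.linalg.svd(W2, full_matrices=False)[0][:, :rank]
    return np.mean(np.linalg.svd(U1.T @ U2, compute_uv=False) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 32))
Y = X @ rng.normal(size=(32, 8)) + 0.1 * rng.normal(size=(1000, 8))
idx = rng.permutation(1000)
W_a = fit_linear(X[idx[:500]], Y[idx[:500]])   # model trained on subset A
W_b = fit_linear(X[idx[500:]], Y[idx[500:]])   # model trained on subset B
print(f"RepMatch-style similarity: {subspace_similarity(W_a, W_b):.3f}")
```

Since both halves come from the same distribution, the score should be close to 1; disjoint tasks would drive it toward 0.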
Related papers
- A Similarity-Based Oversampling Method for Multi-label Imbalanced Text Data [1.799933345199395]
This study introduces and examines a novel oversampling method for multi-label text classification.
The proposed method identifies potential new samples from unlabeled data by leveraging similarity measures between instances.
By iteratively searching the unlabeled dataset, the method locates instances similar to those in underrepresented classes.
Instances that demonstrate performance improvement are then added to the labeled dataset.
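A hedged sketch of the selection step described above, with illustrative names (`select_candidates`, `minority_vecs`) that are not from the paper: pick unlabeled instances whose cosine similarity to some minority-class instance exceeds a threshold.

```python
# Select unlabeled instances similar to underrepresented-class instances.
import numpy as np

def select_candidates(minority_vecs, unlabeled_vecs, threshold=0.8):
    """Return indices of unlabeled vectors cosine-similar to any minority-class vector."""
    a = minority_vecs / np.linalg.norm(minority_vecs, axis=1, keepdims=True)
    b = unlabeled_vecs / np.linalg.norm(unlabeled_vecs, axis=1, keepdims=True)
    sims = b @ a.T                      # (n_unlabeled, n_minority) cosine matrix
    return np.where(sims.max(axis=1) >= threshold)[0]

rng = np.random.default_rng(1)
minority = rng.normal(size=(5, 64))
pool = np.vstack([minority + 0.05 * rng.normal(size=(5, 64)),   # near-duplicates
                  rng.normal(size=(50, 64))])                   # unrelated pool
print(select_candidates(minority, pool))  # the five near-duplicates are selected
```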
arXiv Detail & Related papers (2024-11-01T20:33:49Z)
- A structured regression approach for evaluating model performance across intersectional subgroups [53.91682617836498]
Disaggregated evaluation is a central task in AI fairness assessment, where the goal is to measure an AI system's performance across different subgroups.
We introduce a structured regression approach to disaggregated evaluation that we demonstrate can yield reliable system performance estimates even for very small subgroups.
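The paper's exact model is not given in this summary; the sketch below illustrates the general idea with a ridge-regularized regression on subgroup indicators, which shrinks small-subgroup estimates toward the global mean.

```python
# Shrunken per-subgroup accuracy estimates via regularized regression.
import numpy as np

def subgroup_estimates(groups, correct, l2=5.0):
    """Estimate per-group accuracy from 0/1 outcomes, shrunk toward the global mean."""
    uniq = sorted(set(groups))
    X = np.array([[g == u for u in uniq] for g in groups], dtype=float)
    y = np.asarray(correct, dtype=float) - np.mean(correct)   # center on global mean
    d = X.shape[1]
    beta = np.linalg.solve(X.T @ X + l2 * np.eye(d), X.T @ y)
    return {u: np.mean(correct) + b for u, b in zip(uniq, beta)}

groups  = ["a"] * 200 + ["b"] * 5                 # one tiny subgroup
correct = [1] * 150 + [0] * 50 + [1] * 2 + [0] * 3
print(subgroup_estimates(groups, correct))        # "b" shrinks toward the overall rate
```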
arXiv Detail & Related papers (2024-01-26T14:21:45Z)
- Revisiting the Evaluation of Image Synthesis with GANs [55.72247435112475]
This study presents an empirical investigation into the evaluation of synthesis performance, with generative adversarial networks (GANs) as a representative of generative models.
In particular, we make in-depth analyses of various factors, including how to represent a data point in the representation space, how to calculate a fair distance using selected samples, and how many instances to use from each set.
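As one concrete instance of the quantities being analyzed, here is a small sketch of a set-to-set distance of the FID family: the Fréchet distance between Gaussians fit to two feature sets. The choice of feature extractor and of sample size are exactly the factors the study varies.

```python
# Frechet distance between Gaussian fits of two (n, d) feature arrays.
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_fake):
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    c1 = np.cov(feats_real, rowvar=False)
    c2 = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(c1 @ c2).real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(c1 + c2 - 2 * covmean))

rng = np.random.default_rng(2)
real = rng.normal(0.0, 1.0, size=(500, 16))
fake = rng.normal(0.3, 1.1, size=(500, 16))  # slightly shifted "generated" features
print(f"FD: {frechet_distance(real, fake):.3f}")
```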
arXiv Detail & Related papers (2023-04-04T17:54:32Z)
- Equivariance Allows Handling Multiple Nuisance Variables When Analyzing Pooled Neuroimaging Datasets [53.34152466646884]
In this paper, we show how combining recent results on equivariant representation learning, instantiated on structured spaces, with a simple use of classical results on causal inference provides an effective practical solution.
We demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
arXiv Detail & Related papers (2022-03-29T04:54:06Z)
- Adaptive Sampling Strategies to Construct Equitable Training Datasets [0.7036032466145111]
In domains ranging from computer vision to natural language processing, machine learning models have been shown to exhibit stark disparities.
One factor contributing to these performance gaps is a lack of representation in the data the models are trained on.
We formalize the problem of creating equitable training datasets, and propose a statistical framework for addressing this problem.
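The summary does not specify the framework, so the following is only a toy sketch of one adaptive strategy consistent with it: at each step, draw the next example from the group with the fewest examples collected so far.

```python
# Greedy group-balancing sampler (illustrative, not the paper's framework).
import random
from collections import Counter

def adaptive_sample(pool, n):
    """pool: list of (group, example); returns n examples balancing groups."""
    by_group = {}
    for g, x in pool:
        by_group.setdefault(g, []).append(x)
    counts, chosen = Counter(), []
    for _ in range(n):
        g = min((g for g in by_group if by_group[g]), key=lambda g: counts[g])
        chosen.append((g, by_group[g].pop()))
        counts[g] += 1
    return chosen

pool = [("majority", i) for i in range(90)] + [("minority", i) for i in range(10)]
random.shuffle(pool)
print(Counter(g for g, _ in adaptive_sample(pool, 20)))  # roughly balanced groups
```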
arXiv Detail & Related papers (2022-01-31T19:19:30Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains, such as medicine, information retrieval, cybersecurity, and social media, the datasets used for inducing classification models often have an unequal distribution of instances across classes.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
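A minimal sketch of one of the two named strategies, random oversampling, which duplicates minority-class rows until all classes are equally represented (random undersampling would instead drop majority-class rows):

```python
# Random oversampling: duplicate minority-class rows to balance classes.
import random

def random_oversample(X, y):
    by_cls = {}
    for xi, yi in zip(X, y):
        by_cls.setdefault(yi, []).append(xi)
    target = max(len(v) for v in by_cls.values())
    Xb, yb = [], []
    for cls, rows in by_cls.items():
        rows = rows + random.choices(rows, k=target - len(rows))
        Xb += rows
        yb += [cls] * target
    return Xb, yb

X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2                      # imbalanced: 8 vs 2
Xb, yb = random_oversample(X, y)
print(sum(c == 0 for c in yb), sum(c == 1 for c in yb))  # 8 8
```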
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- Interactive Dimensionality Reduction for Comparative Analysis [28.52130400665133]
We introduce an interactive DR framework where we integrate our new DR method, called ULCA, with an interactive visual interface.
ULCA unifies two DR schemes, discriminant analysis and contrastive learning, to support various comparative analysis tasks.
We develop an optimization algorithm that enables analysts to interactively refine ULCA results.
arXiv Detail & Related papers (2021-06-29T15:05:36Z)
- Capturing patterns of variation unique to a specific dataset [68.8204255655161]
We propose a tuning-free method that identifies low-dimensional representations of a target dataset relative to one or more comparison datasets.
We show in several experiments that the proposed method (UCA) with a single background dataset achieves results similar to those of cPCA with various tuning parameters.
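For context, cPCA (the tuned baseline mentioned above) finds directions that maximize target-set variance while suppressing background variance, via the top eigenvectors of C_target - alpha * C_background; a compact numpy sketch:

```python
# Contrastive PCA: top eigenvectors of (C_target - alpha * C_background).
import numpy as np

def cpca(target, background, alpha=1.0, k=2):
    """Return the top-k contrastive directions as a (d, k) matrix."""
    c_t = np.cov(target, rowvar=False)
    c_b = np.cov(background, rowvar=False)
    vals, vecs = np.linalg.eigh(c_t - alpha * c_b)  # eigh: ascending eigenvalues
    return vecs[:, np.argsort(vals)[::-1][:k]]

rng = np.random.default_rng(3)
background = rng.normal(size=(500, 10))
target = rng.normal(size=(500, 10))
target[:, 0] += rng.choice([-3, 3], size=500)   # structure unique to the target set
W = cpca(target, background, alpha=1.0)
print(np.abs(W[:, 0]).argmax())                 # top direction loads on feature 0
```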
arXiv Detail & Related papers (2021-04-16T15:07:32Z)
- Comparative analysis of word embeddings in assessing semantic similarity of complex sentences [8.873705500708196]
We study the sentences in existing benchmark datasets and analyze the sensitivity of various word embeddings with respect to the complexity of the sentences.
The results show that increasing sentence complexity has a significant impact on the performance of the embedding models.
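A toy sketch of the kind of measurement such a study involves: score sentence similarity as the cosine between mean-pooled word vectors. The embedding table here is random for self-containment; a real analysis would load pretrained embeddings such as GloVe or word2vec.

```python
# Cosine similarity between mean-pooled word-vector sentence representations.
import numpy as np

rng = np.random.default_rng(4)
VOCAB = {w: rng.normal(size=50) for w in
         "the cat sat on a mat dog although it was raining".split()}

def sentence_vec(sentence):
    vecs = [VOCAB[w] for w in sentence.lower().split() if w in VOCAB]
    return np.mean(vecs, axis=0)

def similarity(s1, s2):
    v1, v2 = sentence_vec(s1), sentence_vec(s2)
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

print(similarity("the cat sat on a mat", "a dog sat on the mat"))
```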
arXiv Detail & Related papers (2020-10-23T19:55:11Z)
- CDEvalSumm: An Empirical Study of Cross-Dataset Evaluation for Neural Summarization Systems [121.78477833009671]
We investigate the performance of different summarization models under a cross-dataset setting.
A comprehensive study of 11 representative summarization systems on 5 datasets from different domains reveals the effects of model architecture and generation strategy.
arXiv Detail & Related papers (2020-10-11T02:19:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the information and is not responsible for any consequences of its use.