CLARITY -- Comparing heterogeneous data using dissimiLARITY
- URL: http://arxiv.org/abs/2006.00077v2
- Date: Thu, 2 Dec 2021 11:36:38 GMT
- Title: CLARITY -- Comparing heterogeneous data using dissimiLARITY
- Authors: Daniel J. Lawson, Vinesh Solanki, Igor Yanovich, Johannes Dellert,
Damian Ruck and Phillip Endicott
- Abstract summary: Many scientific questions can be phrased around whether the (dis)similarities between entities are conserved across such different data.
Our method, CLARITY, quantifies consistency across datasets, identifies where inconsistencies arise, and aids in their interpretation.
- Score: 0.39146761527401414
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Integrating datasets from different disciplines is hard because the data are
often qualitatively different in meaning, scale, and reliability. When two
datasets describe the same entities, many scientific questions can be phrased
around whether the (dis)similarities between entities are conserved across such
different data. Our method, CLARITY, quantifies consistency across datasets,
identifies where inconsistencies arise, and aids in their interpretation. We
illustrate this using three diverse comparisons: gene methylation vs
expression, evolution of language sounds vs word use, and country-level
economic metrics vs cultural beliefs. The non-parametric approach is robust to
noise and differences in scaling, and makes only weak assumptions about how the
data were generated. It operates by decomposing similarities into two
components: a `structural' component analogous to a clustering, and an
underlying `relationship' between those structures. This allows a `structural
comparison' between two similarity matrices using their predictability from
`structure'. Significance is assessed with the help of re-sampling appropriate
for each dataset. The software, CLARITY, is available as an R package from
https://github.com/danjlawson/CLARITY.
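The decomposition described in the abstract can be illustrated in miniature: learn a coarse `structure' over entities (here a fixed group assignment, a crude stand-in for the paper's clustering-like component), fit a `relationship' matrix of between-group average similarities, and measure how much of a similarity matrix that structure explains. This is an illustrative sketch only, not the algorithm or API of the CLARITY R package; all function names are hypothetical.

```python
# Minimal sketch of a CLARITY-style structure/relationship decomposition.
# Illustrative only: the real package learns the structure from the data
# rather than taking a fixed group assignment as done here.

def block_model(Y, groups):
    """Fit the 'relationship': mean similarity between each pair of groups."""
    k = max(groups) + 1
    sums = [[0.0] * k for _ in range(k)]
    counts = [[0] * k for _ in range(k)]
    n = len(Y)
    for i in range(n):
        for j in range(n):
            a, b = groups[i], groups[j]
            sums[a][b] += Y[i][j]
            counts[a][b] += 1
    return [[sums[a][b] / counts[a][b] if counts[a][b] else 0.0
             for b in range(k)] for a in range(k)]

def predict(groups, R):
    """Reconstruct a similarity matrix from structure (groups) + relationship R."""
    n = len(groups)
    return [[R[groups[i]][groups[j]] for j in range(n)] for i in range(n)]

def residual(Y, Yhat):
    """Total squared residual: where the structure fails to explain Y."""
    n = len(Y)
    return sum((Y[i][j] - Yhat[i][j]) ** 2 for i in range(n) for j in range(n))

# Two toy similarity matrices over the same four entities. They differ in
# scale but share the same 2-group structure, so the same structure should
# explain both once the relationship is refit per dataset.
groups = [0, 0, 1, 1]
Y1 = [[1.0, 0.9, 0.1, 0.2],
      [0.9, 1.0, 0.2, 0.1],
      [0.1, 0.2, 1.0, 0.8],
      [0.2, 0.1, 0.8, 1.0]]
Y2 = [[1.0, 0.8, 0.3, 0.2],
      [0.8, 1.0, 0.2, 0.3],
      [0.3, 0.2, 1.0, 0.9],
      [0.2, 0.3, 0.9, 1.0]]

R1 = block_model(Y1, groups)
fit_self = residual(Y1, predict(groups, R1))    # structure explains dataset 1
R2 = block_model(Y2, groups)
fit_cross = residual(Y2, predict(groups, R2))   # same structure, dataset 2
print(fit_self, fit_cross)
```

A small residual on both matrices indicates the group structure is conserved across the two datasets; large per-entry residuals would flag the specific entities where the datasets disagree.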
Related papers
- Mitigating Noisy Correspondence by Geometrical Structure Consistency Learning [43.75697355156703]
Noisy correspondence is prevalent on human-annotated or web-crawled datasets.
We introduce a Geometrical Structure Consistency (GSC) method to infer the true correspondence.
arXiv Detail & Related papers (2024-05-27T09:42:52Z)
- A graph-structured distance for heterogeneous datasets with meta variables [1.677718351174347]
Heterogeneous datasets emerge in various machine learning or optimization applications.
The first main contribution is a modeling graph-structured framework that generalizes state-of-the-art hierarchical, tree-structured, or variable-size frameworks.
The second main contribution is the graph-structured distance that compares extended points with any combination of included and excluded variables.
arXiv Detail & Related papers (2024-05-20T23:11:03Z)
- How Do Transformers Learn Topic Structure: Towards a Mechanistic Understanding [56.222097640468306]
We provide mechanistic understanding of how transformers learn "semantic structure"
We show, through a combination of mathematical analysis and experiments on Wikipedia data, that the embedding layer and the self-attention layer encode the topical structure.
arXiv Detail & Related papers (2023-03-07T21:42:17Z)
- Relational Sentence Embedding for Flexible Semantic Matching [86.21393054423355]
We present Relational Sentence Embedding (RSE), a new paradigm to further explore the potential of sentence embeddings.
RSE is effective and flexible in modeling sentence relations and outperforms a series of state-of-the-art embedding methods.
arXiv Detail & Related papers (2022-12-17T05:25:17Z)
- Semantic Answer Similarity for Evaluating Question Answering Models [2.279676596857721]
SAS is a cross-encoder-based metric for the estimation of semantic answer similarity.
We show that semantic similarity metrics based on recent transformer models correlate much better with human judgment than traditional lexical similarity metrics.
arXiv Detail & Related papers (2021-08-13T09:12:27Z)
- Mitigating Generation Shifts for Generalized Zero-Shot Learning [52.98182124310114]
Generalized Zero-Shot Learning (GZSL) is the task of leveraging semantic information (e.g., attributes) to recognize both seen and unseen samples, where unseen classes are not observable during training.
We propose a novel Generation Shifts Mitigating Flow (GSMFlow) framework for synthesizing unseen data efficiently and effectively.
Experimental results demonstrate that GSMFlow achieves state-of-the-art recognition performance in both conventional and generalized zero-shot settings.
arXiv Detail & Related papers (2021-07-07T11:43:59Z)
- Commutative Lie Group VAE for Disentanglement Learning [96.32813624341833]
We view disentanglement learning as discovering an underlying structure that equivariantly reflects the factorized variations shown in data.
A simple model named Commutative Lie Group VAE is introduced to realize the group-based disentanglement learning.
Experiments show that our model can effectively learn disentangled representations without supervision, and can achieve state-of-the-art performance without extra constraints.
arXiv Detail & Related papers (2021-06-07T07:03:14Z)
- Interpretable Multi-dataset Evaluation for Named Entity Recognition [110.64368106131062]
We present a general methodology for interpretable evaluation for the named entity recognition (NER) task.
The proposed evaluation method enables us to interpret the differences in models and datasets, as well as the interplay between them.
By making our analysis tool available, we make it easy for future researchers to run similar analyses and drive progress in this area.
arXiv Detail & Related papers (2020-11-13T10:53:27Z)
- Comparative analysis of word embeddings in assessing semantic similarity of complex sentences [8.873705500708196]
We study the sentences in existing benchmark datasets and analyze the sensitivity of various word embeddings with respect to the complexity of the sentences.
The results show the increase in complexity of the sentences has a significant impact on the performance of the embedding models.
arXiv Detail & Related papers (2020-10-23T19:55:11Z)
- A Comparative Study on Structural and Semantic Properties of Sentence Embeddings [77.34726150561087]
We propose a set of experiments using a widely-used large-scale data set for relation extraction.
We show that different embedding spaces have different degrees of strength for the structural and semantic properties.
These results provide useful information for developing embedding-based relation extraction methods.
arXiv Detail & Related papers (2020-09-23T15:45:32Z)
- A Bayesian Hierarchical Score for Structure Learning from Related Data Sets [0.7240563090941907]
We propose a new Bayesian Dirichlet score, which we call Bayesian Hierarchical Dirichlet (BHD).
BHD is based on a hierarchical model that pools information across data sets to learn a single encompassing network structure.
We find that BHD outperforms the Bayesian Dirichlet equivalent uniform (BDeu) score in terms of reconstruction accuracy as measured by the Structural Hamming distance.
arXiv Detail & Related papers (2020-08-04T16:41:05Z)
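The Structural Hamming distance used as the reconstruction-accuracy measure in the BHD entry above can be sketched as follows. This is a simple illustrative implementation (not taken from that paper's code) counting, between two directed graphs, the edges that are missing from one graph plus the edges present in both but with opposite orientation.

```python
def shd(edges_a, edges_b):
    """Structural Hamming distance between two DAGs given as sets of
    directed edges (u, v). An edge present in only one graph (in either
    orientation) counts once; a reversed edge counts once."""
    a, b = set(edges_a), set(edges_b)
    dist = 0
    seen = set()
    for u, v in a | b:
        if (u, v) in seen or (v, u) in seen:
            continue  # already counted this undirected edge
        seen.add((u, v))
        in_a = (u, v) in a or (v, u) in a
        in_b = (u, v) in b or (v, u) in b
        if in_a != in_b:
            dist += 1  # edge missing from one graph
        elif ((u, v) in a) != ((u, v) in b):
            dist += 1  # edge in both graphs, opposite orientation
    return dist

true_dag = {("A", "B"), ("B", "C"), ("A", "C")}
learned = {("A", "B"), ("C", "B")}  # one reversed edge, one missing edge
print(shd(true_dag, learned))  # distance is 2
```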
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.