CLARITY -- Comparing heterogeneous data using dissimiLARITY
- URL: http://arxiv.org/abs/2006.00077v2
- Date: Thu, 2 Dec 2021 11:36:38 GMT
- Title: CLARITY -- Comparing heterogeneous data using dissimiLARITY
- Authors: Daniel J. Lawson, Vinesh Solanki, Igor Yanovich, Johannes Dellert,
Damian Ruck and Phillip Endicott
- Abstract summary: Many scientific questions can be phrased around whether the (dis)similarities between entities are conserved across such different data.
Our method, CLARITY, quantifies consistency across datasets, identifies where inconsistencies arise, and aids in their interpretation.
- Score: 0.39146761527401414
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Integrating datasets from different disciplines is hard because the data are
often qualitatively different in meaning, scale, and reliability. When two
datasets describe the same entities, many scientific questions can be phrased
around whether the (dis)similarities between entities are conserved across such
different data. Our method, CLARITY, quantifies consistency across datasets,
identifies where inconsistencies arise, and aids in their interpretation. We
illustrate this using three diverse comparisons: gene methylation vs
expression, evolution of language sounds vs word use, and country-level
economic metrics vs cultural beliefs. The non-parametric approach is robust to
noise and differences in scaling, and makes only weak assumptions about how the
data were generated. It operates by decomposing similarities into two
components: a `structural' component analogous to a clustering, and an
underlying `relationship' between those structures. This allows a `structural
comparison' between two similarity matrices using their predictability from
`structure'. Significance is assessed with the help of re-sampling appropriate
for each dataset. The software, CLARITY, is available as an R package from
https://github.com/danjlawson/CLARITY.
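The abstract's core idea -- learning a `structural' component from one similarity matrix and asking how well it predicts a second -- can be sketched numerically. The following is a hedged illustration in Python (the released software is an R package); the eigendecomposition-based notion of structure, the toy matrices, and the `structural_residuals` helper are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 30
# Toy data: two similarity matrices over the same n entities.
base = rng.normal(size=(n, 5))
X = base @ base.T                      # "reference" similarities
Y = X + 0.1 * rng.normal(size=(n, n))  # related matrix plus noise
Y = (Y + Y.T) / 2                      # keep it symmetric

def structural_residuals(X, Y, k):
    """Predict Y from the top-k eigenstructure of X; return the residual norm."""
    vals, vecs = np.linalg.eigh(X)
    U = vecs[:, np.argsort(-np.abs(vals))[:k]]  # top-k "structure" of X
    # Least-squares "relationship" between X's structure and Y:
    Y_hat = U @ (U.T @ Y @ U) @ U.T
    return np.linalg.norm(Y - Y_hat)

# If Y is structurally consistent with X, residuals shrink as the
# structural complexity k grows; persistent residuals flag entities
# whose (dis)similarities are not conserved across the two datasets.
res = [structural_residuals(X, Y, k) for k in (1, 2, 5, 10)]
print(res)
```

In this sketch the residual is non-increasing in k because the projection subspaces are nested; CLARITY additionally assesses significance by resampling appropriate to each dataset, which is omitted here.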
Related papers
- Measuring similarity between embedding spaces using induced neighborhood graphs [10.056989400384772]
We propose a metric to evaluate the similarity between paired item representations.
Our results show that accuracy in both analogy and zero-shot classification tasks correlates with the embedding similarity.
arXiv Detail & Related papers (2024-11-13T15:22:33Z)
- SEG: Seeds-Enhanced Iterative Refinement Graph Neural Network for Entity Alignment [13.487673375206276]
This paper presents a soft label propagation framework that integrates multi-source data and iterative seed enhancement.
A bidirectional weighted joint loss function is implemented, which reduces the distance between positive samples and differentially processes negative samples.
Our method outperforms existing semi-supervised approaches, as evidenced by superior results on multiple datasets.
arXiv Detail & Related papers (2024-10-28T04:50:46Z)
- Mitigating Noisy Correspondence by Geometrical Structure Consistency Learning [43.75697355156703]
Noisy correspondence is prevalent on human-annotated or web-crawled datasets.
We introduce a Geometrical Structure Consistency (GSC) method to infer the true correspondence.
arXiv Detail & Related papers (2024-05-27T09:42:52Z)
- How Do Transformers Learn Topic Structure: Towards a Mechanistic Understanding [56.222097640468306]
We provide a mechanistic understanding of how transformers learn "semantic structure".
We show, through a combination of mathematical analysis and experiments on Wikipedia data, that the embedding layer and the self-attention layer encode the topical structure.
arXiv Detail & Related papers (2023-03-07T21:42:17Z)
- Relational Sentence Embedding for Flexible Semantic Matching [86.21393054423355]
We present Relational Sentence Embedding (RSE), a new paradigm for further discovering the potential of sentence embeddings.
RSE is effective and flexible in modeling sentence relations and outperforms a series of state-of-the-art embedding methods.
arXiv Detail & Related papers (2022-12-17T05:25:17Z)
- Mitigating Generation Shifts for Generalized Zero-Shot Learning [52.98182124310114]
Generalized Zero-Shot Learning (GZSL) is the task of leveraging semantic information (e.g., attributes) to recognize the seen and unseen samples, where unseen classes are not observable during training.
We propose a novel Generation Shifts Mitigating Flow framework for learning unseen data synthesis efficiently and effectively.
Experimental results demonstrate that GSMFlow achieves state-of-the-art recognition performance in both conventional and generalized zero-shot settings.
arXiv Detail & Related papers (2021-07-07T11:43:59Z)
- Commutative Lie Group VAE for Disentanglement Learning [96.32813624341833]
We view disentanglement learning as discovering an underlying structure that equivariantly reflects the factorized variations shown in data.
A simple model named Commutative Lie Group VAE is introduced to realize the group-based disentanglement learning.
Experiments show that our model can effectively learn disentangled representations without supervision, and can achieve state-of-the-art performance without extra constraints.
arXiv Detail & Related papers (2021-06-07T07:03:14Z)
- Interpretable Multi-dataset Evaluation for Named Entity Recognition [110.64368106131062]
We present a general methodology for interpretable evaluation for the named entity recognition (NER) task.
The proposed evaluation method enables us to interpret the differences in models and datasets, as well as the interplay between them.
By making our analysis tool available, we make it easy for future researchers to run similar analyses and drive progress in this area.
arXiv Detail & Related papers (2020-11-13T10:53:27Z)
- Comparative analysis of word embeddings in assessing semantic similarity of complex sentences [8.873705500708196]
We study the sentences in existing benchmark datasets and analyze the sensitivity of various word embeddings with respect to the complexity of the sentences.
The results show the increase in complexity of the sentences has a significant impact on the performance of the embedding models.
arXiv Detail & Related papers (2020-10-23T19:55:11Z)
- A Comparative Study on Structural and Semantic Properties of Sentence Embeddings [77.34726150561087]
We propose a set of experiments using a widely-used large-scale data set for relation extraction.
We show that different embedding spaces have different degrees of strength for the structural and semantic properties.
These results provide useful information for developing embedding-based relation extraction methods.
arXiv Detail & Related papers (2020-09-23T15:45:32Z)
- A Bayesian Hierarchical Score for Structure Learning from Related Data Sets [0.7240563090941907]
We propose a new Bayesian Dirichlet score, which we call Bayesian Hierarchical Dirichlet (BHD).
BHD is based on a hierarchical model that pools information across data sets to learn a single encompassing network structure.
We find that BHD outperforms the Bayesian Dirichlet equivalent uniform (BDeu) score in terms of reconstruction accuracy as measured by the Structural Hamming distance.
arXiv Detail & Related papers (2020-08-04T16:41:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.