Neural sentence embedding models for semantic similarity estimation in
the biomedical domain
- URL: http://arxiv.org/abs/2110.15708v1
- Date: Fri, 1 Oct 2021 13:27:44 GMT
- Title: Neural sentence embedding models for semantic similarity estimation in
the biomedical domain
- Authors: Kathrin Blagec, Hong Xu, Asan Agibetov, Matthias Samwald
- Abstract summary: We trained different neural embedding models on 1.7 million articles from the PubMed Open Access dataset.
We evaluated them based on a biomedical benchmark set containing 100 sentence pairs annotated by human experts.
- Score: 6.325814141416726
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: BACKGROUND: In this study, we investigated the efficacy of current
state-of-the-art neural sentence embedding models for semantic similarity
estimation of sentences from biomedical literature. We trained different neural
embedding models on 1.7 million articles from the PubMed Open Access dataset,
and evaluated them based on a biomedical benchmark set containing 100 sentence
pairs annotated by human experts and a smaller contradiction subset derived
from the original benchmark set.
RESULTS: With a Pearson correlation of 0.819, our best unsupervised model
based on the Paragraph Vector Distributed Memory algorithm outperforms previous
state-of-the-art results achieved on the BIOSSES biomedical benchmark set.
Moreover, our proposed supervised model that combines different string-based
similarity metrics with a neural embedding model surpasses previous
ontology-dependent supervised state-of-the-art approaches in terms of Pearson's
r (r=0.871) on the biomedical benchmark set. In contrast to the promising
results for the original benchmark, we found our best models' performance on
the smaller contradiction subset to be poor.
CONCLUSIONS: In this study we highlighted the value of neural network-based
models for semantic similarity estimation in the biomedical domain by showing
that they can keep up with and even surpass previous state-of-the-art
approaches for semantic similarity estimation that depend on the availability
of laboriously curated ontologies when evaluated on a biomedical benchmark set.
Capturing contradictions and negations in biomedical sentences, however,
emerged as an essential area for further work.
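The evaluation described in the abstract (scoring each sentence pair with a model and comparing against human annotations via Pearson's r) can be sketched as follows. This is a minimal illustration, not the authors' code: the three-dimensional vectors stand in for real sentence embeddings, the human scores are made up, and the helper names `cosine_similarity` and `pearson_r` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy placeholder embeddings for three sentence pairs (assumption: a real
# setup would obtain these vectors from a trained sentence embedding model).
pairs = [
    ([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]),  # identical direction
    ([1.0, 0.0, 0.0], [1.0, 1.0, 0.0]),  # partially overlapping
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),  # orthogonal
]

# Made-up human similarity ratings for the same three pairs.
human_scores = [4.0, 2.0, 0.0]

# Model score per pair, then agreement with the human ratings.
predicted = [cosine_similarity(u, v) for u, v in pairs]
r = pearson_r(predicted, human_scores)
print(f"Pearson r between model and human scores: {r:.3f}")
```

Pearson's r measures only linear agreement between the two score lists, so it does not depend on the scale of either list. High agreement on a general benchmark also does not guarantee that the model separates contradictory sentences, which is the weakness the abstract reports.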
Related papers
- Benchmarking Transcriptomics Foundation Models for Perturbation Analysis : one PCA still rules them all [1.507700065820919]
Recent advancements in transcriptomics sequencing provide new opportunities to uncover valuable insights.
No benchmark has been made to robustly evaluate the effectiveness of these rising models for perturbation analysis.
This article presents a novel biologically motivated evaluation framework and a hierarchy of perturbation analysis tasks.
arXiv Detail & Related papers (2024-10-17T18:27:51Z)
- Seeing Unseen: Discover Novel Biomedical Concepts via
Geometry-Constrained Probabilistic Modeling [53.7117640028211]
We present a geometry-constrained probabilistic modeling treatment to resolve the identified issues.
We incorporate a suite of critical geometric properties to impose proper constraints on the layout of constructed embedding space.
A spectral graph-theoretic method is devised to estimate the number of potential novel classes.
arXiv Detail & Related papers (2024-03-02T00:56:05Z)
- Counterfactual Data Augmentation with Contrastive Learning [27.28511396131235]
We introduce a model-agnostic data augmentation method that imputes the counterfactual outcomes for a selected subset of individuals.
We use contrastive learning to learn a representation space and a similarity measure such that in the learned representation space close individuals identified by the learned similarity measure have similar potential outcomes.
This property ensures reliable imputation of counterfactual outcomes for the individuals with close neighbors from the alternative treatment group.
arXiv Detail & Related papers (2023-11-07T00:36:51Z)
- Simulation-based Inference for Cardiovascular Models [57.92535897767929]
We use simulation-based inference to solve the inverse problem of mapping waveforms back to plausible physiological parameters.
We perform an in-silico uncertainty analysis of five biomarkers of clinical interest.
We study the gap between in-vivo and in-silico with the MIMIC-III waveform database.
arXiv Detail & Related papers (2023-07-26T02:34:57Z)
- A Generative Modeling Framework for Inferring Families of Biomechanical
Constitutive Laws in Data-Sparse Regimes [0.15658704610960567]
We propose a novel approach to efficiently infer families of relationships in data-sparse regimes.
Inspired by the concept of functional priors, we develop a generative adversarial network (GAN) that incorporates a neural operator as the generator and a fully-connected network as the adversarial discriminator.
arXiv Detail & Related papers (2023-05-04T22:07:27Z)
- MAUVE Scores for Generative Models: Theory and Practice [95.86006777961182]
We present MAUVE, a family of comparison measures between pairs of distributions such as those encountered in the generative modeling of text or images.
We find that MAUVE can quantify the gaps between the distributions of human-written text and those of modern neural language models.
We demonstrate in the vision domain that MAUVE can identify known properties of generated images on par with or better than existing metrics.
arXiv Detail & Related papers (2022-12-30T07:37:40Z)
- Are Neural Topic Models Broken? [81.15470302729638]
We study the relationship between automated and human evaluation of topic models.
We find that neural topic models fare worse in both respects compared to an established classical method.
arXiv Detail & Related papers (2022-10-28T14:38:50Z)
- A comprehensive comparative evaluation and analysis of Distributional
Semantic Models [61.41800660636555]
We perform a comprehensive evaluation of type distributional vectors, either produced by static DSMs or obtained by averaging the contextualized vectors generated by BERT.
The results show that the alleged superiority of predict based models is more apparent than real, and surely not ubiquitous.
We borrow from cognitive neuroscience the methodology of Representational Similarity Analysis (RSA) to inspect the semantic spaces generated by distributional models.
arXiv Detail & Related papers (2021-05-20T15:18:06Z)
- Semi-supervised Medical Image Classification with Relation-driven
Self-ensembling Model [71.80319052891817]
We present a relation-driven semi-supervised framework for medical image classification.
It exploits the unlabeled data by encouraging the prediction consistency of given input under perturbations.
Our method outperforms many state-of-the-art semi-supervised learning methods on both single-label and multi-label image classification scenarios.
arXiv Detail & Related papers (2020-05-15T06:57:54Z)
- Multi-Ontology Refined Embeddings (MORE): A Hybrid Multi-Ontology and
Corpus-based Semantic Representation for Biomedical Concepts [0.5812284760539712]
This paper introduces Multi-Ontology Embeddings (MORE), a framework for incorporating domain knowledge from multiple ontologies into a distributional semantic model.
We use the RadCore and MIMIC-III free-text datasets for the corpus-based component of MORE.
For the ontology-based part, we use the Medical Subject Headings (MeSH) ontology and three state-of-the-art ontology-based similarity measures.
arXiv Detail & Related papers (2020-04-14T14:38:41Z)