A reproducible experimental survey on biomedical sentence similarity: a
string-based method sets the state of the art
- URL: http://arxiv.org/abs/2205.08740v1
- Date: Wed, 18 May 2022 06:20:42 GMT
- Title: A reproducible experimental survey on biomedical sentence similarity: a
string-based method sets the state of the art
- Authors: Alicia Lara-Clares and Juan J. Lastra-Díaz and Ana Garcia-Serrano
- Abstract summary: This report introduces the largest, and for the first time, reproducible experimental survey on biomedical sentence similarity.
Our aim is to elucidate the state of the art of the problem and to solve some reproducibility problems preventing the evaluation of most current methods.
Our experiments confirm that the pre-processing stages, and the choice of the NER tool, have a significant impact on the performance of the sentence similarity methods.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This registered report introduces the largest, and for the first time,
reproducible experimental survey on biomedical sentence similarity with the
following aims: (1) to elucidate the state of the art of the problem; (2) to
solve some reproducibility problems preventing the evaluation of most
current methods; (3) to evaluate several unexplored sentence similarity
methods; (4) to evaluate an unexplored benchmark, called
Corpus-Transcriptional-Regulation; (5) to carry out a study on the impact of
the pre-processing stages and Named Entity Recognition (NER) tools on the
performance of the sentence similarity methods; and finally, (6) to bridge the
lack of reproducibility resources for methods and experiments in this line of
research. Our experimental survey is based on a single software platform that
is provided with a detailed reproducibility protocol and dataset as
supplementary material to allow the exact replication of all our experiments.
In addition, we introduce a new aggregated string-based sentence similarity
method, called LiBlock, together with eight variants of current ontology-based
methods and a new pre-trained word embedding model trained on the full-text
articles in the PMC-BioC corpus. Our experiments show that our novel
string-based measure sets the new state of the art on the sentence similarity
task in the biomedical domain and significantly outperforms all the methods
evaluated herein, except one ontology-based method. Likewise, our experiments
confirm that the pre-processing stages, and the choice of the NER tool, have a
significant impact on the performance of the sentence similarity methods. We
also detail some drawbacks and limitations of current methods, and warn of the
need to refine the current benchmarks. Finally, a notable finding is that
our new string-based method significantly outperforms all state-of-the-art
Machine Learning models evaluated herein.
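To make the idea of a string-based sentence similarity measure concrete, here is a minimal sketch that scores two sentences from the Block (L1) distance between their token-count vectors. It is not the LiBlock method described in the abstract, whose definition is not given here; the function names `tokenize` and `block_similarity`, the whitespace tokenizer, and the `lowercase` switch (standing in for one pre-processing stage) are all assumptions made for illustration.

```python
from collections import Counter


def tokenize(sentence, lowercase=True):
    """Split on whitespace; lowercasing stands in for a pre-processing stage (assumption)."""
    if lowercase:
        sentence = sentence.lower()
    return sentence.split()


def block_similarity(s1, s2, lowercase=True):
    """Similarity in [0, 1] derived from the Block (L1) distance between token counts."""
    c1 = Counter(tokenize(s1, lowercase))
    c2 = Counter(tokenize(s2, lowercase))
    total = sum(c1.values()) + sum(c2.values())
    if total == 0:
        return 1.0  # two empty sentences are treated as identical
    # L1 distance over the union of tokens; a missing token counts as zero
    distance = sum(abs(c1[t] - c2[t]) for t in set(c1) | set(c2))
    return 1.0 - distance / total


a = "BRCA1 regulates DNA repair"
b = "brca1 regulates dna repair pathways"
print(block_similarity(a, b))                   # with lowercasing
print(block_similarity(a, b, lowercase=False))  # without lowercasing
```

In this toy pair, the score changes substantially depending on whether lowercasing is applied, which illustrates in a small way why pre-processing choices can affect the results of a string-based measure.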
Related papers
- Leak Proof CMap; a framework for training and evaluation of cell line agnostic L1000 similarity methods [0.0]
The Connectivity Map (CMap) is a large publicly available database of cellular transcriptomic responses to chemical and genetic perturbations.
We have developed 'Leak Proof CMap' and exemplified its application to a set of common transcriptomic and generic phenotypic similarity methods.
Benchmarking in three critical performance areas (compactness, distinctness, and uniqueness) is conducted using carefully crafted data splits.
This enables testing of models with unseen samples akin to exploring treatments with novel modes of action in novel patient derived cell lines.
arXiv Detail & Related papers (2024-04-29T04:11:39Z)
- Seeing Unseen: Discover Novel Biomedical Concepts via Geometry-Constrained Probabilistic Modeling [53.7117640028211]
We present a geometry-constrained probabilistic modeling treatment to resolve the identified issues.
We incorporate a suite of critical geometric properties to impose proper constraints on the layout of constructed embedding space.
A spectral graph-theoretic method is devised to estimate the number of potential novel classes.
arXiv Detail & Related papers (2024-03-02T00:56:05Z)
- Experimental Analysis of Large-scale Learnable Vector Storage Compression [42.52474894105165]
Learnable embedding vector is one of the most important applications in machine learning.
The high dimensionality of sparse data in recommendation tasks and the huge volume of corpus in retrieval-related tasks lead to a large memory consumption of the embedding table.
Recent research has proposed various methods to compress the embeddings at the cost of a slight decrease in model quality or the introduction of other overheads.
arXiv Detail & Related papers (2023-11-27T07:11:47Z)
- Benchmarking Bayesian Causal Discovery Methods for Downstream Treatment Effect Estimation [137.3520153445413]
A notable gap exists in the evaluation of causal discovery methods, where insufficient emphasis is placed on downstream inference.
We evaluate seven established baseline causal discovery methods including a newly proposed method based on GFlowNets.
The results of our study demonstrate that some of the algorithms studied are able to effectively capture a wide range of useful and diverse ATE modes.
arXiv Detail & Related papers (2023-07-11T02:58:10Z)
- On the role of benchmarking data sets and simulations in method comparison studies [0.0]
This paper investigates differences and similarities between simulation studies and benchmarking studies.
We borrow ideas from different contexts such as mixed methods research and Clinical Scenario Evaluation.
arXiv Detail & Related papers (2022-08-02T13:47:53Z)
- On Modality Bias Recognition and Reduction [70.69194431713825]
We study the modality bias problem in the context of multi-modal classification.
We propose a plug-and-play loss function method, whereby the feature space for each label is adaptively learned.
Our method yields remarkable performance improvements compared with the baselines.
arXiv Detail & Related papers (2022-02-25T13:47:09Z)
- Active Learning-Based Multistage Sequential Decision-Making Model with Application on Common Bile Duct Stone Evaluation [8.296821186083974]
Multistage sequential decision-making scenarios are commonly seen in the healthcare diagnosis process.
In this paper, an active learning-based method is developed to actively collect only the necessary patient data in a sequential manner.
The effectiveness of the proposed method is validated in both a simulation study and a real case study.
arXiv Detail & Related papers (2022-01-13T06:42:12Z)
- Self-Normalized Importance Sampling for Neural Language Modeling [97.96857871187052]
In this work, we propose self-normalized importance sampling. Compared to our previous work, the criteria considered in this work are self-normalized and there is no need to further conduct a correction step.
We show that our proposed self-normalized importance sampling is competitive in both research-oriented and production-oriented automatic speech recognition tasks.
arXiv Detail & Related papers (2021-11-11T16:57:53Z)
- Neural sentence embedding models for semantic similarity estimation in the biomedical domain [6.325814141416726]
We trained different neural embedding models on 1.7 million articles from the PubMed Open Access dataset.
We evaluated them based on a biomedical benchmark set containing 100 sentence pairs annotated by human experts.
arXiv Detail & Related papers (2021-10-01T13:27:44Z)
- On Sampling-Based Training Criteria for Neural Language Modeling [97.35284042981675]
We consider Monte Carlo sampling, importance sampling, a novel method we call compensated partial summation, and noise contrastive estimation.
We show that all these sampling methods can perform equally well, as long as we correct for the intended class posterior probabilities.
Experimental results in language modeling and automatic speech recognition on Switchboard and LibriSpeech support our claim.
arXiv Detail & Related papers (2021-04-21T12:55:52Z)
- Generalization Bounds and Representation Learning for Estimation of Potential Outcomes and Causal Effects [61.03579766573421]
We study estimation of individual-level causal effects, such as a single patient's response to alternative medication.
We devise representation learning algorithms that minimize our bound, by regularizing the representation's induced treatment group distance.
We extend these algorithms to simultaneously learn a weighted representation to further reduce treatment group distances.
arXiv Detail & Related papers (2020-01-21T10:16:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.