Semantic similarity estimation for domain specific data using BERT and other techniques
- URL: http://arxiv.org/abs/2506.18602v1
- Date: Mon, 23 Jun 2025 13:03:59 GMT
- Title: Semantic similarity estimation for domain specific data using BERT and other techniques
- Authors: R. Prashanth
- Abstract summary: Estimation of semantic similarity is an important research problem in both natural language processing and natural language understanding. We use two question-pair datasets for the analysis: one is a domain-specific in-house dataset and the other is a public dataset. We observe that the BERT model gives much better performance than the other methods.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Estimation of semantic similarity is an important research problem in both natural language processing and natural language understanding, with tremendous applications in various downstream tasks such as question answering, semantic search, information retrieval, document clustering, word-sense disambiguation and machine translation. In this work, we carry out the estimation of semantic similarity using different state-of-the-art techniques, including USE (Universal Sentence Encoder), InferSent, and the more recent BERT (Bidirectional Encoder Representations from Transformers) models. We use two question-pair datasets for the analysis: a domain-specific in-house dataset and the public Quora Question Pairs dataset. We observe that the BERT model gives much better performance than the other methods, likely because of the fine-tuning procedure involved in its training process, which allows it to learn patterns specific to the training data used. This work demonstrates the applicability of BERT to domain-specific datasets, and we infer from the analysis that BERT is the best technique to use for domain-specific data.
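To make the setup concrete, here is a minimal sketch of scoring the semantic similarity of a question pair with BERT-based sentence embeddings. The checkpoint name, the example questions, and the use of the sentence-transformers library are illustrative assumptions; the paper does not specify its implementation details.

```python
# Minimal sketch: semantic similarity of a question pair via BERT
# sentence embeddings. The checkpoint and questions are assumptions
# for illustration, not details taken from the paper.
from sentence_transformers import SentenceTransformer, util

# Any BERT-based sentence encoder could stand in here.
model = SentenceTransformer("bert-base-nli-mean-tokens")

q1 = "How do I reset my account password?"
q2 = "What is the procedure to recover a forgotten password?"

# Encode both questions into fixed-size dense vectors.
emb1, emb2 = model.encode([q1, q2], convert_to_tensor=True)

# Cosine similarity lies in [-1, 1]; values near 1 suggest the
# two questions are near-duplicates.
score = util.cos_sim(emb1, emb2).item()
print(f"semantic similarity: {score:.3f}")
```

In the same spirit, the fine-tuning advantage the abstract points to would correspond to further training such an encoder on labeled question pairs from the target domain rather than using the pre-trained weights as-is.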
Related papers
- Seed-Guided Fine-Grained Entity Typing in Science and Engineering Domains [51.02035914828596]
We study the task of seed-guided fine-grained entity typing in science and engineering domains.
We propose SEType which first enriches the weak supervision by finding more entities for each seen type from an unlabeled corpus.
It then matches the enriched entities to unlabeled text to get pseudo-labeled samples and trains a textual entailment model that can make inferences for both seen and unseen types.
arXiv Detail & Related papers (2024-01-23T22:36:03Z) - Towards Efficient Methods in Medical Question Answering using Knowledge Graph Embeddings [3.944219308229571]
In Natural Language Processing (NLP), Machine Reading Comprehension (MRC) is the task of answering a question based on a given context. To handle questions in the medical domain, modern language models such as BioBERT, SciBERT and even ChatGPT are trained on vast amounts of in-domain medical corpora. We propose a resource-efficient approach for injecting domain knowledge into a model without relying on such domain-specific pre-training.
arXiv Detail & Related papers (2024-01-15T21:43:46Z) - A deep Natural Language Inference predictor without language-specific training data [44.26507854087991]
We present an NLP technique to tackle the problem of natural language inference (NLI) between pairs of sentences in a target language of choice, without a language-specific training dataset.
We exploit a generic translation dataset, manually translated, along with two instances of the same pre-trained model.
The model has been evaluated on the machine-translated Stanford NLI test set, the machine-translated Multi-Genre NLI test set, and the manually translated RTE3-ITA test set.
arXiv Detail & Related papers (2023-09-06T10:20:59Z) - Memorization of Named Entities in Fine-tuned BERT Models [2.7623977033962936]
We investigate the extent of named entity memorization in fine-tuned BERT models. We show that a fine-tuned BERT does not generate more named entities specific to the fine-tuning dataset than a BERT model that is pre-trained only.
arXiv Detail & Related papers (2022-12-07T16:20:50Z) - ASDOT: Any-Shot Data-to-Text Generation with Pretrained Language Models [82.63962107729994]
Any-Shot Data-to-Text (ASDOT) is a new approach flexibly applicable to diverse settings.
It consists of two steps, data disambiguation and sentence fusion.
Experimental results show that ASDOT consistently achieves significant improvement over baselines.
arXiv Detail & Related papers (2022-10-09T19:17:43Z) - An Exploratory Study on Utilising the Web of Linked Data for Product Data Mining [3.7376948366228175]
This work focuses on the e-commerce domain to explore methods of utilising structured data to create language resources that may be used for product classification and linking.
We process billions of structured data points in the form of RDF n-quads to create multi-million-word product-related corpora that are later used in three different ways for creating language resources.
Our evaluation on an extensive set of benchmarks shows word embeddings to be the most reliable and consistent method to improve the accuracy on both tasks.
arXiv Detail & Related papers (2021-09-03T09:58:36Z) - Pre-training Language Model Incorporating Domain-specific Heterogeneous Knowledge into A Unified Representation [49.89831914386982]
We propose a unified pre-trained language model (PLM) for all forms of text, including unstructured text, semi-structured text, and well-structured text.
Our approach outperforms the pre-training of plain text using only 1/4 of the data.
arXiv Detail & Related papers (2021-09-02T16:05:24Z) - TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning [53.32740707197856]
We present a new state-of-the-art unsupervised method based on pre-trained Transformers and Sequential Denoising Auto-Encoder (TSDAE).
It can achieve up to 93.1% of the performance of in-domain supervised approaches.
arXiv Detail & Related papers (2021-04-14T17:02:18Z) - Learning to Synthesize Data for Semantic Parsing [57.190817162674875]
We propose a generative model which models the composition of programs and maps a program to an utterance.
Due to the simplicity of PCFG and pre-trained BART, our generative model can be efficiently learned from existing data at hand.
We evaluate our method in both in-domain and out-of-domain settings of text-to-SQL parsing on the standard benchmarks of GeoQuery and Spider.
arXiv Detail & Related papers (2021-04-12T21:24:02Z) - Multi-Task Attentive Residual Networks for Argument Mining [14.62200869391189]
We propose a residual architecture that exploits attention, multi-task learning, and ensemble techniques.
We present an experimental evaluation on five different corpora of user-generated comments, scientific publications, and persuasive essays.
arXiv Detail & Related papers (2021-02-24T11:35:28Z) - Fine-Tuning BERT for Sentiment Analysis of Vietnamese Reviews [0.0]
Experimental results on two datasets show that models using BERT slightly outperform other models using GloVe and FastText.
Our proposed BERT fine-tuning method produces a model with better performance than the original BERT fine-tuning method.
arXiv Detail & Related papers (2020-11-20T14:45:46Z) - Interpretable Multi-dataset Evaluation for Named Entity Recognition [110.64368106131062]
We present a general methodology for interpretable evaluation for the named entity recognition (NER) task.
The proposed evaluation method enables us to interpret the differences in models and datasets, as well as the interplay between them.
By making our analysis tool available, we make it easy for future researchers to run similar analyses and drive progress in this area.
arXiv Detail & Related papers (2020-11-13T10:53:27Z) - IDDA: a large-scale multi-domain dataset for autonomous driving [16.101248613062292]
This paper contributes a new large-scale synthetic dataset for semantic segmentation with more than 100 different source visual domains.
The dataset has been created to explicitly address the challenges of domain shift between training and test data in various weather and view point conditions.
arXiv Detail & Related papers (2020-04-17T15:22:38Z)