Comparative analysis of word embeddings in assessing semantic similarity
of complex sentences
- URL: http://arxiv.org/abs/2010.12637v3
- Date: Fri, 9 Jul 2021 21:15:24 GMT
- Title: Comparative analysis of word embeddings in assessing semantic similarity
of complex sentences
- Authors: Dhivya Chandrasekaran and Vijay Mago
- Abstract summary: We study the sentences in existing benchmark datasets and analyze the sensitivity of various word embeddings with respect to the complexity of the sentences.
The results show that the increase in the complexity of the sentences has a significant impact on the performance of the embedding models.
- Score: 8.873705500708196
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Semantic textual similarity is one of the open research challenges in the
field of Natural Language Processing. Extensive research has been carried out
in this field and near-perfect results are achieved by recent transformer-based
models in existing benchmark datasets like the STS dataset and the SICK
dataset. In this paper, we study the sentences in these datasets and analyze
the sensitivity of various word embeddings with respect to the complexity of
the sentences. We build a complex-sentence dataset comprising 50 sentence
pairs with associated semantic similarity values provided by 15 human
annotators. Readability analysis is performed to highlight the increase in
complexity of the sentences in the existing benchmark datasets and those in the
proposed dataset. Further, we perform a comparative analysis of the performance
of various word embeddings and language models on the existing benchmark
datasets and the proposed dataset. The results show that the increase in the
complexity of the sentences has a significant impact on the performance of the
embedding models, resulting in a 10-20% decrease in Pearson's and Spearman's
correlation coefficients.
Related papers
- RepMatch: Quantifying Cross-Instance Similarities in Representation Space [15.215985417763472]
We introduce RepMatch, a novel method that characterizes data through the lens of similarity.
RepMatch quantifies the similarity between subsets of training instances by comparing the knowledge encoded in models trained on them.
We validate the effectiveness of RepMatch across multiple NLP tasks, datasets, and models.
arXiv Detail & Related papers (2024-10-12T20:42:28Z)
- Revisiting the Phenomenon of Syntactic Complexity Convergence on German Dialogue Data [2.7038841665524846]
We revisit the phenomenon of syntactic complexity convergence in conversational interaction, originally found for English dialogue.
We use a modified metric to quantify syntactic complexity based on dependency parsing.
arXiv Detail & Related papers (2024-08-22T07:49:41Z)
- Towards Enhancing Coherence in Extractive Summarization: Dataset and Experiments with LLMs [70.15262704746378]
We propose a systematically created human-annotated dataset consisting of coherent summaries for five publicly available datasets and natural language user feedback.
Preliminary experiments with Falcon-40B and Llama-2-13B show significant performance improvements (10% ROUGE-L) in producing coherent summaries.
arXiv Detail & Related papers (2024-07-05T20:25:04Z) - How Well Do Text Embedding Models Understand Syntax? [50.440590035493074]
The ability of text embedding models to generalize across a wide range of syntactic contexts remains under-explored.
Our findings reveal that existing text embedding models have not sufficiently addressed these syntactic understanding challenges.
We propose strategies to augment the generalization ability of text embedding models in diverse syntactic scenarios.
arXiv Detail & Related papers (2023-11-14T08:51:00Z) - Importance of Synthesizing High-quality Data for Text-to-SQL Parsing [71.02856634369174]
State-of-the-art text-to-SQL algorithms did not further improve on popular benchmarks when trained with augmented synthetic data.
We propose a novel framework that incorporates key relationships from the schema, imposes strong typing, and applies schema-weighted column sampling.
arXiv Detail & Related papers (2022-12-17T02:53:21Z) - Domain Adaptation in Multilingual and Multi-Domain Monolingual Settings
for Complex Word Identification [0.27998963147546146]
Complex word identification (CWI) is a cornerstone process towards proper text simplification.
CWI is highly dependent on context, whereas its difficulty is augmented by the scarcity of available datasets.
We propose a novel training technique for the CWI task based on domain adaptation to improve the target character and context representations.
arXiv Detail & Related papers (2022-05-15T13:21:02Z) - Structurally Diverse Sampling Reduces Spurious Correlations in Semantic
Parsing Datasets [51.095144091781734]
We propose a novel algorithm for sampling a structurally diverse set of instances from a labeled instance pool with structured outputs.
We show that our algorithm performs competitively with or better than prior algorithms in not only compositional template splits but also traditional IID splits.
In general, we find that diverse train sets lead to better generalization than random training sets of the same size in 9 out of 10 dataset-split pairs.
arXiv Detail & Related papers (2022-03-16T07:41:27Z) - What Makes Sentences Semantically Related: A Textual Relatedness Dataset
and Empirical Study [31.062129406113588]
We introduce a dataset for Semantic Textual Relatedness, STR-2022, that has 5,500 English sentence pairs manually annotated.
We show that human intuition regarding relatedness of sentence pairs is highly reliable, with a repeat annotation correlation of 0.84.
We also show the utility of STR-2022 for evaluating automatic methods of sentence representation and for various downstream NLP tasks.
arXiv Detail & Related papers (2021-10-10T16:23:54Z) - Text-to-SQL in the Wild: A Naturally-Occurring Dataset Based on Stack
Exchange Data [3.06261471569622]
SEDE is a dataset with 12,023 pairs of utterances and SQL queries collected from real usage on the Stack Exchange website.
We show that these pairs contain a variety of real-world challenges which were rarely reflected so far in any other semantic parsing dataset.
arXiv Detail & Related papers (2021-06-09T12:09:51Z) - A Comparative Study on Structural and Semantic Properties of Sentence
Embeddings [77.34726150561087]
We propose a set of experiments using a widely used large-scale dataset for relation extraction.
We show that different embedding spaces have different degrees of strength for the structural and semantic properties.
These results provide useful information for developing embedding-based relation extraction methods.
arXiv Detail & Related papers (2020-09-23T15:45:32Z) - Extractive Summarization as Text Matching [123.09816729675838]
This paper creates a paradigm shift with regard to the way we build neural extractive summarization systems.
We formulate the extractive summarization task as a semantic text matching problem.
We have driven the state-of-the-art extractive result on CNN/DailyMail to a new level (44.41 in ROUGE-1).
arXiv Detail & Related papers (2020-04-19T08:27:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.