Exploiting Twitter as Source of Large Corpora of Weakly Similar Pairs
for Semantic Sentence Embeddings
- URL: http://arxiv.org/abs/2110.02030v1
- Date: Tue, 5 Oct 2021 13:21:40 GMT
- Authors: Marco Di Giovanni and Marco Brambilla
- Abstract summary: We propose a language-independent approach to build large datasets of weakly similar pairs of informal texts.
We exploit Twitter's intrinsic powerful signals of relatedness: replies and quotes of tweets.
Our model not only learns classical Semantic Textual Similarity, but also excels on tasks where pairs of sentences are not exact paraphrases.
- Score: 3.8073142980733
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Semantic sentence embeddings are usually built in a supervised
fashion, by minimizing distances between pairs of embeddings of sentences
labelled as semantically similar by annotators. Since large labelled datasets
are rare and expensive to produce, in particular for non-English languages,
recent studies focus on unsupervised approaches that require unpaired input
sentences. We instead propose a language-independent approach to build large
datasets of weakly similar pairs of informal texts, without manual human
effort, exploiting Twitter's intrinsic, powerful signals of relatedness:
replies and quotes of tweets. We use the collected pairs to train a
Transformer model with triplet-like structures, and we test the generated
embeddings on Twitter NLP similarity tasks (PIT and TURL) and STSb. We also
introduce four new sentence ranking evaluation benchmarks of informal texts,
carefully extracted from the initial collections of tweets, proving that our
best model not only learns classical Semantic Textual Similarity, but also
excels on tasks where pairs of sentences are not exact paraphrases. Ablation
studies reveal that increasing the corpus size positively influences the
results, even at 2M samples, suggesting that bigger collections of tweets
still do not contain redundant information about semantic similarities.
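The training signal described in the abstract, pulling a tweet's embedding toward its reply or quote and away from an unrelated tweet, is a triplet-style objective. Below is a minimal, self-contained sketch of a triplet margin loss with cosine distance; the function names and the margin value are illustrative assumptions, not the authors' code.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def triplet_margin_loss(anchor, positive, negative, margin=0.5):
    """Penalize triplets where the anchor (a tweet) is not closer to the
    positive (its reply/quote) than to the negative (an unrelated tweet)
    by at least `margin`; zero loss once the margin is satisfied."""
    return max(0.0, cosine_distance(anchor, positive)
                    - cosine_distance(anchor, negative) + margin)
```

In a real setup the three vectors would come from the Transformer encoder and the loss would be minimized over many sampled triplets; here the vectors are plain lists to keep the sketch dependency-free.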
Related papers
- Concept-Guided Chain-of-Thought Prompting for Pairwise Comparison Scaling of Texts with Large Language Models [3.9940425551415597]
Existing text scaling methods often require a large corpus, struggle with short texts, or require labeled data.
We develop a text scaling method that leverages the pattern recognition capabilities of generative large language models.
We demonstrate how combining substantive knowledge with LLMs can create state-of-the-art measures of abstract concepts.
arXiv Detail & Related papers (2023-10-18T15:34:37Z) - Towards Unsupervised Recognition of Token-level Semantic Differences in Related Documents [61.63208012250885]
We formulate recognizing semantic differences as a token-level regression task.
We study three unsupervised approaches that rely on a masked language model.
Our results show that an approach based on word alignment and sentence-level contrastive learning has a robust correlation to gold labels.
arXiv Detail & Related papers (2023-05-22T17:58:04Z) - Improving Sentence Similarity Estimation for Unsupervised Extractive Summarization [21.602394765472386]
We propose two novel strategies to improve sentence similarity estimation for unsupervised extractive summarization.
We use contrastive learning to optimize a document-level objective that sentences from the same document are more similar than those from different documents.
We also use mutual learning to enhance the relationship between sentence similarity estimation and sentence salience ranking.
arXiv Detail & Related papers (2023-02-24T07:10:33Z) - Relational Sentence Embedding for Flexible Semantic Matching [86.21393054423355]
We present Relational Sentence Embedding (RSE), a new paradigm to further explore the potential of sentence embeddings.
RSE is effective and flexible in modeling sentence relations and outperforms a series of state-of-the-art embedding methods.
arXiv Detail & Related papers (2022-12-17T05:25:17Z) - Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlapping frequently occurs in paired texts in natural language processing tasks like text editing and semantic similarity evaluation.
This paper aims to address the issue with a mask-and-predict strategy.
We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions at their positions.
Experiments on Semantic Textual Similarity show the resulting neighboring distribution divergence (NDD) to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
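The mask-and-predict strategy first needs the positions of the shared ("neighboring") words in both texts. A minimal sketch of that first step, using Python's `difflib` as an approximation of the longest common sequence; the MLM prediction step would require a pretrained model and is omitted, and all names are illustrative rather than the paper's code.

```python
from difflib import SequenceMatcher

def common_word_positions(tokens_a, tokens_b):
    """Return (position-in-a, position-in-b) pairs for words shared by the
    two token lists, approximating the longest common sequence; these are
    the positions that would be masked and re-predicted with an MLM."""
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    pairs = []
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            pairs.append((block.a + k, block.b + k))
    return pairs
```

For highly overlapped pairs such as text-editing inputs and outputs, most positions match, which is exactly the regime where the paper argues surface overlap misleads plain distance metrics.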
arXiv Detail & Related papers (2021-10-04T03:59:15Z) - A Case Study to Reveal if an Area of Interest has a Trend in Ongoing Tweets Using Word and Sentence Embeddings [0.0]
We have proposed an easily applicable automated methodology in which Daily Mean Similarity Scores show the similarity between the daily tweet corpus and the target words.
The Daily Mean Similarity Scores are mainly based on cosine similarity and word/sentence embeddings.
We have also compared the effectiveness of using word versus sentence embeddings while applying our methodology and found that both give almost the same results.
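Given precomputed embeddings, the Daily Mean Similarity Score reduces to an average cosine similarity between one day's tweet embeddings and the target embedding. A minimal sketch under that assumption; the function names are illustrative, not the authors' code.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def daily_mean_similarity(tweet_embeddings, target_embedding):
    """Average cosine similarity between a day's tweet embeddings and the
    embedding of the target word or sentence; tracked day by day, this
    series indicates whether the topic is trending in the tweet stream."""
    sims = [cosine_similarity(t, target_embedding) for t in tweet_embeddings]
    return sum(sims) / len(sims)
```

The same function works whether the vectors come from word embeddings (averaged per tweet) or from a sentence encoder, which is the comparison the study performs.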
arXiv Detail & Related papers (2021-10-02T18:44:55Z) - Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration [25.159601117722936]
We propose a contrastive fine-tuning objective that enables BERT to produce more powerful phrase embeddings.
Our approach relies on a dataset of diverse phrasal paraphrases, which is automatically generated using a paraphrase generation model.
As a case study, we show that Phrase-BERT embeddings can be easily integrated with a simple autoencoder to build a phrase-based neural topic model.
arXiv Detail & Related papers (2021-09-13T20:31:57Z) - More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z) - Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as their informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study presents an assessment of existing language models in distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z) - A Comparative Study on Structural and Semantic Properties of Sentence Embeddings [77.34726150561087]
We propose a set of experiments using a widely-used large-scale data set for relation extraction.
We show that different embedding spaces have different degrees of strength for the structural and semantic properties.
These results provide useful information for developing embedding-based relation extraction methods.
arXiv Detail & Related papers (2020-09-23T15:45:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.