Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to
Corpus Exploration
- URL: http://arxiv.org/abs/2109.06304v1
- Date: Mon, 13 Sep 2021 20:31:57 GMT
- Title: Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to
Corpus Exploration
- Authors: Shufan Wang and Laure Thompson and Mohit Iyyer
- Abstract summary: We propose a contrastive fine-tuning objective that enables BERT to produce more powerful phrase embeddings.
Our approach relies on a dataset of diverse phrasal paraphrases, which is automatically generated using a paraphrase generation model.
As a case study, we show that Phrase-BERT embeddings can be easily integrated with a simple autoencoder to build a phrase-based neural topic model.
- Score: 25.159601117722936
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Phrase representations derived from BERT often do not exhibit complex phrasal
compositionality, as the model relies instead on lexical similarity to
determine semantic relatedness. In this paper, we propose a contrastive
fine-tuning objective that enables BERT to produce more powerful phrase
embeddings. Our approach (Phrase-BERT) relies on a dataset of diverse phrasal
paraphrases, which is automatically generated using a paraphrase generation
model, as well as a large-scale dataset of phrases in context mined from the
Books3 corpus. Phrase-BERT outperforms baselines across a variety of
phrase-level similarity tasks, while also demonstrating increased lexical
diversity between nearest neighbors in the vector space. Finally, as a case
study, we show that Phrase-BERT embeddings can be easily integrated with a
simple autoencoder to build a phrase-based neural topic model that interprets
topics as mixtures of words and phrases by performing a nearest neighbor search
in the embedding space. Crowdsourced evaluations demonstrate that this
phrase-based topic model produces more coherent and meaningful topics than
baseline word and phrase-level topic models, further validating the utility of
Phrase-BERT.
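To make the fine-tuning idea concrete, below is a minimal sketch of a contrastive (triplet-style) phrase-embedding objective on top of BERT. The model name, mean pooling, margin value, and example triples are illustrative assumptions, not the paper's exact training setup.

```python
# A minimal sketch of contrastive fine-tuning for phrase embeddings, assuming
# bert-base-uncased, mean pooling, and a triplet margin loss over
# (phrase, paraphrase, hard negative) triples; none of these choices are
# claimed to match the paper's exact configuration.
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(phrases):
    """Mean-pool BERT token states into one fixed-size vector per phrase."""
    batch = tokenizer(phrases, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state           # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # (B, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # (B, H)

def phrase_contrastive_loss(anchors, paraphrases, negatives, margin=1.0):
    """Pull paraphrase pairs together, push lexically similar non-paraphrases apart."""
    a, p, n = embed(anchors), embed(paraphrases), embed(negatives)
    return F.triplet_margin_loss(a, p, n, margin=margin)

# Hypothetical triple: the negative overlaps lexically with the anchor but is
# not a paraphrase, which is the case plain BERT embeddings tend to get wrong.
loss = phrase_contrastive_loss(
    anchors=["a delightful little film"],
    paraphrases=["a charming small movie"],
    negatives=["a disastrous little film"],
)
loss.backward()  # a real training loop would follow this with an optimizer step
```

In the same spirit, the topic-model case study interprets topics via nearest neighbors in the embedding space; the function below is a hypothetical illustration of such a lookup over a precomputed phrase-embedding matrix, not the paper's implementation.

```python
import torch

def nearest_phrases(topic_vec, phrase_matrix, vocab, k=5):
    """Interpret a topic vector as its k nearest words/phrases in embedding space."""
    sims = torch.nn.functional.cosine_similarity(phrase_matrix, topic_vec.unsqueeze(0))
    return [vocab[int(i)] for i in torch.topk(sims, k).indices]
```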
Related papers
- CAST: Corpus-Aware Self-similarity Enhanced Topic modelling [16.562349140796115]
We introduce CAST: Corpus-Aware Self-similarity Enhanced Topic modelling, a novel topic modelling method.
We find self-similarity to be an effective metric to prevent functional words from acting as candidate topic words.
Our approach significantly enhances the coherence and diversity of generated topics, as well as the topic model's ability to handle noisy data.
arXiv Detail & Related papers (2024-10-19T15:27:11Z)
- Neural paraphrasing by automatically crawled and aligned sentence pairs [11.95795974003684]
The main obstacle to neural-network-based paraphrasing is the lack of large datasets with aligned pairs of sentences and paraphrases.
We present a method for the automatic generation of large aligned corpora, based on the assumption that news and blog websites describe the same events in different narrative styles.
We propose a similarity search procedure with linguistic constraints that, given a reference sentence, can locate the most similar candidate paraphrases from millions of indexed sentences.
arXiv Detail & Related papers (2024-02-16T10:40:38Z)
- ParaAMR: A Large-Scale Syntactically Diverse Paraphrase Dataset by AMR Back-Translation [59.91139600152296]
ParaAMR is a large-scale syntactically diverse paraphrase dataset created by abstract meaning representation back-translation.
We show that ParaAMR can be used to improve on three NLP tasks: learning sentence embeddings, syntactically controlled paraphrase generation, and data augmentation for few-shot learning.
arXiv Detail & Related papers (2023-05-26T02:27:33Z)
- Approximate Nearest Neighbour Phrase Mining for Contextual Speech Recognition [5.54562323810107]
We train end-to-end Context-Aware Transformer Transducer (CATT) models using a simple yet efficient method of mining hard negative phrases from the latent space of the context encoder.
During training, given a reference query, we mine a number of similar phrases using approximate nearest neighbour search.
These sampled phrases are then used as negative examples in the context list alongside random and ground truth contextual information (see the sketch after this entry).
arXiv Detail & Related papers (2023-04-18T09:52:11Z)
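The mining step described in this entry lends itself to a short sketch. The snippet below substitutes brute-force cosine similarity for a real approximate nearest neighbour index, and all names and values in it are illustrative assumptions rather than details from the paper.

```python
# Sketch of hard-negative phrase mining for contextual biasing: given a
# reference query embedding, retrieve the most similar phrases (excluding the
# ground truth) to use as negatives. Brute-force cosine similarity stands in
# for a real ANN index; every name and value here is an assumption.
import numpy as np

def mine_hard_negatives(query_vec, phrase_vecs, phrases, ground_truth, top_k=5):
    """Return the top_k phrases most similar to the query, skipping the ground truth."""
    q = query_vec / np.linalg.norm(query_vec)
    p = phrase_vecs / np.linalg.norm(phrase_vecs, axis=1, keepdims=True)
    order = np.argsort(-(p @ q))  # indices sorted by descending cosine similarity
    negatives = [phrases[i] for i in order if phrases[i] != ground_truth]
    return negatives[:top_k]
```

During training, such mined phrases would sit in the context list next to random phrases and the ground-truth context, as the summary above describes.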
- Topics in the Haystack: Extracting and Evaluating Topics beyond Coherence [0.0]
We propose a method that incorporates a deeper understanding of both sentence and document themes.
This allows our model to detect latent topics that may include uncommon words or neologisms.
We report correlation coefficients with human identification of intruder words and achieve near-human-level results on the word-intrusion task.
arXiv Detail & Related papers (2023-03-30T12:24:25Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- Relational Sentence Embedding for Flexible Semantic Matching [86.21393054423355]
We present Relational Sentence Embedding (RSE), a new paradigm that further explores the potential of sentence embeddings.
RSE is effective and flexible in modeling sentence relations and outperforms a series of state-of-the-art embedding methods.
arXiv Detail & Related papers (2022-12-17T05:25:17Z)
- Keywords and Instances: A Hierarchical Contrastive Learning Framework Unifying Hybrid Granularities for Text Generation [59.01297461453444]
We propose a hierarchical contrastive learning mechanism that unifies semantic meaning across hybrid granularities in the input text.
Experiments demonstrate that our model outperforms competitive baselines on paraphrasing, dialogue generation, and storytelling tasks.
arXiv Detail & Related papers (2022-05-26T13:26:03Z)
- A Comparative Study on Structural and Semantic Properties of Sentence Embeddings [77.34726150561087]
We propose a set of experiments using a widely-used large-scale data set for relation extraction.
We show that different embedding spaces have different degrees of strength for the structural and semantic properties.
These results provide useful information for developing embedding-based relation extraction methods.
arXiv Detail & Related papers (2020-09-23T15:45:32Z)
- BURT: BERT-inspired Universal Representation from Twin Structure [89.82415322763475]
BURT (BERT inspired Universal Representation from Twin Structure) is capable of generating universal, fixed-size representations for input sequences of any granularity.
Our proposed BURT adopts a Siamese network, learning sentence-level representations from a natural language inference dataset and word/phrase-level representations from a paraphrase dataset.
We evaluate BURT across different granularities of text similarity tasks, including STS tasks, SemEval2013 Task 5(a) and some commonly used word similarity tasks.
arXiv Detail & Related papers (2020-04-29T04:01:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.