Unsupervised Sentence Representation Learning with Frequency-induced
Adversarial Tuning and Incomplete Sentence Filtering
- URL: http://arxiv.org/abs/2305.08655v1
- Date: Mon, 15 May 2023 13:59:23 GMT
- Title: Unsupervised Sentence Representation Learning with Frequency-induced
Adversarial Tuning and Incomplete Sentence Filtering
- Authors: Bing Wang, Ximing Li, Zhiyao Yang, Yuanyuan Guan, Jiayin Li,
Shengsheng Wang
- Abstract summary: We propose Sentence Representation Learning with Frequency-induced Adversarial tuning and Incomplete sentence filtering (SLT-FAI)
PLMs are sensitive to the frequency information of words from their pre-training corpora, resulting in an anisotropic embedding space.
We incorporate an information discriminator to distinguish the embeddings of original sentences from those of incomplete sentences created by randomly masking several low-frequency words.
- Score: 14.085826003974187
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained Language Models (PLMs) are nowadays the mainstay of
Unsupervised Sentence Representation Learning (USRL). However, PLMs are
sensitive to the frequency information of words from their pre-training
corpora, resulting in an anisotropic embedding space, where the embeddings of
high-frequency words are clustered while those of low-frequency words are
dispersed sparsely. This anisotropy causes two problems, similarity bias and
information bias, which lower the quality of sentence embeddings. To solve
them, we fine-tune PLMs by leveraging the frequency information of words and
propose a novel USRL framework, namely Sentence Representation Learning with
Frequency-induced Adversarial tuning and Incomplete sentence filtering
(SLT-FAI). We calculate the word frequencies over the pre-training corpora of
PLMs and assign each word a frequency label by thresholding. With these
labels, (1) we incorporate a similarity discriminator that distinguishes the
embeddings of high-frequency and low-frequency words, and adversarially tune
the PLM against it, yielding a uniform, frequency-invariant embedding space;
and (2) we propose a novel incomplete sentence detection task, in which an
information discriminator distinguishes the embeddings of original sentences
from those of incomplete sentences created by randomly masking several
low-frequency words, emphasizing the more informative low-frequency words.
SLT-FAI is a flexible, plug-and-play framework that can be integrated with
existing USRL techniques. We evaluate SLT-FAI with various backbones on
benchmark datasets, and empirical results indicate that SLT-FAI is superior
to existing USRL baselines. Our code is released at
https://github.com/wangbing1416/SLT-FAI.
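To make the two objectives concrete, here is a minimal PyTorch sketch, not the authors' implementation (the released repository has that); the encoder outputs, dimensions, and the gradient-reversal trick used for the adversarial game are illustrative assumptions.
```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negated gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x
    @staticmethod
    def backward(ctx, grad_out):
        return -grad_out

def mask_low_freq(token_ids, vocab_freq, threshold, mask_id, p=0.5):
    """Create an 'incomplete' sentence by randomly masking low-frequency tokens."""
    low = vocab_freq[token_ids] < threshold          # thresholded frequency label
    drop = low & (torch.rand(token_ids.shape) < p)
    return torch.where(drop, torch.full_like(token_ids, mask_id), token_ids)

ids = torch.tensor([2, 7, 1, 9])
vocab_freq = torch.tensor([90., 3., 80., 5., 60., 2., 70., 1., 50., 4.])
incomplete_ids = mask_low_freq(ids, vocab_freq, threshold=10.0, mask_id=103)

hidden = 32
sim_disc = nn.Linear(hidden, 2)   # high- vs low-frequency word embeddings
info_disc = nn.Linear(hidden, 2)  # original vs incomplete sentence embeddings
ce = nn.CrossEntropyLoss()

# Toy stand-ins for encoder outputs and labels.
word_emb = torch.randn(8, hidden, requires_grad=True)
freq_label = torch.randint(0, 2, (8,))                       # 1 = high-frequency
sent_emb = torch.randn(4, hidden, requires_grad=True)        # original sentences
incomplete_emb = torch.randn(4, hidden, requires_grad=True)  # low-freq words masked

# (1) Adversarial tuning: gradient reversal pushes the encoder to erase the
# frequency cues the similarity discriminator relies on.
adv_loss = ce(sim_disc(GradReverse.apply(word_emb)), freq_label)

# (2) Incomplete sentence detection: the encoder must retain information from
# low-frequency words for the discriminator to spot the masked version.
pair = torch.cat([sent_emb, incomplete_emb], dim=0)
labels = torch.cat([torch.ones(4), torch.zeros(4)]).long()
(adv_loss + ce(info_disc(pair), labels)).backward()
```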
Related papers
- Zipfian Whitening [7.927385005964994]
Most approaches for modeling, correcting, and measuring the symmetry of an embedding space implicitly assume that the word frequencies are uniform.
In reality, word frequencies follow a highly non-uniform distribution, known as Zipf's law.
We show that simply performing PCA whitening weighted by the empirical word frequency that follows Zipf's law significantly improves task performance.
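The recipe is direct enough to sketch; below is a minimal NumPy version, assuming word vectors E and their corpus frequencies are available (variable names are illustrative):
```python
import numpy as np

def zipfian_whiten(E, freq):
    """PCA-whiten embeddings E (V x d), weighting each word by its frequency."""
    p = freq / freq.sum()             # empirical (Zipfian) word probabilities
    mu = p @ E                        # frequency-weighted mean
    X = E - mu
    cov = (X * p[:, None]).T @ X      # frequency-weighted covariance
    vals, vecs = np.linalg.eigh(cov)
    return X @ (vecs / np.sqrt(vals + 1e-12))   # rotate and rescale to identity

rng = np.random.default_rng(0)
E = rng.normal(size=(1000, 50))       # toy word vectors
freq = 1.0 / np.arange(1, 1001)       # Zipf-like frequencies
E_white = zipfian_whiten(E, freq)
```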
arXiv Detail & Related papers (2024-11-01T15:40:19Z)
- On the Noise Robustness of In-Context Learning for Text Generation [41.59602454113563]
In this work, we show that, on text generation tasks, noisy annotations significantly hurt the performance of in-context learning.
To circumvent this issue, we propose a simple and effective approach called Local Perplexity Ranking (LPR).
LPR replaces the "noisy" candidates with their nearest neighbors that are more likely to be clean.
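A rough sketch of the idea, with embed and perplexity as hypothetical stand-ins for a real sentence encoder and language model, and with the paper's ranking rule simplified to picking the lowest-perplexity local neighbor:
```python
import numpy as np

def local_perplexity_ranking(cands, embed, perplexity, k=3):
    """Swap each candidate for its lowest-perplexity near neighbor."""
    X = np.stack([embed(c) for c in cands])
    ppl = np.array([perplexity(c) for c in cands])
    cleaned = []
    for i in range(len(cands)):
        d = np.linalg.norm(X - X[i], axis=1)
        nbrs = np.argsort(d)[: k + 1]      # the candidate plus its k neighbors
        cleaned.append(cands[nbrs[np.argmin(ppl[nbrs])]])
    return cleaned

# Toy stand-ins; a real setup would use a sentence encoder and a language model.
rng = np.random.default_rng(0)
vecs, scores = rng.normal(size=(20, 8)), rng.uniform(1.0, 50.0, size=20)
cleaned = local_perplexity_ranking(list(range(20)),
                                   lambda c: vecs[c], lambda c: scores[c])
```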
arXiv Detail & Related papers (2024-05-27T15:22:58Z)
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
With a reasonable prompt, LLMs can exploit their generative capability to correct even tokens that are missing from the N-best list.
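A hedged sketch of how an N-best list might be handed to an LLM for correction; the prompt wording and the commented-out ask_llm call are assumptions, not the benchmark's API:
```python
def build_correction_prompt(nbest):
    """Format an N-best hypothesis list for LLM-based error correction."""
    hyps = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
    return ("Below are N-best hypotheses from a speech recognizer. Output the "
            "most likely true transcription, fixing errors even where the "
            "correct tokens appear in none of the hypotheses.\n" + hyps)

nbest = ["i scream for ice cream",
         "eye scream for ice cream",
         "i scream four ice cream"]
prompt = build_correction_prompt(nbest)
# transcription = ask_llm(prompt)  # hypothetical call to any chat/completions client
```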
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- Frequency effects in Linear Discriminative Learning [0.36248657646376703]
We show how an efficient yet frequency-informed mapping between form and meaning can be obtained (frequency-informed learning; FIL).
FIL shows a relatively low type accuracy but a high token accuracy, demonstrating that the model correctly processes most word tokens that speakers encounter in daily life.
Our results show how frequency effects in a learning model can be simulated efficiently, and raise questions about how to best account for low-frequency words in cognitive models.
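One plausible reading of FIL is ordinary least squares weighted by token frequency; the sketch below follows linear-discriminative-learning conventions (form matrix C, meaning matrix S), with the weighting scheme as an assumption:
```python
import numpy as np

rng = np.random.default_rng(1)
C = rng.normal(size=(200, 30))    # form (cue) matrix, one row per word
S = rng.normal(size=(200, 20))    # meaning (semantic) matrix
freq = 1.0 / np.arange(1, 201)    # Zipf-like token frequencies

w = freq / freq.sum()             # normalized frequency weights
Cw = C * w[:, None]               # weight each word's row by its frequency
# Ridge-stabilized weighted least squares: (C'WC + lam*I) F = C'WS
F = np.linalg.solve(Cw.T @ C + 1e-6 * np.eye(30), Cw.T @ S)
S_hat = C @ F                     # frequent words are reconstructed best
```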
arXiv Detail & Related papers (2023-06-19T16:15:46Z)
- Alleviating Over-smoothing for Unsupervised Sentence Representation [96.19497378628594]
We present a Simple method named Self-Contrastive Learning (SSCL) to alleviate the over-smoothing issue.
Our proposed method is quite simple and can be easily extended to various state-of-the-art models for performance boosting.
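The summary does not spell out the loss; the sketch below is one guess at a self-contrastive InfoNCE objective in which another view of the encoder's output (for example, an intermediate layer's sentence embeddings) supplies extra negatives:
```python
import torch
import torch.nn.functional as F

def self_contrastive_loss(final, positive, extra_view, tau=0.05):
    """InfoNCE where a second view of the encoder (e.g., an intermediate
    layer's sentence embeddings) supplies additional negatives."""
    final = F.normalize(final, dim=-1)
    cand = F.normalize(torch.cat([positive, extra_view], dim=0), dim=-1)
    logits = final @ cand.T / tau           # (B, 2B) similarity matrix
    labels = torch.arange(final.size(0))    # positives: first block's diagonal
    return F.cross_entropy(logits, labels)

B, d = 8, 32
loss = self_contrastive_loss(torch.randn(B, d), torch.randn(B, d),
                             torch.randn(B, d))
```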
arXiv Detail & Related papers (2023-05-09T11:00:02Z)
- M-Tuning: Prompt Tuning with Mitigated Label Bias in Open-Set Scenarios [103.6153593636399]
We propose a vision-language prompt tuning method with mitigated label bias (M-Tuning)
It introduces open words from WordNet to extend the prompt texts beyond closed-set label words, so that prompts are tuned in a simulated open-set scenario.
Our method achieves the best performance on datasets with various scales, and extensive ablation studies also validate its effectiveness.
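A sketch of the open-word step, assuming nltk's WordNet corpus is installed; the sampling procedure is an illustration, not the paper's exact recipe:
```python
import random
from nltk.corpus import wordnet as wn   # requires nltk.download("wordnet")

def extend_with_open_words(closed_labels, n_open=50, seed=0):
    """Pad a closed label set with extra WordNet nouns to simulate an open set."""
    nouns = {l.name().replace("_", " ")
             for s in wn.all_synsets("n") for l in s.lemmas()}
    nouns -= set(closed_labels)
    random.seed(seed)
    return list(closed_labels) + random.sample(sorted(nouns), n_open)

labels = extend_with_open_words(["cat", "dog", "car"])
prompts = [f"a photo of a {w}" for w in labels]   # learnable prompt context omitted
```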
arXiv Detail & Related papers (2023-03-09T09:05:47Z)
- Refined Semantic Enhancement towards Frequency Diffusion for Video Captioning [29.617527535279574]
Video captioning aims to generate natural language sentences that describe the given video accurately.
Existing methods obtain favorable generation by exploring richer visual representations in the encoding phase or by improving the decoding ability.
We introduce a novel Refined Semantic enhancement method towards Frequency Diffusion (RSFD), a captioning model that constantly perceives the linguistic representation of the infrequent tokens.
arXiv Detail & Related papers (2022-11-28T05:45:17Z)
- ADEPT: A DEbiasing PrompT Framework [49.582497203415855]
Fine-tuning is an applicable approach for debiasing contextualized word embeddings, and discrete prompts with semantic meanings have been shown to be effective in debiasing tasks.
We propose ADEPT, a method to debias PLMs using prompt tuning while maintaining the delicate balance between removing biases and ensuring representation ability.
arXiv Detail & Related papers (2022-11-10T08:41:40Z)
- SLICER: Learning universal audio representations using low-resource self-supervised pre-training [53.06337011259031]
We present a new Self-Supervised Learning approach to pre-train encoders on unlabeled audio data.
Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks.
arXiv Detail & Related papers (2022-11-02T23:45:33Z)
- Frequency-Aware Contrastive Learning for Neural Machine Translation [24.336356651877388]
Low-frequency word prediction remains a challenge in modern neural machine translation (NMT) systems.
Inspired by the observation that low-frequency words form a more compact embedding space, we tackle this challenge from a representation learning perspective.
We propose a frequency-aware token-level contrastive learning method, in which the hidden state of each decoding step is pushed away from the counterparts of other target words.
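A hedged sketch of such a token-level contrastive loss; the frequency-aware part is modeled here as a per-token weight favoring rare tokens, which is an assumption about the mechanism:
```python
import torch
import torch.nn.functional as F

def token_contrastive_loss(h, tgt_emb, tgt_ids, freq, tau=0.1):
    """Pull each decoder state toward its own target token and push it away
    from the other target tokens in the batch, weighting rare tokens more."""
    h, e = F.normalize(h, dim=-1), F.normalize(tgt_emb, dim=-1)
    logits = h @ e.T / tau                   # (N, N) token-vs-token similarities
    loss = F.cross_entropy(logits, torch.arange(h.size(0)), reduction="none")
    w = 1.0 / (freq[tgt_ids].float() + 1.0)  # assumed frequency-aware weighting
    return (w * loss).mean()

N, d, V = 16, 64, 1000
freq = torch.randint(1, 100, (V,))
ids = torch.randint(0, V, (N,))
loss = token_contrastive_loss(torch.randn(N, d), torch.randn(N, d), ids, freq)
```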
arXiv Detail & Related papers (2021-12-29T10:10:10Z)
- Single-channel speech separation using Soft-minimum Permutation Invariant Training [60.99112031408449]
A long-standing problem in supervised speech separation is finding the correct label for each separated speech signal.
Permutation Invariant Training (PIT) has been shown to be a promising solution in handling the label ambiguity problem.
In this work, we propose a probabilistic optimization framework to address the inefficiency of PIT in finding the best output-label assignment.
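One natural relaxation matching this description is a temperature-controlled soft-min over permutation losses, recovering standard PIT as the temperature goes to zero; the sketch below assumes exactly that, which may differ from the paper's formulation:
```python
from itertools import permutations
import torch

def soft_min_pit_loss(est, ref, tau=0.1):
    """est, ref: (S, T) separated and reference signals for S sources."""
    losses = torch.stack([((est[list(p)] - ref) ** 2).mean()
                          for p in permutations(range(est.size(0)))])
    return -tau * torch.logsumexp(-losses / tau, dim=0)  # hard min as tau -> 0

est, ref = torch.randn(3, 100), torch.randn(3, 100)
loss = soft_min_pit_loss(est, ref)
```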
arXiv Detail & Related papers (2021-11-16T17:25:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.