Shuffle & Divide: Contrastive Learning for Long Text
- URL: http://arxiv.org/abs/2304.09374v1
- Date: Wed, 19 Apr 2023 02:02:29 GMT
- Title: Shuffle & Divide: Contrastive Learning for Long Text
- Authors: Joonseok Lee, Seongho Joe, Kyoungwon Park, Bogun Kim, Hoyoung Kang,
Jaeseon Park, Youngjune Gwon
- Abstract summary: We propose a self-supervised learning method for long text documents based on contrastive learning.
A key to our method is Shuffle and Divide (SaD), a simple text augmentation algorithm.
We have empirically evaluated our method by performing unsupervised text classification on the 20 Newsgroups, Reuters-21578, BBC, and BBCSport datasets.
- Score: 6.187839874846451
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a self-supervised learning method for long text documents based on
contrastive learning. A key to our method is Shuffle and Divide (SaD), a simple
text augmentation algorithm that sets up a pretext task required for
contrastive updates to BERT-based document embeddings. SaD shuffles the words
of an entire document and splits them into two sub-documents. The two
sub-documents are treated as a positive pair, leaving all
other documents in the corpus as negatives. After SaD, we repeat the
contrastive update and clustering phases until convergence. Labeling text
documents is naturally a time-consuming, cumbersome task, and our method can
help reduce human effort, which is among the most expensive resources in AI. We have
empirically evaluated our method by performing unsupervised text classification
on the 20 Newsgroups, Reuters-21578, BBC, and BBCSport datasets. In particular,
our method pushes the current state-of-the-art, SS-SB-MT, on 20 Newsgroups by
20.94% in accuracy. We also achieve the state-of-the-art performance on
Reuters-21578 and exceptionally-high accuracy performances (over 95%) for
unsupervised classification on the BBC and BBCSport datasets.
Related papers
- A Novel Dataset for Non-Destructive Inspection of Handwritten Documents [0.0]
Forensic handwriting examination aims to examine handwritten documents in order to properly define or hypothesize the manuscript's author.
We propose a new and challenging dataset consisting of two subsets: the first consists of 21 documents written either with the classic "pen and paper" approach (and later digitized) or directly acquired on common devices such as tablets.
Preliminary results on the proposed datasets show that 90% classification accuracy can be achieved on the first subset.
arXiv Detail & Related papers (2024-01-09T09:25:58Z) - Improving Text Embeddings with Large Language Models [59.930513259982725]
We introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps.
We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages.
Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data.
arXiv Detail & Related papers (2023-12-31T02:13:18Z) - Summarization-based Data Augmentation for Document Classification [16.49709049899731]
We propose a simple yet effective summarization-based data augmentation, SUMMaug, for document classification.
We first obtain easy-to-learn examples for the target document classification task.
We then use the generated pseudo examples to perform curriculum learning.
arXiv Detail & Related papers (2023-12-01T11:34:37Z) - In-context Pretraining: Language Modeling Beyond Document Boundaries [137.53145699439898]
In-Context Pretraining is a new approach where language models are pretrained on a sequence of related documents.
We introduce approximate algorithms for finding related documents with efficient nearest neighbor search.
We see notable improvements in tasks that require more complex contextual reasoning.
arXiv Detail & Related papers (2023-10-16T17:57:12Z) - DAPR: A Benchmark on Document-Aware Passage Retrieval [57.45793782107218]
We propose and name this task Document-Aware Passage Retrieval (DAPR).
While analyzing the errors of the State-of-The-Art (SoTA) passage retrievers, we find the major errors (53.5%) are due to missing document context.
Our created benchmark enables future research on developing and comparing retrieval systems for the new task.
arXiv Detail & Related papers (2023-05-23T10:39:57Z) - LeQua@CLEF2022: Learning to Quantify [76.22817970624875]
LeQua 2022 is a new lab for the evaluation of methods for "learning to quantify" in textual datasets.
The goal of this lab is to provide a setting for the comparative evaluation of methods for learning to quantify, both in the binary setting and in the single-label multiclass setting.
arXiv Detail & Related papers (2021-11-22T14:54:20Z) - Pre-training for Abstractive Document Summarization by Reinstating
Source Text [105.77348528847337]
This paper presents three pre-training objectives which allow us to pre-train a Seq2Seq based abstractive summarization model on unlabeled text.
Experiments on two benchmark summarization datasets show that all three objectives can improve performance upon baselines.
arXiv Detail & Related papers (2020-04-04T05:06:26Z) - Learning to Select Bi-Aspect Information for Document-Scale Text Content
Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer.
In detail, the input is a set of structured records and a reference text for describing another recordset.
The output is a summary that accurately describes the partial content in the source recordset with the same writing style of the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.