Shuffle & Divide: Contrastive Learning for Long Text
- URL: http://arxiv.org/abs/2304.09374v1
- Date: Wed, 19 Apr 2023 02:02:29 GMT
- Title: Shuffle & Divide: Contrastive Learning for Long Text
- Authors: Joonseok Lee, Seongho Joe, Kyoungwon Park, Bogun Kim, Hoyoung Kang,
Jaeseon Park, Youngjune Gwon
- Abstract summary: We propose a self-supervised learning method for long text documents based on contrastive learning.
A key to our method is Shuffle and Divide (SaD), a simple text augmentation algorithm.
We have empirically evaluated our method by performing unsupervised text classification on the 20 Newsgroups, Reuters-21578, BBC, and BBCSport datasets.
- Score: 6.187839874846451
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a self-supervised learning method for long text documents based on
contrastive learning. A key to our method is Shuffle and Divide (SaD), a simple
text augmentation algorithm that sets up a pretext task required for
contrastive updates to BERT-based document embedding. SaD shuffles the words
of an entire document and splits the shuffled sequence into two
sub-documents. The two sub-documents are treated as a positive pair, leaving
all other documents in the corpus as negatives. After SaD, we repeat the
contrastive update and clustering phases until convergence. Labeling text
documents is naturally a time-consuming, cumbersome task, and our method can
help reduce human effort, which is among the most expensive resources in AI.
We have empirically evaluated our method by performing unsupervised text
classification on the 20 Newsgroups, Reuters-21578, BBC, and BBCSport
datasets. In particular, our method improves on the current state of the
art, SS-SB-MT, by 20.94% accuracy on 20 Newsgroups. We also achieve
state-of-the-art performance on Reuters-21578 and exceptionally high
accuracy (over 95%) for unsupervised classification on the BBC and BBCSport
datasets.
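As an illustration, the augmentation and contrastive objective described in the abstract might look like the following minimal sketch. This is not the authors' code: the whitespace tokenization, the temperature value, the InfoNCE-style formulation, and the in-batch negative approximation are all assumptions, and the BERT-based embedding and alternating clustering phase are not shown.

```python
import random

import torch
import torch.nn.functional as F


def shuffle_and_divide(document: str, rng: random.Random) -> tuple[str, str]:
    """Shuffle & Divide (SaD): shuffle all words of a document, then split
    the shuffled sequence in half to obtain two sub-documents that form a
    positive pair. Whitespace tokenization is an assumption here."""
    words = document.split()
    rng.shuffle(words)
    mid = len(words) // 2
    return " ".join(words[:mid]), " ".join(words[mid:])


def contrastive_loss(z1: torch.Tensor, z2: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """In-batch InfoNCE-style loss: the two SaD halves of each document are
    positives; halves of every other document in the batch act as negatives
    (an in-batch approximation of "all other documents in the corpus")."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)  # diagonal = positives
    return F.cross_entropy(logits, targets)
```

Here `z1` and `z2` would be, for example, pooled BERT embeddings of the two sub-documents returned by `shuffle_and_divide` for each document in a batch.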
Related papers
- Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings.
First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss.
Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z)
- Unifying Multimodal Retrieval via Document Screenshot Embedding [92.03571344075607]
Document Screenshot Embedding (DSE) is a novel retrieval paradigm that regards document screenshots as a unified input format.
We first craft Wiki-SS, a corpus of 1.3M Wikipedia web page screenshots, to answer questions from the Natural Questions dataset.
In such a text-intensive document retrieval setting, DSE shows competitive effectiveness compared to other text retrieval methods relying on parsing.
arXiv Detail & Related papers (2024-06-17T06:27:35Z)
- In-context Pretraining: Language Modeling Beyond Document Boundaries [137.53145699439898]
In-Context Pretraining is a new approach where language models are pretrained on a sequence of related documents.
We introduce approximate algorithms for finding related documents with efficient nearest neighbor search.
We see notable improvements in tasks that require more complex contextual reasoning.
arXiv Detail & Related papers (2023-10-16T17:57:12Z)
- DAPR: A Benchmark on Document-Aware Passage Retrieval [57.45793782107218]
We propose and name this task Document-Aware Passage Retrieval (DAPR).
Analyzing the errors of state-of-the-art (SoTA) passage retrievers, we find that the majority of errors (53.5%) are due to missing document context.
The benchmark we created enables future research on developing and comparing retrieval systems for this new task.
arXiv Detail & Related papers (2023-05-23T10:39:57Z)
- LeQua@CLEF2022: Learning to Quantify [76.22817970624875]
LeQua 2022 is a new lab for the evaluation of methods for "learning to quantify" in textual datasets.
The goal of this lab is to provide a setting for the comparative evaluation of methods for learning to quantify, both in the binary setting and in the single-label multiclass setting.
arXiv Detail & Related papers (2021-11-22T14:54:20Z)
- Learning to Select Bi-Aspect Information for Document-Scale Text Content Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer.
In detail, the input is a set of structured records and a reference text for describing another recordset.
The output is a summary that accurately describes the partial content of the source recordset in the same writing style as the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)