Unsupervised Document Embedding via Contrastive Augmentation
- URL: http://arxiv.org/abs/2103.14542v1
- Date: Fri, 26 Mar 2021 15:48:52 GMT
- Title: Unsupervised Document Embedding via Contrastive Augmentation
- Authors: Dongsheng Luo, Wei Cheng, Jingchao Ni, Wenchao Yu, Xuchao Zhang, Bo
Zong, Yanchi Liu, Zhengzhang Chen, Dongjin Song, Haifeng Chen, Xiang Zhang
- Abstract summary: We present a contrastive learning approach with data augmentation techniques to learn document representations in an unsupervised manner.
Inspired by recent contrastive self-supervised learning algorithms used for image and NLP pretraining, we hypothesize that a high-quality document embedding should be invariant to diverse paraphrases.
Our method can decrease the classification error rate by up to 6.4% over the SOTA approaches on the document classification task, matching or even surpassing fully-supervised methods.
- Score: 48.71917352110245
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a contrastive learning approach with data augmentation techniques
to learn document representations in an unsupervised manner. Inspired by recent
contrastive self-supervised learning algorithms used for image and NLP
pretraining, we hypothesize that high-quality document embedding should be
invariant to diverse paraphrases that preserve the semantics of the original
document. With different backbones and contrastive learning frameworks, our
study reveals the enormous benefits of contrastive augmentation for document
representation learning with two additional insights: 1) including data
augmentation in a contrastive way can substantially improve the embedding
quality in unsupervised document representation learning, and 2) in general,
stochastic augmentations generated by simple word-level manipulation work much
better than sentence-level and document-level ones. We plug our method into a
classifier and compare it with a broad range of baseline methods on six
benchmark datasets. Our method can decrease the classification error rate by up
to 6.4% over the SOTA approaches on the document classification task, matching
or even surpassing fully-supervised methods.
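The abstract's key idea, that two stochastic word-level augmentations of the same document form a positive pair for contrastive learning, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation; the augmentation probabilities and the `word_level_augment` / `make_contrastive_pairs` helpers are assumptions for demonstration.

```python
import random

def word_level_augment(doc, p_delete=0.1, p_swap=0.1, rng=None):
    """Simple word-level stochastic augmentation: random word deletion
    plus random adjacent-word swaps, intended to paraphrase the document
    while preserving its overall semantics."""
    rng = rng or random.Random(0)
    words = doc.split()
    # Random deletion: drop each word with probability p_delete
    # (keep at least one word so the document never becomes empty).
    kept = [w for w in words if rng.random() > p_delete] or words[:1]
    # Random adjacent swap: exchange neighboring words with probability p_swap.
    out = kept[:]
    for i in range(len(out) - 1):
        if rng.random() < p_swap:
            out[i], out[i + 1] = out[i + 1], out[i]
    return " ".join(out)

def make_contrastive_pairs(docs, seed=0):
    """Two independent augmentations of the same document form a positive
    pair; in a contrastive objective, augmentations of different documents
    in the batch serve as negatives."""
    rng = random.Random(seed)
    return [(word_level_augment(d, rng=rng), word_level_augment(d, rng=rng))
            for d in docs]
```

In a full pipeline, each pair would be encoded by the backbone and pulled together by a contrastive loss (e.g. InfoNCE), while embeddings of other documents in the batch are pushed apart.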
Related papers
- Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings.
First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss.
Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z)
- DECDM: Document Enhancement using Cycle-Consistent Diffusion Models [3.3813766129849845]
We propose DECDM, an end-to-end document-level image translation method inspired by recent advances in diffusion models.
Our method overcomes the limitations of paired training by independently training the source (noisy input) and target (clean output) models.
We also introduce simple data augmentation strategies to improve character-glyph conservation during translation.
arXiv Detail & Related papers (2023-11-16T07:16:02Z)
- Towards Unsupervised Recognition of Token-level Semantic Differences in Related Documents [61.63208012250885]
We formulate recognizing semantic differences as a token-level regression task.
We study three unsupervised approaches that rely on a masked language model.
Our results show that an approach based on word alignment and sentence-level contrastive learning has a robust correlation to gold labels.
arXiv Detail & Related papers (2023-05-22T17:58:04Z)
- A Simplified Framework for Contrastive Learning for Node Representations [2.277447144331876]
We investigate the potential of deploying contrastive learning in combination with Graph Neural Networks for embedding nodes in a graph.
We show that the quality of the resulting embeddings and training time can be significantly improved by a simple column-wise postprocessing of the embedding matrix.
This modification yields improvements in downstream classification tasks of up to 1.5% and even beats existing state-of-the-art approaches on 6 out of 8 different benchmarks.
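The column-wise postprocessing mentioned above can be illustrated with a minimal sketch. Note this is an assumption for demonstration only: the summary does not specify the exact transform, so per-column standardization (zero mean, unit variance per embedding dimension) is used here as a representative example.

```python
def columnwise_standardize(emb):
    """Illustrative column-wise postprocessing of an embedding matrix:
    center and scale each embedding dimension (column) to zero mean and
    unit variance. The exact transform used in the paper may differ."""
    n = len(emb)
    dims = len(emb[0])
    # Per-column mean and standard deviation over all rows (nodes).
    means = [sum(row[j] for row in emb) / n for j in range(dims)]
    stds = [(sum((row[j] - means[j]) ** 2 for row in emb) / n) ** 0.5 + 1e-8
            for j in range(dims)]
    # Standardize each entry by its column statistics.
    return [[(row[j] - means[j]) / stds[j] for j in range(dims)] for row in emb]
```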
arXiv Detail & Related papers (2023-05-01T02:04:36Z)
- Differentiable Data Augmentation for Contrastive Sentence Representation Learning [6.398022050054328]
The proposed method yields significant improvements over existing methods under both semi-supervised and supervised settings.
Our experiments under a low labeled data setting also show that our method is more label-efficient than the state-of-the-art contrastive learning methods.
arXiv Detail & Related papers (2022-10-29T08:57:45Z)
- Constructing Contrastive samples via Summarization for Text Classification with limited annotations [46.53641181501143]
We propose a novel approach to constructing contrastive samples for language tasks using text summarization.
We use these samples for supervised contrastive learning to gain better text representations with limited annotations.
Experiments on real-world text classification datasets (Amazon-5, Yelp-5, AG News) demonstrate the effectiveness of the proposed contrastive learning framework.
arXiv Detail & Related papers (2021-04-11T20:13:24Z)
- Multilevel Text Alignment with Cross-Document Attention [59.76351805607481]
Existing alignment methods operate at a single, predefined level.
We propose a new learning approach that equips previously established hierarchical attention encoders for representing documents with a cross-document attention component.
arXiv Detail & Related papers (2020-10-03T02:52:28Z)
- SPECTER: Document-level Representation Learning using Citation-informed Transformers [51.048515757909215]
SPECTER generates document-level embedding of scientific documents based on pretraining a Transformer language model.
We introduce SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction to document classification and recommendation.
arXiv Detail & Related papers (2020-04-15T16:05:51Z) - Learning to Select Bi-Aspect Information for Document-Scale Text Content
Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer.
In detail, the input is a set of structured records and a reference text for describing another recordset.
The output is a summary that accurately describes the partial content in the source recordset with the same writing style of the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.