Specialized Document Embeddings for Aspect-based Similarity of Research Papers
- URL: http://arxiv.org/abs/2203.14541v1
- Date: Mon, 28 Mar 2022 07:35:26 GMT
- Title: Specialized Document Embeddings for Aspect-based Similarity of Research Papers
- Authors: Malte Ostendorff, Till Blume, Terry Ruas, Bela Gipp, Georg Rehm
- Abstract summary: We treat aspect-based similarity as a classical vector similarity problem in aspect-specific embedding spaces.
We represent a document not as a single generic embedding but as multiple specialized embeddings.
Our approach mitigates potential risks arising from implicit biases by making them explicit.
- Score: 4.661692753666685
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Document embeddings and similarity measures underpin content-based
recommender systems, whereby a document is commonly represented as a single
generic embedding. However, similarity computed on single vector
representations provides only one perspective on document similarity that
ignores which aspects make two documents alike. To address this limitation,
aspect-based similarity measures have been developed using document
segmentation or pairwise multi-class document classification. While
segmentation harms document coherence, the pairwise classification approach
scales poorly to large-scale corpora. In this paper, we treat aspect-based
similarity as a classical vector similarity problem in aspect-specific
embedding spaces. We represent a document not as a single generic embedding but
as multiple specialized embeddings. Our approach avoids document segmentation
and scales linearly w.r.t. the corpus size. In an empirical study, we use the
Papers with Code corpus containing 157,606 research papers and consider the
task, method, and dataset of the respective research papers as their aspects.
We compare and analyze three generic document embeddings, six specialized
document embeddings and a pairwise classification baseline in the context of
research paper recommendations. As generic document embeddings, we consider
FastText, SciBERT, and SPECTER. To compute the specialized document embeddings,
we compare three alternative methods inspired by retrofitting, fine-tuning, and
Siamese networks. In our experiments, Siamese SciBERT achieved the highest
scores. Additional analyses indicate an implicit bias of the generic document
embeddings towards the dataset aspect and against the method aspect of each
research paper. Our approach of aspect-based document embeddings mitigates
potential risks arising from implicit biases by making them explicit.
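The core idea in the abstract can be sketched in a few lines: each document carries one specialized embedding per aspect (task, method, dataset), and aspect-based similarity is ordinary cosine similarity computed within each aspect's embedding space. This is a minimal illustration, not the authors' implementation; the toy random vectors stand in for embeddings that would in practice come from an aspect-specialized encoder (e.g., a Siamese-tuned SciBERT).

```python
# Minimal sketch of aspect-based similarity: a document is a mapping
# from aspect name to a specialized embedding, and similarity is plain
# cosine similarity within each aspect-specific space.
import numpy as np

# Aspects used in the paper's empirical study.
ASPECTS = ("task", "method", "dataset")

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def aspect_similarities(doc_a: dict, doc_b: dict) -> dict:
    """Compare two documents aspect by aspect.

    Each document maps aspect name -> specialized embedding, so the
    comparison yields one similarity score per aspect instead of a
    single generic score.
    """
    return {aspect: cosine(doc_a[aspect], doc_b[aspect]) for aspect in ASPECTS}

# Hypothetical toy embeddings; real ones would be produced by one
# specialized encoder per aspect (retrofitting, fine-tuning, or a
# Siamese network, as compared in the paper).
rng = np.random.default_rng(0)
paper_1 = {a: rng.normal(size=8) for a in ASPECTS}
paper_2 = {a: rng.normal(size=8) for a in ASPECTS}

sims = aspect_similarities(paper_1, paper_2)
for aspect, score in sims.items():
    print(f"{aspect}: {score:+.3f}")
```

Because each aspect space is searched independently with standard nearest-neighbor retrieval, the cost grows linearly with corpus size, unlike the pairwise classification approach, which must score every document pair.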
Related papers
- Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings.
First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss.
Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z)
- Document Provenance and Authentication through Authorship Classification [5.2545206693029884]
We propose an ensemble-based text-processing framework for the classification of single and multi-authored documents.
The proposed framework incorporates several state-of-the-art text classification algorithms.
The framework is evaluated on a large-scale benchmark dataset.
arXiv Detail & Related papers (2023-03-02T12:26:03Z)
- Representing Mixtures of Word Embeddings with Mixtures of Topic Embeddings [46.324584649014284]
A topic model is often formulated as a generative model that explains how each word of a document is generated given a set of topics and document-specific topic proportions.
This paper introduces a new topic-modeling framework where each document is viewed as a set of word embedding vectors and each topic is modeled as an embedding vector in the same embedding space.
Embedding the words and topics in the same vector space, we define a method to measure the semantic difference between the embedding vectors of the words of a document and these of the topics, and optimize the topic embeddings to minimize the expected difference over all documents.
arXiv Detail & Related papers (2022-03-03T08:46:23Z)
- Coherence-Based Distributed Document Representation Learning for Scientific Documents [9.646001537050925]
We propose a coupled text pair embedding (CTPE) model to learn the representation of scientific documents.
We use negative sampling to construct uncoupled text pairs whose two parts are from different documents.
We train the model to judge whether the text pair is coupled or uncoupled and use the obtained embedding of coupled text pairs as the embedding of documents.
arXiv Detail & Related papers (2022-01-08T15:29:21Z)
- Out-of-Category Document Identification Using Target-Category Names as Weak Supervision [64.671654559798]
Out-of-category detection aims to distinguish documents according to their semantic relevance to the inlier (or target) categories.
We present an out-of-category detection framework, which effectively measures how confidently each document belongs to one of the target categories.
arXiv Detail & Related papers (2021-11-24T21:01:25Z)
- Weakly-Supervised Aspect-Based Sentiment Analysis via Joint Aspect-Sentiment Topic Embedding [71.2260967797055]
We propose a weakly-supervised approach for aspect-based sentiment analysis.
We learn ⟨sentiment, aspect⟩ joint topic embeddings in the word embedding space.
We then use neural models to generalize the word-level discriminative information.
arXiv Detail & Related papers (2020-10-13T21:33:24Z)
- Aspect-based Document Similarity for Research Papers [4.661692753666685]
We extend similarity with aspect information by performing a pairwise document classification task.
We evaluate our aspect-based document similarity for research papers.
Our results show SciBERT as the best performing system.
arXiv Detail & Related papers (2020-10-13T13:51:21Z)
- Multilevel Text Alignment with Cross-Document Attention [59.76351805607481]
Existing alignment methods operate at a single, predefined level.
We propose a new learning approach that equips previously established hierarchical attention encoders for representing documents with a cross-document attention component.
arXiv Detail & Related papers (2020-10-03T02:52:28Z)
- Fast(er) Reconstruction of Shredded Text Documents via Self-Supervised Deep Asymmetric Metric Learning [62.34197797857823]
A central problem in automatic reconstruction of shredded documents is the pairwise compatibility evaluation of the shreds.
This work proposes a scalable deep learning approach for measuring pairwise compatibility in which the number of inferences scales linearly.
Our method has accuracy comparable to the state-of-the-art with a speed-up of about 22 times for a test instance with 505 shreds.
arXiv Detail & Related papers (2020-03-23T03:22:06Z)
- Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles [5.40541521227338]
We model the problem of finding the relationship between two documents as a pairwise document classification task.
To find semantic relations between documents, we apply a series of techniques, such as GloVe, paragraph vectors, BERT, and XLNet.
We perform our experiments on a newly proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that define the semantic document relations.
arXiv Detail & Related papers (2020-03-22T12:52:56Z)
- Document Network Projection in Pretrained Word Embedding Space [7.455546102930911]
We present Regularized Linear Embedding (RLE), a novel method that projects a collection of linked documents into a pretrained word embedding space.
We leverage a matrix of pairwise similarities providing complementary information (e.g., the network proximity of two documents in a citation graph)
The document representations can help to solve many information retrieval tasks, such as recommendation, classification and clustering.
arXiv Detail & Related papers (2020-01-16T10:16:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.