Aspect-based Document Similarity for Research Papers
- URL: http://arxiv.org/abs/2010.06395v1
- Date: Tue, 13 Oct 2020 13:51:21 GMT
- Title: Aspect-based Document Similarity for Research Papers
- Authors: Malte Ostendorff, Terry Ruas, Till Blume, Bela Gipp, Georg Rehm
- Abstract summary: We extend similarity with aspect information by performing a pairwise document classification task.
We evaluate our aspect-based document similarity for research papers.
Our results show SciBERT as the best performing system.
- Score: 4.661692753666685
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Traditional document similarity measures provide a coarse-grained distinction
between similar and dissimilar documents. Typically, they do not consider in
what aspects two documents are similar. This limits the granularity of
applications like recommender systems that rely on document similarity. In this
paper, we extend similarity with aspect information by performing a pairwise
document classification task. We evaluate our aspect-based document similarity
for research papers. Paper citations indicate the aspect-based similarity,
i.e., the section title in which a citation occurs acts as a label for the pair
of citing and cited paper. We apply a series of Transformer models such as
RoBERTa, ELECTRA, XLNet, and BERT variations and compare them to an LSTM
baseline. We perform our experiments on two newly constructed datasets of
172,073 research paper pairs from the ACL Anthology and CORD-19 corpus. Our
results show SciBERT as the best performing system. A qualitative examination
validates our quantitative results. Our findings motivate future research of
aspect-based document similarity and the development of a recommender system
based on the evaluated techniques. We make our datasets, code, and trained
models publicly available.
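The labeling scheme described in the abstract, where the section title in which a citation occurs labels the (citing, cited) pair, can be sketched as follows. This is a minimal illustration, not the authors' released pipeline: the record format `(citing_id, cited_id, section_title)`, the function names `normalize_section` and `build_labeled_pairs`, and the title normalization are all assumptions made for the example. A pair cited in several sections collects several labels here; the abstract does not say how the authors treat that case.

```python
import re
from collections import defaultdict


def normalize_section(title):
    """Hypothetical normalization: drop leading numbering and lowercase.

    Example: '2. Related Work' becomes 'related work'.
    """
    return re.sub(r"^[\d.\s]+", "", title).strip().lower()


def build_labeled_pairs(citations):
    """Group citation records into labeled document pairs.

    `citations` is an iterable of (citing_id, cited_id, section_title)
    tuples (an assumed input format). Returns a dict mapping each
    (citing_id, cited_id) pair to a sorted list of section labels.
    """
    pairs = defaultdict(set)
    for citing, cited, section in citations:
        pairs[(citing, cited)].add(normalize_section(section))
    return {pair: sorted(labels) for pair, labels in pairs.items()}


# Toy records with made-up paper IDs.
citations = [
    ("A", "B", "1 Introduction"),
    ("A", "B", "2. Related Work"),
    ("A", "C", "Introduction"),
    ("A", "B", "1 Introduction"),  # duplicate citation in the same section
]
labeled = build_labeled_pairs(citations)
```

Each resulting pair and its label set could then serve as one training example for a pairwise classifier such as the Transformer models named in the abstract.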
Related papers
- Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings.
First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss.
Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
(2024-10-03T14:33:34Z)
- CausalCite: A Causal Formulation of Paper Citations [80.82622421055734]
CausalCite is a new way to measure the significance of a paper by assessing the causal impact of the paper on its follow-up papers.
It is based on a novel causal inference method, TextMatch, which adapts the traditional matching framework to high-dimensional text embeddings.
We demonstrate the effectiveness of CausalCite on various criteria, such as high correlation with paper impact as reported by scientific experts.
(2023-11-05T23:09:39Z)
- Learning Diverse Document Representations with Deep Query Interactions for Dense Retrieval [79.37614949970013]
We propose a new dense retrieval model which learns diverse document representations with deep query interactions.
Our model encodes each document with a set of generated pseudo-queries to get query-informed, multi-view document representations.
(2022-08-08T16:00:55Z)
- Document-Level Relation Extraction with Sentences Importance Estimation and Focusing [52.069206266557266]
Document-level relation extraction (DocRE) aims to determine the relation between two entities from a document of multiple sentences.
We propose a Sentence Importance Estimation and Focusing (SIEF) framework for DocRE, in which we design a sentence importance score and a sentence focusing loss.
Experimental results on two domains show that our SIEF not only improves overall performance, but also makes DocRE models more robust.
(2022-04-27T03:20:07Z)
- Specialized Document Embeddings for Aspect-based Similarity of Research Papers [4.661692753666685]
We treat aspect-based similarity as a classical vector similarity problem in aspect-specific embedding spaces.
We represent a document not as a single generic embedding but as multiple specialized embeddings.
Our approach mitigates potential risks arising from implicit biases by making them explicit.
(2022-03-28T07:35:26Z)
- Multi-Vector Models with Textual Guidance for Fine-Grained Scientific Document Similarity [11.157086694203201]
We present a new scientific document similarity model based on matching fine-grained aspects.
Our model is trained using co-citation contexts that describe related paper aspects as a novel form of textual supervision.
(2021-11-16T11:12:30Z)
- Eider: Evidence-enhanced Document-level Relation Extraction [56.71004595444816]
Document-level relation extraction (DocRE) aims at extracting semantic relations among entity pairs in a document.
We propose a three-stage evidence-enhanced DocRE framework consisting of joint relation and evidence extraction, evidence-centered relation extraction (RE), and fusion of extraction results.
(2021-06-16T09:43:16Z)
- Self-Supervised Document Similarity Ranking via Contextualized Language Models and Hierarchical Inference [21.232963704793143]
We introduce SDR, a self-supervised method for document similarity that can be applied to documents of arbitrary length.
SDR can be effectively applied to extremely long documents, including those exceeding Longformer's maximum of 4,096 tokens.
We publish two human-annotated test sets for evaluating similarity between long documents.
(2021-06-02T14:29:35Z)
- Methods for Computing Legal Document Similarity: A Comparative Study [9.007583099505954]
Finding similar legal documents is an important and challenging task in the domain of Legal Information Retrieval.
We propose two broad ways of measuring similarity between legal documents - analyzing the precedent citation network, and measuring similarity based on textual content similarity measures.
We explore two promising new similarity computation methods - one text-based and the other based on network embeddings, which have not been considered till now.
(2020-04-26T08:26:04Z)
- Fast(er) Reconstruction of Shredded Text Documents via Self-Supervised Deep Asymmetric Metric Learning [62.34197797857823]
A central problem in automatic reconstruction of shredded documents is the pairwise compatibility evaluation of the shreds.
This work proposes a scalable deep learning approach for measuring pairwise compatibility in which the number of inferences scales linearly.
Our method has accuracy comparable to the state-of-the-art with a speed-up of about 22 times for a test instance with 505 shreds.
(2020-03-23T03:22:06Z)
- Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles [5.40541521227338]
We model the problem of finding the relationship between two documents as a pairwise document classification task.
To find semantic relations between documents, we apply a series of techniques, such as GloVe, Paragraph-Vectors, BERT, and XLNet.
We perform our experiments on a newly proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that define the semantic document relations.
(2020-03-22T12:52:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.