Pairwise Multi-Class Document Classification for Semantic Relations
between Wikipedia Articles
- URL: http://arxiv.org/abs/2003.09881v1
- Date: Sun, 22 Mar 2020 12:52:56 GMT
- Title: Pairwise Multi-Class Document Classification for Semantic Relations
between Wikipedia Articles
- Authors: Malte Ostendorff, Terry Ruas, Moritz Schubotz, Georg Rehm, Bela Gipp
- Abstract summary: We model the problem of finding the relationship between two documents as a pairwise document classification task.
To find the semantic relation between documents, we apply a series of techniques, such as GloVe, Paragraph-Vectors, BERT, and XLNet.
We perform our experiments on a newly proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that define the semantic document relations.
- Score: 5.40541521227338
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many digital libraries recommend literature to their users considering the
similarity between a query document and their repository. However, they often
fail to distinguish which relationship makes two documents alike. In
this paper, we model the problem of finding the relationship between two
documents as a pairwise document classification task. To find the semantic
relation between documents, we apply a series of techniques, such as GloVe,
Paragraph-Vectors, BERT, and XLNet under different configurations (e.g.,
sequence length, vector concatenation scheme), including a Siamese architecture
for the Transformer-based systems. We perform our experiments on a newly
proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that
define the semantic document relations. Our results show vanilla BERT as the
best performing system with an F1-score of 0.93, which we manually examine to
better understand its applicability to other domains. Our findings suggest that
classifying semantic relations between documents is a solvable task and
motivate the development of recommender systems based on the evaluated
techniques. The discussions in this paper serve as first steps in the
exploration of documents through SPARQL-like queries such that one could find
documents that are similar in one aspect but dissimilar in another.
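The abstract mentions "vector concatenation scheme" as one of the varied configurations but does not spell out the schemes. The following is a minimal sketch of how two document vectors might be combined into a single feature vector for a pairwise classifier. The function name `combine_pair`, the scheme names, and the toy vectors are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence


def combine_pair(u: Sequence[float], v: Sequence[float],
                 scheme: str = "concat") -> List[float]:
    """Combine two document vectors into one pairwise feature vector.

    Hypothetical helper: the scheme names below are assumptions chosen
    for illustration, not taken from the paper.
    """
    if len(u) != len(v):
        raise ValueError("document vectors must have the same dimensionality")

    abs_diff = [abs(a - b) for a, b in zip(u, v)]  # symmetric in (u, v)
    product = [a * b for a, b in zip(u, v)]        # symmetric in (u, v)

    if scheme == "concat":
        # Order-sensitive: (u, v) and (v, u) give different features.
        return list(u) + list(v)
    if scheme == "abs_diff":
        return abs_diff
    if scheme == "product":
        return product
    if scheme == "concat_abs_diff_product":
        return list(u) + list(v) + abs_diff + product
    raise ValueError(f"unknown scheme: {scheme}")


# Toy 3-dimensional "document embeddings" (made-up values).
doc_a = [0.5, -1.0, 2.0]
doc_b = [1.5, 1.0, 2.0]

features = combine_pair(doc_a, doc_b, "concat_abs_diff_product")
print(len(features))
```

The combined vector would then be passed to a multi-class classifier whose labels are the semantic relations. Plain concatenation is order-sensitive, which may matter when a relation is directed.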
Related papers
- Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings.
First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss.
Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z)
- Generative Retrieval Meets Multi-Graded Relevance [104.75244721442756]
We introduce a framework called GRaded Generative Retrieval (GR$^2$).
GR$^2$ focuses on two key components: ensuring relevant and distinct identifiers, and implementing multi-graded constrained contrastive training.
Experiments on datasets with both multi-graded and binary relevance demonstrate the effectiveness of GR$^2$.
arXiv Detail & Related papers (2024-09-27T02:55:53Z)
- PDFTriage: Question Answering over Long, Structured Documents [60.96667912964659]
Representing structured documents as plain text is incongruous with the user's mental model of these documents with rich structure.
We propose PDFTriage, which enables models to retrieve context based on either structure or content.
Our benchmark dataset consists of 900+ human-generated questions over 80 structured documents.
arXiv Detail & Related papers (2023-09-16T04:29:05Z)
- DAPR: A Benchmark on Document-Aware Passage Retrieval [57.45793782107218]
We propose and name this task Document-Aware Passage Retrieval (DAPR).
While analyzing the errors of state-of-the-art (SoTA) passage retrievers, we find that the major errors (53.5%) are due to missing document context.
Our created benchmark enables future research on developing and comparing retrieval systems for the new task.
arXiv Detail & Related papers (2023-05-23T10:39:57Z)
- Cross-Modal Entity Matching for Visually Rich Documents [4.8119678510491815]
Visually rich documents utilize visual cues to augment their semantics.
Existing works that enable structured querying on these documents do not take this into account.
We propose Juno -- a cross-modal entity matching framework to address this limitation.
arXiv Detail & Related papers (2023-03-01T18:26:14Z)
- CAPSTONE: Curriculum Sampling for Dense Retrieval with Document Expansion [68.19934563919192]
We propose a curriculum sampling strategy that utilizes pseudo queries during training and progressively enhances the relevance between the generated query and the real query.
Experimental results on both in-domain and out-of-domain datasets demonstrate that our approach outperforms previous dense retrieval models.
arXiv Detail & Related papers (2022-12-18T15:57:46Z)
- Learning Diverse Document Representations with Deep Query Interactions for Dense Retrieval [79.37614949970013]
We propose a new dense retrieval model which learns diverse document representations with deep query interactions.
Our model encodes each document with a set of generated pseudo-queries to get query-informed, multi-view document representations.
arXiv Detail & Related papers (2022-08-08T16:00:55Z)
- Specialized Document Embeddings for Aspect-based Similarity of Research Papers [4.661692753666685]
We treat aspect-based similarity as a classical vector similarity problem in aspect-specific embedding spaces.
We represent a document not as a single generic embedding but as multiple specialized embeddings.
Our approach mitigates potential risks arising from implicit biases by making them explicit.
arXiv Detail & Related papers (2022-03-28T07:35:26Z)
- Aspect-based Document Similarity for Research Papers [4.661692753666685]
We extend similarity with aspect information by performing a pairwise document classification task.
We evaluate our aspect-based document similarity for research papers.
Our results show SciBERT as the best performing system.
arXiv Detail & Related papers (2020-10-13T13:51:21Z)
- Document Network Projection in Pretrained Word Embedding Space [7.455546102930911]
We present Regularized Linear Embedding (RLE), a novel method that projects a collection of linked documents into a pretrained word embedding space.
We leverage a matrix of pairwise similarities providing complementary information (e.g., the network proximity of two documents in a citation graph).
The document representations can help to solve many information retrieval tasks, such as recommendation, classification and clustering.
arXiv Detail & Related papers (2020-01-16T10:16:37Z)