Lbl2Vec: An Embedding-Based Approach for Unsupervised Document Retrieval
on Predefined Topics
- URL: http://arxiv.org/abs/2210.06023v1
- Date: Wed, 12 Oct 2022 08:57:01 GMT
- Title: Lbl2Vec: An Embedding-Based Approach for Unsupervised Document Retrieval
on Predefined Topics
- Authors: Tim Schopf, Daniel Braun, Florian Matthes
- Abstract summary: We introduce a method that learns jointly embedded document and word vectors solely from the unlabeled document dataset.
The proposed method requires almost no text preprocessing but is simultaneously effective at retrieving relevant documents with high probability.
For easy replication of our approach, we make the developed Lbl2Vec code publicly available as a ready-to-use tool under the 3-Clause BSD license.
- Score: 0.6767885381740952
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we consider the task of retrieving documents with predefined
topics from an unlabeled document dataset using an unsupervised approach. The
proposed unsupervised approach requires only a small number of keywords
describing the respective topics and no labeled document. Existing approaches
either heavily relied on a large amount of additionally encoded world knowledge
or on term-document frequencies. Contrariwise, we introduce a method that
learns jointly embedded document and word vectors solely from the unlabeled
document dataset in order to find documents that are semantically similar to
the topics described by the keywords. The proposed method requires almost no
text preprocessing but is simultaneously effective at retrieving relevant
documents with high probability. When successively retrieving documents on
different predefined topics from publicly available and commonly used datasets,
we achieved an average area under the receiver operating characteristic curve
value of 0.95 on one dataset and 0.92 on another. Further, our method can be
used for multiclass document classification, without the need to assign labels
to the dataset in advance. Compared with an unsupervised classification
baseline, we increased F1 scores from 76.6 to 82.7 and from 61.0 to 75.1 on the
respective datasets. For easy replication of our approach, we make the
developed Lbl2Vec code publicly available as a ready-to-use tool under the
3-Clause BSD license.
Related papers
- Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings.
First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss.
Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z) - Document Type Classification using File Names [7.130525292849283]
Rapid document classification is critical in several time-sensitive applications like digital forensics and large-scale media classification.
Traditional approaches that rely on heavy-duty deep learning models fall short due to high inference times over vast input datasets.
We present a method using lightweight supervised learning models, combined with a TF-IDF feature extraction-based tokenization method.
arXiv Detail & Related papers (2024-10-02T01:42:19Z) - In-context Pretraining: Language Modeling Beyond Document Boundaries [137.53145699439898]
In-Context Pretraining is a new approach where language models are pretrained on a sequence of related documents.
We introduce approximate algorithms for finding related documents with efficient nearest neighbor search.
We see notable improvements in tasks that require more complex contextual reasoning.
arXiv Detail & Related papers (2023-10-16T17:57:12Z) - Multimodal Tree Decoder for Table of Contents Extraction in Document
Images [32.46909366312659]
Table of contents (ToC) extraction aims to extract headings of different levels in documents to better understand the outline of the contents.
We first introduce a standard dataset, HierDoc, including image samples from 650 documents of scientific papers with their content labels.
We propose a novel end-to-end model by using the multimodal tree decoder (MTD) for ToC as a benchmark for HierDoc.
arXiv Detail & Related papers (2022-12-06T11:38:31Z) - Improving Probabilistic Models in Text Classification via Active
Learning [0.0]
We propose a fast new model for text classification that combines information from both labeled and unlabeled data with an active learning component.
We show that by introducing information about the structure of unlabeled data and iteratively labeling uncertain documents, our model improves performance.
arXiv Detail & Related papers (2022-02-05T20:09:26Z) - Out-of-Category Document Identification Using Target-Category Names as
Weak Supervision [64.671654559798]
Out-of-category detection aims to distinguish documents according to their semantic relevance to the inlier (or target) categories.
We present an out-of-category detection framework, which effectively measures how confidently each document belongs to one of the target categories.
arXiv Detail & Related papers (2021-11-24T21:01:25Z) - LeQua@CLEF2022: Learning to Quantify [76.22817970624875]
LeQua 2022 is a new lab for the evaluation of methods for learning to quantify'' in textual datasets.
The goal of this lab is to provide a setting for the comparative evaluation of methods for learning to quantify, both in the binary setting and in the single-label multiclass setting.
arXiv Detail & Related papers (2021-11-22T14:54:20Z) - MATCH: Metadata-Aware Text Classification in A Large Hierarchy [60.59183151617578]
MATCH is an end-to-end framework that leverages both metadata and hierarchy information.
We propose different ways to regularize the parameters and output probability of each child label by its parents.
Experiments on two massive text datasets with large-scale label hierarchies demonstrate the effectiveness of MATCH.
arXiv Detail & Related papers (2021-02-15T05:23:08Z) - Multilevel Text Alignment with Cross-Document Attention [59.76351805607481]
Existing alignment methods operate at a single, predefined level.
We propose a new learning approach that equips previously established hierarchical attention encoders for representing documents with a cross-document attention component.
arXiv Detail & Related papers (2020-10-03T02:52:28Z) - Document Network Projection in Pretrained Word Embedding Space [7.455546102930911]
We present Regularized Linear Embedding (RLE), a novel method that projects a collection of linked documents into a pretrained word embedding space.
We leverage a matrix of pairwise similarities providing complementary information (e.g., the network proximity of two documents in a citation graph)
The document representations can help to solve many information retrieval tasks, such as recommendation, classification and clustering.
arXiv Detail & Related papers (2020-01-16T10:16:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.