Related papers: Lbl2Vec: An Embedding-Based Approach for Unsupervised Document Retrieval on Predefined Topics

Lbl2Vec: An Embedding-Based Approach for Unsupervised Document Retrieval on Predefined Topics

URL: http://arxiv.org/abs/2210.06023v1
Date: Wed, 12 Oct 2022 08:57:01 GMT
Title: Lbl2Vec: An Embedding-Based Approach for Unsupervised Document Retrieval on Predefined Topics
Authors: Tim Schopf, Daniel Braun, Florian Matthes
Abstract summary: We introduce a method that learns jointly embedded document and word vectors solely from the unlabeled document dataset. The proposed method requires almost no text preprocessing but is simultaneously effective at retrieving relevant documents with high probability. For easy replication of our approach, we make the developed Lbl2Vec code publicly available as a ready-to-use tool under the 3-Clause BSD license.
Score: 0.6767885381740952
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this paper, we consider the task of retrieving documents with predefined topics from an unlabeled document dataset using an unsupervised approach. The proposed unsupervised approach requires only a small number of keywords describing the respective topics and no labeled document. Existing approaches either heavily relied on a large amount of additionally encoded world knowledge or on term-document frequencies. Contrariwise, we introduce a method that learns jointly embedded document and word vectors solely from the unlabeled document dataset in order to find documents that are semantically similar to the topics described by the keywords. The proposed method requires almost no text preprocessing but is simultaneously effective at retrieving relevant documents with high probability. When successively retrieving documents on different predefined topics from publicly available and commonly used datasets, we achieved an average area under the receiver operating characteristic curve value of 0.95 on one dataset and 0.92 on another. Further, our method can be used for multiclass document classification, without the need to assign labels to the dataset in advance. Compared with an unsupervised classification baseline, we increased F1 scores from 76.6 to 82.7 and from 61.0 to 75.1 on the respective datasets. For easy replication of our approach, we make the developed Lbl2Vec code publicly available as a ready-to-use tool under the 3-Clause BSD license.

Related papers

Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings. First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss. Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z)
Document Type Classification using File Names [7.130525292849283]
Rapid document classification is critical in several time-sensitive applications like digital forensics and large-scale media classification. Traditional approaches that rely on heavy-duty deep learning models fall short due to high inference times over vast input datasets. We present a method using lightweight supervised learning models, combined with a TF-IDF feature extraction-based tokenization method.
arXiv Detail & Related papers (2024-10-02T01:42:19Z)
Unifying Multimodal Retrieval via Document Screenshot Embedding [92.03571344075607]
Document Screenshot Embedding (DSE) is a novel retrieval paradigm that regards document screenshots as a unified input format. We first craft the dataset of Wiki-SS, a 1.3M Wikipedia web page screenshots as the corpus to answer the questions from the Natural Questions dataset. In such a text-intensive document retrieval setting, DSE shows competitive effectiveness compared to other text retrieval methods relying on parsing.
arXiv Detail & Related papers (2024-06-17T06:27:35Z)
Multimodal Tree Decoder for Table of Contents Extraction in Document Images [32.46909366312659]
Table of contents (ToC) extraction aims to extract headings of different levels in documents to better understand the outline of the contents. We first introduce a standard dataset, HierDoc, including image samples from 650 documents of scientific papers with their content labels. We propose a novel end-to-end model by using the multimodal tree decoder (MTD) for ToC as a benchmark for HierDoc.
arXiv Detail & Related papers (2022-12-06T11:38:31Z)
Improving Probabilistic Models in Text Classification via Active Learning [0.0]
We propose a fast new model for text classification that combines information from both labeled and unlabeled data with an active learning component. We show that by introducing information about the structure of unlabeled data and iteratively labeling uncertain documents, our model improves performance.
arXiv Detail & Related papers (2022-02-05T20:09:26Z)
Out-of-Category Document Identification Using Target-Category Names as Weak Supervision [64.671654559798]
Out-of-category detection aims to distinguish documents according to their semantic relevance to the inlier (or target) categories. We present an out-of-category detection framework, which effectively measures how confidently each document belongs to one of the target categories.
arXiv Detail & Related papers (2021-11-24T21:01:25Z)
LeQua@CLEF2022: Learning to Quantify [76.22817970624875]
LeQua 2022 is a new lab for the evaluation of methods for learning to quantify'' in textual datasets. The goal of this lab is to provide a setting for the comparative evaluation of methods for learning to quantify, both in the binary setting and in the single-label multiclass setting.
arXiv Detail & Related papers (2021-11-22T14:54:20Z)
MATCH: Metadata-Aware Text Classification in A Large Hierarchy [60.59183151617578]
MATCH is an end-to-end framework that leverages both metadata and hierarchy information. We propose different ways to regularize the parameters and output probability of each child label by its parents. Experiments on two massive text datasets with large-scale label hierarchies demonstrate the effectiveness of MATCH.
arXiv Detail & Related papers (2021-02-15T05:23:08Z)
Multilevel Text Alignment with Cross-Document Attention [59.76351805607481]
Existing alignment methods operate at a single, predefined level. We propose a new learning approach that equips previously established hierarchical attention encoders for representing documents with a cross-document attention component.
arXiv Detail & Related papers (2020-10-03T02:52:28Z)
Document Network Projection in Pretrained Word Embedding Space [7.455546102930911]
We present Regularized Linear Embedding (RLE), a novel method that projects a collection of linked documents into a pretrained word embedding space. We leverage a matrix of pairwise similarities providing complementary information (e.g., the network proximity of two documents in a citation graph) The document representations can help to solve many information retrieval tasks, such as recommendation, classification and clustering.
arXiv Detail & Related papers (2020-01-16T10:16:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.