Method for Customizable Automated Tagging: Addressing the Problem of
Over-tagging and Under-tagging Text Documents
- URL: http://arxiv.org/abs/2005.00042v1
- Date: Thu, 30 Apr 2020 18:28:42 GMT
- Authors: Maharshi R. Pandya, Jessica Reyes, Bob Vanderheyden
- Abstract summary: Using author-provided tags to predict tags for a new document often results in the overgeneration of tags.
In this paper, we present a method to generate a universal set of tags that can be applied widely to a large document corpus.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Using author-provided tags to predict tags for a new document often results
in the overgeneration of tags. When the author provides no
tags, documents face a severe under-tagging problem. In this paper, we
present a method to generate a universal set of tags that can be applied widely
to a large document corpus. Using IBM Watson's NLU service, first, we collect
keywords/phrases that we call "complex document tags" from 8,854 popular
reports in the corpus. We apply an LDA model over these complex document tags to
generate a set of 765 unique "simple tags". To apply the tags to a corpus of
documents, we run each document through IBM Watson NLU and assign the
appropriate simple tags. Using only 765 simple tags, our method allows us to
tag 87,397 out of 88,583 total documents in the corpus with at least one tag.
About 92.1% of those 87,397 documents are also determined to be
sufficiently tagged. In the end, we discuss the performance of our method and
its limitations.
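The tagging step and the coverage figure in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the keyword lists stand in for the output of a keyword extractor, `SIMPLE_TAGS` stands in for the 765-tag vocabulary, and the word-overlap lookup is a simplification that replaces both the IBM Watson NLU call and the LDA step, neither of which is reproduced.

```python
# Hypothetical sketch: a plain word-overlap lookup, not the paper's LDA/NLU pipeline.
SIMPLE_TAGS = {"cloud", "security", "pricing"}  # assumed stand-in vocabulary


def assign_simple_tags(keywords, vocabulary):
    """Return the simple tags that occur as a word in any keyword phrase."""
    words = {w for phrase in keywords for w in phrase.lower().split()}
    return sorted(words & vocabulary)


def coverage(tagged, total):
    """Fraction of documents that received at least one tag."""
    return tagged / total


# Made-up documents with made-up extracted keyword phrases.
docs = {
    "doc-a": ["Cloud security audit", "pricing model"],
    "doc-b": ["quarterly staffing"],
}
assigned = {doc_id: assign_simple_tags(kws, SIMPLE_TAGS) for doc_id, kws in docs.items()}
tagged_count = sum(1 for tags in assigned.values() if tags)

# Counts reported in the abstract: 87,397 of 88,583 documents tagged.
reported = coverage(87_397, 88_583)
print(assigned)
print(f"{reported:.1%}")
```

With the counts from the abstract, the coverage works out to roughly 98.7% of the corpus receiving at least one tag, leaving 1,186 documents untagged.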
Related papers
- Magic Markup: Maintaining Document-External Markup with an LLM [1.0538052824177144]
We present a system that re-tags modified programs, enabling rich annotations to automatically follow code as it evolves.
Our system achieves an accuracy of 90% on our benchmarks and can replace a document's tags in parallel at a rate of 5 seconds per tag.
While there remains significant room for improvement, we find performance reliable enough to justify further exploration of applications.
arXiv Detail & Related papers (2024-03-06T05:40:31Z)
- Weakly-Supervised Scientific Document Classification via Retrieval-Augmented Multi-Stage Training [24.2734548438594]
We propose a weakly-supervised approach for scientific document classification using label names only.
In scientific domains, label names often include domain-specific concepts that may not appear in the document corpus.
We show that WANDER outperforms the best baseline by 11.9% on average.
arXiv Detail & Related papers (2023-06-12T15:50:13Z)
- Document Layout Annotation: Database and Benchmark in the Domain of Public Affairs [62.38140271294419]
We propose a procedure to semi-automatically annotate digital documents with different layout labels.
We collect a novel database for DLA in the public affairs domain using a set of 24 data sources from the Spanish Administration.
The results of our experiments validate the proposed text labeling procedure with accuracy up to 99%.
arXiv Detail & Related papers (2023-06-12T08:21:50Z)
- DAPR: A Benchmark on Document-Aware Passage Retrieval [57.45793782107218]
We propose and name this task Document-Aware Passage Retrieval (DAPR).
While analyzing the errors of the State-of-The-Art (SoTA) passage retrievers, we find the major errors (53.5%) are due to missing document context.
Our created benchmark enables future research on developing and comparing retrieval systems for the new task.
arXiv Detail & Related papers (2023-05-23T10:39:57Z)
- Lbl2Vec: An Embedding-Based Approach for Unsupervised Document Retrieval on Predefined Topics [0.6767885381740952]
We introduce a method that learns jointly embedded document and word vectors solely from the unlabeled document dataset.
The proposed method requires almost no text preprocessing but is simultaneously effective at retrieving relevant documents with high probability.
For easy replication of our approach, we make the developed Lbl2Vec code publicly available as a ready-to-use tool under the 3-Clause BSD license.
arXiv Detail & Related papers (2022-10-12T08:57:01Z)
- Open Set Classification of Untranscribed Handwritten Documents [56.0167902098419]
Huge amounts of digital page images of important manuscripts are preserved in archives worldwide.
The class or "typology" of a document is perhaps the most important tag to be included in the metadata.
The technical problem is one of automatic classification of documents, each consisting of a set of untranscribed handwritten text images.
arXiv Detail & Related papers (2022-06-20T20:43:50Z)
- Improving Probabilistic Models in Text Classification via Active Learning [0.0]
We propose a fast new model for text classification that combines information from both labeled and unlabeled data with an active learning component.
We show that by introducing information about the structure of unlabeled data and iteratively labeling uncertain documents, our model improves performance.
arXiv Detail & Related papers (2022-02-05T20:09:26Z)
- Out-of-Category Document Identification Using Target-Category Names as Weak Supervision [64.671654559798]
Out-of-category detection aims to distinguish documents according to their semantic relevance to the inlier (or target) categories.
We present an out-of-category detection framework, which effectively measures how confidently each document belongs to one of the target categories.
arXiv Detail & Related papers (2021-11-24T21:01:25Z)
- SenTag: a Web-based Tool for Semantic Annotation of Textual Documents [4.910379177401659]
SenTag is a web-based tool focused on semantic annotation of textual documents.
The main goal of the application is two-fold: facilitating the tagging process and reducing or avoiding errors in the output documents.
It is also possible to assess the level of agreement of annotators working on a corpus of text.
arXiv Detail & Related papers (2021-09-16T08:39:33Z)
- MATCH: Metadata-Aware Text Classification in A Large Hierarchy [60.59183151617578]
MATCH is an end-to-end framework that leverages both metadata and hierarchy information.
We propose different ways to regularize the parameters and output probability of each child label by its parents.
Experiments on two massive text datasets with large-scale label hierarchies demonstrate the effectiveness of MATCH.
arXiv Detail & Related papers (2021-02-15T05:23:08Z)
- Multilevel Text Alignment with Cross-Document Attention [59.76351805607481]
Existing alignment methods operate at a single, predefined level.
We propose a new learning approach that equips previously established hierarchical attention encoders for representing documents with a cross-document attention component.
arXiv Detail & Related papers (2020-10-03T02:52:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.