SenTag: a Web-based Tool for Semantic Annotation of Textual Documents
- URL: http://arxiv.org/abs/2110.15062v1
- Date: Thu, 16 Sep 2021 08:39:33 GMT
- Title: SenTag: a Web-based Tool for Semantic Annotation of Textual Documents
- Authors: Andrea Loreggia, Simone Mosco, Alberto Zerbinati
- Abstract summary: SenTag is a web-based tool focused on semantic annotation of textual documents.
The main goal of the application is two-fold: facilitating the tagging process and reducing or avoiding errors in the output documents.
It is also possible to assess the level of agreement of annotators working on a corpus of text.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we present SenTag, a lightweight web-based tool focused on
semantic annotation of textual documents. The platform allows multiple users to
work on a corpus of documents. The tool enables users to tag a corpus of
documents through an intuitive and easy-to-use interface that adopts the
Extensible Markup Language (XML) as its output format. The main goal of the
application is two-fold: facilitating the tagging process and reducing or
avoiding errors in the output documents. Moreover, it allows users to identify
arguments and other entities that are used to build an argument graph. It is
also possible to
assess the level of agreement of annotators working on a corpus of text.
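The abstract does not say which agreement measure SenTag computes; a common choice for two annotators is Cohen's kappa, which corrects raw agreement for chance. Below is a minimal sketch under that assumption, using hypothetical argument-mining labels (CLAIM, PREMISE, O) purely for illustration:

```python
from collections import Counter

def cohens_kappa(ann_a, ann_b):
    """Cohen's kappa for two annotators' label sequences of equal length."""
    assert len(ann_a) == len(ann_b) and ann_a
    n = len(ann_a)
    # Observed agreement: fraction of items labeled identically.
    observed = sum(a == b for a, b in zip(ann_a, ann_b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    freq_a, freq_b = Counter(ann_a), Counter(ann_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical token-level annotations from two annotators.
a = ["CLAIM", "PREMISE", "CLAIM", "O", "O", "PREMISE"]
b = ["CLAIM", "PREMISE", "O",     "O", "O", "PREMISE"]
print(round(cohens_kappa(a, b), 3))  # → 0.75
```

Values near 1 indicate near-perfect agreement, values near 0 agreement no better than chance; for more than two annotators a generalization such as Fleiss' kappa would be needed.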
Related papers
- Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings.
First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss.
Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z) - OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition [79.852642726105]
We propose a unified paradigm for parsing visually-situated text across diverse scenarios.
Specifically, we devise a universal model, called Omni, which can simultaneously handle three typical visually-situated text parsing tasks.
In Omni, all tasks share the unified encoder-decoder architecture, the unified objective point-conditioned text generation, and the unified input representation.
arXiv Detail & Related papers (2024-03-28T03:51:14Z) - Magic Markup: Maintaining Document-External Markup with an LLM [1.0538052824177144]
We present a system that re-tags modified programs, enabling rich annotations to automatically follow code as it evolves.
Our system achieves an accuracy of 90% on our benchmarks and can replace a document's tags in parallel at a rate of 5 seconds per tag.
While there remains significant room for improvement, we find performance reliable enough to justify further exploration of applications.
arXiv Detail & Related papers (2024-03-06T05:40:31Z) - WordScape: a Pipeline to extract multilingual, visually rich Documents
with Layout Annotations from Web Crawl Data [13.297444760076406]
We introduce WordScape, a novel pipeline for the creation of cross-disciplinary, multilingual corpora.
WordScape parses the Open XML structure of Word documents obtained from the web.
It offers culturally and linguistically diverse document pages with natural semantic structure and high-quality text.
arXiv Detail & Related papers (2023-12-15T20:28:31Z) - DAPR: A Benchmark on Document-Aware Passage Retrieval [57.45793782107218]
We propose and name this task Document-Aware Passage Retrieval (DAPR).
While analyzing the errors of the State-of-The-Art (SoTA) passage retrievers, we find the major errors (53.5%) are due to missing document context.
Our created benchmark enables future research on developing and comparing retrieval systems for the new task.
arXiv Detail & Related papers (2023-05-23T10:39:57Z) - Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z) - A Simple and Efficient Probabilistic Language model for Code-Mixed Text [0.0]
We present a simple probabilistic approach for building efficient word embedding for code-mixed text.
We examine its efficacy for the classification task using bidirectional LSTMs and SVMs.
arXiv Detail & Related papers (2021-06-29T05:37:57Z) - Multilevel Text Alignment with Cross-Document Attention [59.76351805607481]
Existing alignment methods operate at a single, predefined level.
We propose a new learning approach that equips previously established hierarchical attention encoders for representing documents with a cross-document attention component.
arXiv Detail & Related papers (2020-10-03T02:52:28Z) - Method for Customizable Automated Tagging: Addressing the Problem of
Over-tagging and Under-tagging Text Documents [0.0]
Using author provided tags to predict tags for a new document often results in the overgeneration of tags.
In this paper, we present a method to generate a universal set of tags that can be applied widely to a large document corpus.
arXiv Detail & Related papers (2020-04-30T18:28:42Z) - Learning to Select Bi-Aspect Information for Document-Scale Text Content
Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer.
In detail, the input is a set of structured records and a reference text for describing another recordset.
The output is a summary that accurately describes the partial content in the source recordset with the same writing style of the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information (including all listed content) and is not responsible for any consequences of its use.