The GINCO Training Dataset for Web Genre Identification of Documents Out
in the Wild
- URL: http://arxiv.org/abs/2201.03857v1
- Date: Tue, 11 Jan 2022 09:39:15 GMT
- Authors: Taja Kuzman, Peter Rupnik and Nikola Ljubešić
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This paper presents a new training dataset for automatic genre identification
GINCO, which is based on 1,125 crawled Slovenian web documents that consist of
650 thousand words. Each document was manually annotated for genre with a new
annotation schema that builds upon existing schemata, having primarily clarity
of labels and inter-annotator agreement in mind. The dataset presents various
challenges typical of web-based data, such as machine-translated content,
encoding errors, and multiple contents within a single document,
enabling evaluation of classifiers in realistic conditions. The initial machine
learning experiments on the dataset show that (1) pre-Transformer models are
drastically less able to model the phenomena, with macro F1 metrics ranging
around 0.22, while Transformer-based models achieve scores of around 0.58, and
(2) multilingual Transformer models work as well on the task as the monolingual
models that were previously proven to be superior to multilingual models on
standard NLP tasks.
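The macro F1 metric reported above averages per-class F1 scores, so rare genres count as much as frequent ones. A minimal sketch of how it is computed (the genre labels below are hypothetical examples, not the actual GINCO schema):

```python
def macro_f1(y_true, y_pred):
    """Macro F1: average the per-class F1 scores, weighting every class equally."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        # Per-class counts of true positives, false positives, false negatives.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)

# Hypothetical genre predictions over four documents.
gold = ["news", "news", "forum", "legal"]
pred = ["news", "forum", "forum", "news"]
print(round(macro_f1(gold, pred), 3))  # → 0.389
```

Because every class contributes equally, a classifier that ignores rare genres is penalized heavily, which is why macro F1 is a common choice for imbalanced genre datasets.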
Related papers
- GlossLM: A Massively Multilingual Corpus and Pretrained Model for Interlinear Glossed Text
We compile the largest existing corpus of interlinear glossed text (IGT) data from a variety of sources, covering over 450k examples across 1.8k languages.
We normalize much of our data to follow a standard set of labels across languages.
As many languages lack sufficient monolingual data, we pretrain a large multilingual model on our corpus.
We demonstrate the utility of this model by finetuning it on monolingual corpora, outperforming SOTA models by up to 6.6%.
arXiv Detail & Related papers (2024-03-11T03:21:15Z)
- A Multi-Modal Multilingual Benchmark for Document Image Classification
We introduce two newly curated multilingual datasets WIKI-DOC and MULTIEUR-DOCLEX.
We study popular visually-rich document understanding (Document AI) models in a previously untested setting: document image classification.
Experimental results show limitations of multilingual Document AI models on cross-lingual transfer across typologically distant languages.
arXiv Detail & Related papers (2023-10-25T04:35:06Z)
- An Open Dataset and Model for Language Identification
We present a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033 across 201 languages.
We make both the model and the dataset available to the research community.
arXiv Detail & Related papers (2023-05-23T08:43:42Z)
- BERT-Flow-VAE: A Weakly-supervised Model for Multi-Label Text Classification
We propose BERT-Flow-VAE (BFV), a Weakly-Supervised Multi-Label Text Classification model that reduces the need for full supervision.
Experimental results on 6 multi-label datasets show that BFV can substantially outperform other baseline WSMLTC models in key metrics.
arXiv Detail & Related papers (2022-10-27T07:18:56Z)
- Detecting Text Formality: A Study of Text Classification Approaches
This work proposes the first, to our knowledge, systematic study of formality detection methods based on statistical, neural, and Transformer-based machine learning approaches.
We conducted three types of experiments -- monolingual, multilingual, and cross-lingual.
The study shows that a character-level BiLSTM model outperforms Transformer-based ones on the monolingual and multilingual formality classification tasks.
arXiv Detail & Related papers (2022-04-19T16:23:07Z)
- DOCmT5: Document-Level Pretraining of Multilingual Language Models
We introduce DOCmT5, a multilingual sequence-to-sequence language model pre-trained with large scale parallel documents.
We propose a simple and effective pre-training objective, Document Reordering Machine Translation (DrMT).
DrMT brings consistent improvements over strong baselines on a variety of document-level generation tasks.
arXiv Detail & Related papers (2021-12-16T08:58:52Z)
- Benchmarking Multimodal AutoML for Tabular Data with Text Fields
We assemble 18 multimodal data tables that each contain some text fields.
Our benchmark enables researchers to evaluate their own methods for supervised learning with numeric, categorical, and text features.
arXiv Detail & Related papers (2021-11-04T09:29:16Z)
- Rethinking Document-level Neural Machine Translation
We try to answer the question: Is the capacity of current models strong enough for document-level translation?
We observe that the original Transformer with appropriate training techniques can achieve strong results for document translation, even with a length of 2000 words.
arXiv Detail & Related papers (2020-10-18T11:18:29Z)
- SPECTER: Document-level Representation Learning using Citation-informed Transformers
SPECTER generates document-level embedding of scientific documents based on pretraining a Transformer language model.
We introduce SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction to document classification and recommendation.
arXiv Detail & Related papers (2020-04-15T16:05:51Z)
- Towards Making the Most of Context in Neural Machine Translation
We argue that previous research did not make a clear use of the global context.
We propose a new document-level NMT framework that deliberately models the local context of each sentence.
arXiv Detail & Related papers (2020-02-19T03:30:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.