The GINCO Training Dataset for Web Genre Identification of Documents Out
in the Wild
- URL: http://arxiv.org/abs/2201.03857v1
- Date: Tue, 11 Jan 2022 09:39:15 GMT
- Authors: Taja Kuzman, Peter Rupnik and Nikola Ljubešić
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This paper presents a new training dataset for automatic genre identification
GINCO, which is based on 1,125 crawled Slovenian web documents that consist of
650 thousand words. Each document was manually annotated for genre with a new
annotation schema that builds upon existing schemata, having primarily clarity
of labels and inter-annotator agreement in mind. The dataset presents various
challenges typical of web-based data, such as machine-translated content,
encoding errors, and multiple contents within a single document,
enabling evaluation of classifiers in realistic conditions. The initial machine
learning experiments on the dataset show that (1) pre-Transformer models are
drastically less able to model the phenomena, with macro F1 metrics ranging
around 0.22, while Transformer-based models achieve scores of around 0.58, and
(2) multilingual Transformer models work as well on the task as the monolingual
models that were previously proven to be superior to multilingual models on
standard NLP tasks.
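The macro F1 metric reported above averages per-class F1 scores, so rare genres count as much as frequent ones. A minimal sketch of how it is computed (the genre labels below are hypothetical examples, not the actual GINCO schema):

```python
def macro_f1(y_true, y_pred):
    """Macro F1: average the per-class F1 scores, weighting every class equally."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        # Per-class counts of true positives, false positives, false negatives.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)

# Hypothetical genre predictions over four documents.
gold = ["news", "news", "forum", "legal"]
pred = ["news", "forum", "forum", "news"]
print(round(macro_f1(gold, pred), 3))  # → 0.389
```

Because every class contributes equally, a classifier that ignores rare genres is penalized heavily, which is why macro F1 is a common choice for imbalanced genre datasets.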
Related papers
- GlossLM: A Massively Multilingual Corpus and Pretrained Model for Interlinear Glossed Text
We compile the largest existing corpus of interlinear glossed text (IGT) data from a variety of sources, covering over 450k examples across 1.8k languages.
We normalize much of our data to follow a standard set of labels across languages.
As many languages lack sufficient monolingual data, we pretrain a large multilingual model on our corpus.
We demonstrate the utility of this model by finetuning it on monolingual corpora, outperforming SOTA models by up to 6.6%.
arXiv Detail & Related papers (2024-03-11T03:21:15Z)
- A Multi-Modal Multilingual Benchmark for Document Image Classification
We introduce two newly curated multilingual datasets WIKI-DOC and MULTIEUR-DOCLEX.
We study popular visually-rich document understanding (Document AI) models in a previously untested setting: document image classification.
Experimental results show limitations of multilingual Document AI models on cross-lingual transfer across typologically distant languages.
arXiv Detail & Related papers (2023-10-25T04:35:06Z)
- An Open Dataset and Model for Language Identification
We present a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033 across 201 languages.
We make both the model and the dataset available to the research community.
arXiv Detail & Related papers (2023-05-23T08:43:42Z)
- BERT-Flow-VAE: A Weakly-supervised Model for Multi-Label Text Classification
We propose BERT-Flow-VAE (BFV), a Weakly-Supervised Multi-Label Text Classification model that reduces the need for full supervision.
Experimental results on 6 multi-label datasets show that BFV can substantially outperform other baseline WSMLTC models in key metrics.
arXiv Detail & Related papers (2022-10-27T07:18:56Z)
- Detecting Text Formality: A Study of Text Classification Approaches
This work proposes the first, to our knowledge, systematic study of formality detection methods based on statistical, neural, and Transformer-based machine learning approaches.
We conducted three types of experiments -- monolingual, multilingual, and cross-lingual.
The study shows that a character-level BiLSTM model outperforms Transformer-based ones on the monolingual and multilingual formality classification tasks.
arXiv Detail & Related papers (2022-04-19T16:23:07Z)
- DOCmT5: Document-Level Pretraining of Multilingual Language Models
We introduce DOCmT5, a multilingual sequence-to-sequence language model pre-trained with large scale parallel documents.
We propose a simple and effective pre-training objective, Document Reordering Machine Translation (DrMT).
DrMT brings consistent improvements over strong baselines on a variety of document-level generation tasks.
arXiv Detail & Related papers (2021-12-16T08:58:52Z)
- Benchmarking Multimodal AutoML for Tabular Data with Text Fields
We assemble 18 multimodal data tables that each contain some text fields.
Our benchmark enables researchers to evaluate their own methods for supervised learning with numeric, categorical, and text features.
arXiv Detail & Related papers (2021-11-04T09:29:16Z)
- Rethinking Document-level Neural Machine Translation
We try to answer the question: Is the capacity of current models strong enough for document-level translation?
We observe that the original Transformer with appropriate training techniques can achieve strong results for document translation, even with a length of 2000 words.
arXiv Detail & Related papers (2020-10-18T11:18:29Z)
- SPECTER: Document-level Representation Learning using Citation-informed Transformers
SPECTER generates document-level embedding of scientific documents based on pretraining a Transformer language model.
We introduce SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction to document classification and recommendation.
arXiv Detail & Related papers (2020-04-15T16:05:51Z)
- Towards Making the Most of Context in Neural Machine Translation
We argue that previous research did not make a clear use of the global context.
We propose a new document-level NMT framework that deliberately models the local context of each sentence.
arXiv Detail & Related papers (2020-02-19T03:30:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.