Carolina: a General Corpus of Contemporary Brazilian Portuguese with
Provenance, Typology and Versioning Information
- URL: http://arxiv.org/abs/2303.16098v1
- Date: Tue, 28 Mar 2023 16:09:40 GMT
- Authors: Maria Clara Ramos Morales Crespo, Maria Lina de Souza Jeannine Rocha,
Mariana Lourenço Sturzeneker, Felipe Ribas Serras, Guilherme Lamartine de
Mello, Aline Silva Costa, Mayara Feliciano Palma, Renata Morais Mesquita,
Raquel de Paula Guets, Mariana Marques da Silva, Marcelo Finger, Maria Clara
Paixão de Sousa, Cristiane Namiuti, Vanessa Martins do Monte
- Abstract summary: Carolina is a large open corpus of Brazilian Portuguese texts under construction using web-as-corpus methodology.
Carolina's first public version has $653,322,577$ tokens, distributed over $7$ broad types.
- Score: 0.629199190108771
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents the first publicly available version of the Carolina
Corpus and discusses its future directions. Carolina is a large open corpus of
Brazilian Portuguese texts under construction using web-as-corpus methodology
enhanced with provenance, typology, versioning, and text integrality. The
corpus aims at being used both as a reliable source for research in Linguistics
and as an important resource for Computer Science research on language models,
contributing towards removing Portuguese from the set of low-resource
languages. Here we present the corpus construction methodology,
comparing it with other existing methodologies, as well as the corpus's current
state: Carolina's first public version has $653,322,577$ tokens, distributed
over $7$ broad types. Each text is annotated with several different metadata
categories in its header, which we developed using TEI annotation standards. We
also present ongoing derivative works and invite NLP researchers to contribute
their own.
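As a rough illustration of what header-level metadata in a TEI-encoded text can look like to a consumer, the sketch below parses a small TEI document with Python's standard library. The sample document, the fields it contains, and the helper name `read_header` are assumptions made for this example; they are not taken from Carolina, whose actual header schema is defined by the corpus authors.

```python
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample document for illustration only; not a Carolina file.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Example text</title></titleStmt>
      <publicationStmt><p>Illustrative only</p></publicationStmt>
      <sourceDesc><bibl><ptr target="https://example.org/source"/></bibl></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text><body><p>Olá, mundo.</p></body></text>
</TEI>"""


def read_header(xml_string):
    """Return the title, source URL, and first paragraph of a TEI document."""
    root = ET.fromstring(xml_string)
    header = root.find("tei:teiHeader", TEI_NS)
    # Title recorded in the header's file description.
    title = header.findtext(
        "tei:fileDesc/tei:titleStmt/tei:title", namespaces=TEI_NS
    )
    # Provenance pointer, if the header carries one.
    ptr = header.find("tei:fileDesc/tei:sourceDesc/tei:bibl/tei:ptr", TEI_NS)
    source = ptr.get("target") if ptr is not None else None
    # First paragraph of the text body.
    body = root.findtext("tei:text/tei:body/tei:p", namespaces=TEI_NS)
    return {"title": title, "source": source, "body": body}


if __name__ == "__main__":
    print(read_header(SAMPLE))
```

Keeping the metadata in the header and the text in the body means provenance fields can be read or filtered on without touching the text itself.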
Related papers
- Tucano: Advancing Neural Text Generation for Portuguese [0.0]
This study aims to introduce a new set of resources to stimulate the future development of neural text generation in Portuguese.
In this work, we document the development of GigaVerbo, a concatenation of deduplicated Portuguese text corpora amounting to 200 billion tokens.
Our models perform on par with or better than other Portuguese and multilingual language models of similar size on several Portuguese benchmarks.
arXiv Detail & Related papers (2024-11-12T15:06:06Z)
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- TArC: Tunisian Arabish Corpus First complete release [0.0]
We present the final result of a project on Tunisian Arabic encoded in Arabizi.
The project led to the creation of two integrated and independent resources.
arXiv Detail & Related papers (2022-07-11T11:46:59Z)
- The Open corpus of the Veps and Karelian languages: overview and applications [52.77024349608834]
The Open Corpus of the Veps and Karelian Languages (VepKar) is an extension of the Veps Corpus created in 2009.
The VepKar corpus comprises texts in Karelian and Veps, multifunctional dictionaries linked to them, and software with an advanced system of search.
Future plans include developing a speech module for working with audio recordings and a syntactic tagging module using morphological analysis outputs.
arXiv Detail & Related papers (2022-06-08T13:05:50Z)
- Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z)
- From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early Modern French [57.886210204774834]
We present our efforts to develop NLP tools for Early Modern French (historical French from the 16th to the 18th centuries).
We present the $\text{FreEM}_{\text{max}}$ corpus of Early Modern French and D'AlemBERT, a RoBERTa-based language model trained on $\text{FreEM}_{\text{max}}$.
arXiv Detail & Related papers (2022-02-18T22:17:22Z)
- Prague Dependency Treebank -- Consolidated 1.0 [1.7147127043116672]
The paper presents the Prague Dependency Treebank-Consolidated 1.0 (PDT-C 1.0).
PDT-C 1.0 contains four different datasets of Czech, uniformly annotated using the standard PDT scheme.
Altogether, the treebank contains around 180,000 sentences with their morphological, surface and deep syntactic annotation.
arXiv Detail & Related papers (2020-06-05T20:52:55Z)
- A Multi-Perspective Architecture for Semantic Code Search [58.73778219645548]
We propose a novel multi-perspective cross-lingual neural framework for code--text matching.
Our experiments on the CoNaLa dataset show that our proposed model yields better performance than previous approaches.
arXiv Detail & Related papers (2020-05-06T04:46:11Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
- Mapping Languages: The Corpus of Global Language Use [0.0]
This paper describes a web-based corpus of global language use with a focus on how this corpus can be used for data-driven language mapping.
In total, the corpus contains 423 billion words representing 148 languages and 158 countries.
arXiv Detail & Related papers (2020-04-02T03:42:14Z)