The Annotation Guideline of LST20 Corpus
- URL: http://arxiv.org/abs/2008.05055v1
- Date: Wed, 12 Aug 2020 01:16:45 GMT
- Title: The Annotation Guideline of LST20 Corpus
- Authors: Prachya Boonkwan and Vorapon Luantangsrisuk and Sitthaa Phaholphinyo
and Kanyanat Kriengket and Dhanon Leenoi and Charun Phrombut and Monthika
Boriboon and Krit Kosawat and Thepchai Supnithi
- Abstract summary: The dataset complies with the CoNLL-2003-style format for ease of use.
At a large scale, it consists of 3,164,864 words, 288,020 named entities, 248,962 clauses, and 74,180 sentences.
All 3,745 documents are also annotated with 15 news genres.
- Score: 0.3161954199291541
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This report presents the annotation guideline for LST20, a large-scale corpus
with multiple layers of linguistic annotation for Thai language processing. Our
guideline consists of five layers of linguistic annotation: word segmentation,
POS tagging, named entities, clause boundaries, and sentence boundaries. The
dataset complies with the CoNLL-2003-style format for ease of use. LST20 Corpus
offers five layers of linguistic annotation as aforementioned. At a large
scale, it consists of 3,164,864 words, 288,020 named entities, 248,962 clauses,
and 74,180 sentences, while it is annotated with 16 distinct POS tags. All
3,745 documents are also annotated with 15 news genres. Given its sheer
size, this dataset is considered large enough for developing joint neural
models for NLP. With the existence of this publicly available corpus, Thai has
become a linguistically rich language for the first time.
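
As a rough illustration of what a CoNLL-2003-style layout implies for the annotation layers described above, the sketch below reads such a file and tallies words, sentences, entities, and clauses. The abstract does not spell out the file layout, so everything specific here is an assumption: four tab-separated columns (word, POS tag, named-entity tag, clause-boundary tag), blank lines as sentence boundaries, and tags whose first letter is `B` marking the start of an entity or clause. The names `Token`, `read_conll`, `corpus_stats`, and the sample rows are hypothetical and not taken from the corpus.

```python
from typing import Dict, Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    """One annotated word; the column order is an assumption, not the official spec."""
    word: str
    pos: str
    ner: str
    clause: str


def read_conll(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one sentence (a list of Token) per blank-line-separated block."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        if len(fields) != 4:
            raise ValueError(f"expected 4 tab-separated columns, got {len(fields)}: {line!r}")
        sentence.append(Token(*fields))
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


def corpus_stats(sentences: Iterable[List[Token]]) -> Dict[str, int]:
    """Count words, sentences, and tags that open an entity or clause (assumed 'B' prefix)."""
    stats = {"words": 0, "sentences": 0, "entities": 0, "clauses": 0}
    for sentence in sentences:
        stats["sentences"] += 1
        for token in sentence:
            stats["words"] += 1
            if token.ner.startswith("B"):
                stats["entities"] += 1
            if token.clause.startswith("B"):
                stats["clauses"] += 1
    return stats


# Made-up sample rows for illustration only; the tag names are assumptions.
SAMPLE = (
    "สมชาย\tNN\tB_PER\tB_CLS\n"
    "ไป\tVV\tO\tI_CLS\n"
    "ตลาด\tNN\tO\tE_CLS\n"
    "\n"
    "ฝน\tNN\tO\tB_CLS\n"
    "ตก\tVV\tO\tE_CLS\n"
)

if __name__ == "__main__":
    print(corpus_stats(read_conll(SAMPLE.splitlines())))
```

Keeping every layer on the same row is what makes a format like this convenient for joint models: a single pass over the file yields aligned word, POS, entity, and clause labels without any cross-file bookkeeping.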
Related papers
- Cross-lingual Named Entity Corpus for Slavic Languages [1.8693484642696736]
This work is the result of a series of shared tasks, conducted in 2017-2023 as a part of the Workshops on Slavic Natural Language Processing.
The corpus consists of 5 017 documents on seven topics. The documents are annotated with five classes of named entities.
arXiv Detail & Related papers (2024-03-30T22:20:08Z)
- Wav2Gloss: Generating Interlinear Glossed Text from Speech [78.64412090339044]
We propose Wav2Gloss, a task in which four linguistic annotation components are extracted automatically from speech.
We provide various baselines to lay the groundwork for future research on Interlinear Glossed Text generation from speech.
arXiv Detail & Related papers (2024-03-19T21:45:29Z)
- Advancing Multilingual Pre-training: TRIP Triangular Document-level Pre-training for Multilingual Language Models [107.83158521848372]
We present Triangular Document-level Pre-training (TRIP), which is the first in the field to accelerate the conventional monolingual and bilingual objectives into a trilingual objective with a novel method called Grafting.
TRIP achieves several strong state-of-the-art (SOTA) scores on three multilingual document-level machine translation benchmarks and one cross-lingual abstractive summarization benchmark, including consistent improvements by up to 3.11 d-BLEU points and 8.9 ROUGE-L points.
arXiv Detail & Related papers (2022-12-15T12:14:25Z)
- CLSE: Corpus of Linguistically Significant Entities [58.29901964387952]
We release a Corpus of Linguistically Significant Entities (CLSE) annotated by experts.
CLSE covers 74 different semantic types to support various applications from airline ticketing to video games.
We create a linguistically representative NLG evaluation benchmark in three languages: French, Marathi, and Russian.
arXiv Detail & Related papers (2022-11-04T12:56:12Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
- Potential Idiomatic Expression (PIE)-English: Corpus for Classes of Idioms [1.6111818380407035]
This is the first dataset with classes of idioms beyond the literal and the general idioms classification.
This dataset contains over 20,100 samples with almost 1,200 cases of idioms (with their meanings) from 10 classes (or senses).
arXiv Detail & Related papers (2021-04-25T13:05:29Z)
- Prague Dependency Treebank -- Consolidated 1.0 [1.7147127043116672]
Prague Dependency Treebank-Consolidated 1.0 (PDT-C 1.0) contains four different datasets of Czech, uniformly annotated using the standard PDT scheme.
Altogether, the treebank contains around 180,000 sentences with their morphological, surface and deep syntactic annotation.
arXiv Detail & Related papers (2020-06-05T20:52:55Z)
- A Corpus for Large-Scale Phonetic Typology [112.19288631037055]
We present VoxClamantis v1.0, the first large-scale corpus for phonetic typology.
The corpus provides aligned segments and estimated phoneme-level labels in 690 readings spanning 635 languages, along with acoustic-phonetic measures of vowels and sibilants.
arXiv Detail & Related papers (2020-05-28T13:03:51Z)
- Validation and Normalization of DCS corpus using Sanskrit Heritage tools to build a tagged Gold Corpus [0.0]
The Digital Corpus of Sanskrit records around 650,000 sentences along with their morphological and lexical tagging.
The Sanskrit Heritage Engine's Reader produces all possible segmentations with morphological and lexical analyses.
arXiv Detail & Related papers (2020-05-13T19:23:43Z)
- Mapping Languages: The Corpus of Global Language Use [0.0]
This paper describes a web-based corpus of global language use with a focus on how this corpus can be used for data-driven language mapping.
In total, the corpus contains 423 billion words representing 148 languages and 158 countries.
arXiv Detail & Related papers (2020-04-02T03:42:14Z)
- CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified, with over 11,000 speakers and over 60 accents.
CoVoST is released under a CC0 license and is free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.