GCDT: A Chinese RST Treebank for Multigenre and Multilingual Discourse Parsing
- URL: http://arxiv.org/abs/2210.10449v1
- Date: Wed, 19 Oct 2022 10:27:41 GMT
- Title: GCDT: A Chinese RST Treebank for Multigenre and Multilingual Discourse Parsing
- Authors: Siyao Peng, Yang Janet Liu, Amir Zeldes
- Abstract summary: GCDT is the largest hierarchical discourse treebank for Mandarin Chinese in the framework of Rhetorical Structure Theory (RST).
We report on this dataset's parsing experiments, including state-of-the-art (SOTA) scores for Chinese RST parsing and RST parsing on the English GUM dataset.
- Score: 9.367612782346207
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A lack of large-scale human-annotated data has hampered the hierarchical
discourse parsing of Chinese. In this paper, we present GCDT, the largest
hierarchical discourse treebank for Mandarin Chinese in the framework of
Rhetorical Structure Theory (RST). GCDT covers over 60K tokens across five
genres of freely available text, using the same relation inventory as
contemporary RST treebanks for English. We also report on this dataset's
parsing experiments, including state-of-the-art (SOTA) scores for Chinese RST
parsing and RST parsing on the English GUM dataset, using cross-lingual
training in Chinese and English with multilingual embeddings.
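The cross-lingual setup above rests on multilingual embeddings placing Chinese and English text in one shared vector space, so a model trained on one language can generalize to the other. A minimal sketch of that idea, using toy hand-written vectors (the sentences and vectors below are hypothetical stand-ins for real encoder output, not values from the paper):

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy "multilingual" embeddings: translation pairs get nearby vectors,
# mimicking how a multilingual encoder aligns the two languages.
embeddings = {
    "因为下雨": [0.9, 0.1, 0.0],                   # "because it rained"
    "because it rained": [0.8, 0.2, 0.1],
    "比赛取消了": [0.1, 0.9, 0.2],                 # "the match was cancelled"
    "the match was cancelled": [0.2, 0.8, 0.1],
}

def retrieve(query, candidates):
    # Nearest-neighbour retrieval in the shared space: the Chinese query
    # should land closest to its English translation.
    return max(candidates, key=lambda c: cosine(embeddings[query], embeddings[c]))

english = ["because it rained", "the match was cancelled"]
print(retrieve("因为下雨", english))  # → because it rained
```

This same nearest-neighbour test over parallel text (bitext retrieval) is what some of the related papers below use to quantify how well-aligned a shared embedding space actually is.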
Related papers
- When Does Classical Chinese Help? Quantifying Cross-Lingual Transfer in Hanja and Kanbun [48.07219104902607]
We question the assumption of cross-lingual transferability from Classical Chinese to Hanja and Kanbun.
Our experiments show minimal impact of Classical Chinese datasets on language model performance for ancient Korean documents written in Hanja.
arXiv Detail & Related papers (2024-11-07T15:59:54Z)
- Bilingual Rhetorical Structure Parsing with Large Parallel Annotations [5.439020425819001]
We introduce a parallel Russian annotation for the large and diverse English GUM RST corpus.
Our end-to-end RST parser achieves state-of-the-art results on both English and Russian corpora.
To the best of our knowledge, this work is the first to evaluate the potential of cross-lingual end-to-end RST parsing on a manually annotated parallel corpus.
arXiv Detail & Related papers (2024-09-23T12:40:33Z)
- Multi-Modal Multi-Granularity Tokenizer for Chu Bamboo Slip Scripts [65.10991154918737]
This study focuses on the Chu bamboo slip (CBS) script used during the Spring and Autumn and Warring States period (771-256 BCE) in Ancient China.
Our tokenizer first adopts character detection to locate character boundaries, and then conducts character recognition at both the character and sub-character levels.
To support the academic community, we have also assembled the first large-scale dataset of CBSs with over 100K annotated character image scans.
arXiv Detail & Related papers (2024-09-02T07:42:55Z)
- Simple Yet Effective Neural Ranking and Reranking Baselines for Cross-Lingual Information Retrieval [50.882816288076725]
Cross-lingual information retrieval is the task of searching documents in one language with queries in another.
We provide a conceptual framework for organizing different approaches to cross-lingual retrieval using multi-stage architectures for mono-lingual retrieval as a scaffold.
We implement simple yet effective reproducible baselines in the Anserini and Pyserini IR toolkits for test collections from the TREC 2022 NeuCLIR Track, in Persian, Russian, and Chinese.
arXiv Detail & Related papers (2023-04-03T14:17:00Z)
- Joint Chinese Word Segmentation and Span-based Constituency Parsing [11.080040070201608]
This work proposes a method for joint Chinese word segmentation and span-based constituency parsing by adding extra labels to individual Chinese characters on the parse trees.
Through experiments, the proposed algorithm outperforms the recent models for joint segmentation and constituency parsing on CTB 5.1.
arXiv Detail & Related papers (2022-11-03T08:19:00Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
- Investigating Transfer Learning in Multilingual Pre-trained Language Models through Chinese Natural Language Inference [11.096793445651313]
We investigate the cross-lingual transfer abilities of XLM-R for Chinese and English natural language inference (NLI).
To better understand linguistic transfer, we created 4 categories of challenge and adversarial tasks for Chinese.
We find that cross-lingual models trained on English NLI do transfer well across our Chinese tasks.
arXiv Detail & Related papers (2021-06-07T22:00:18Z)
- Multilingual Neural RST Discourse Parsing [24.986030179701405]
We investigate two approaches to building a neural cross-lingual discourse parser: multilingual vector representations and segment-level translation.
Experiment results show that both methods are effective even with limited training data, and achieve state-of-the-art performance on cross-lingual, document-level discourse parsing.
arXiv Detail & Related papers (2020-12-03T05:03:38Z)
- Looking for Clues of Language in Multilingual BERT to Improve Cross-lingual Generalization [56.87201892585477]
Token embeddings in multilingual BERT (m-BERT) contain both language and semantic information.
We control the output languages of multilingual BERT by manipulating the token embeddings.
arXiv Detail & Related papers (2020-10-20T05:41:35Z)
- CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified with over 11,000 speakers and over 60 accents.
CoVoST is released under the CC0 license and is free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.