A Practical Chinese Dependency Parser Based on A Large-scale Dataset
- URL: http://arxiv.org/abs/2009.00901v2
- Date: Thu, 3 Sep 2020 02:42:29 GMT
- Title: A Practical Chinese Dependency Parser Based on A Large-scale Dataset
- Authors: Shuai Zhang, Lijie Wang, Ke Sun, Xinyan Xiao
- Abstract summary: Dependency parsing is a longstanding natural language processing task, with its outputs crucial to various downstream tasks.
Recently, neural network based (NN-based) dependency parsing has achieved significant progress and obtained state-of-the-art results.
However, NN-based approaches require massive amounts of labeled training data, which is expensive because it requires human annotation by experts.
- Score: 21.359679124869402
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dependency parsing is a longstanding natural language processing task, with
its outputs crucial to various downstream tasks. Recently, neural network based
(NN-based) dependency parsing has achieved significant progress and obtained
state-of-the-art results. However, NN-based approaches require
massive amounts of labeled training data, which is very expensive because it
requires human annotation by experts. Thus, few industrial-oriented dependency
parser tools are publicly available. In this report, we present Baidu
Dependency Parser (DDParser), a new Chinese dependency parser trained on a
large-scale manually labeled dataset called Baidu Chinese Treebank (DuCTB).
DuCTB consists of about one million annotated sentences from multiple sources
including search logs, Chinese newswire, various forum discourses, and
conversation programs. DDParser is extended on the graph-based biaffine parser
to accommodate to the characteristics of Chinese dataset. We conduct
experiments on two test sets: a standard test set with the same distribution
as the training set, and a random test set sampled from other sources; their
labeled attachment scores (LAS) are 92.9% and 86.9%, respectively.
DDParser achieves state-of-the-art results, and is released at
https://github.com/baidu/DDParser.
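The graph-based biaffine parser that DDParser extends scores every candidate head-dependent arc with a biaffine transform over word representations. A minimal NumPy sketch of that scoring step follows; the function name, shapes, and random toy inputs are illustrative assumptions, not DDParser's actual code.

```python
import numpy as np

def biaffine_arc_scores(H_dep, H_head, U, w_head):
    """Score every (dependent, head) arc pair with a biaffine transform.

    H_dep:  (n, d) dependent representations
    H_head: (n, d) head representations
    U:      (d, d) biaffine weight matrix
    w_head: (d,)   linear bias term for candidate heads
    Returns an (n, n) matrix where entry [i, j] scores word j
    as the head of word i.
    """
    # Bilinear term: H_dep @ U @ H_head^T
    bilinear = H_dep @ U @ H_head.T
    # Linear bias for each candidate head, broadcast over all dependents
    bias = H_head @ w_head
    return bilinear + bias[None, :]

# Toy example: 4 words with 8-dimensional representations
rng = np.random.default_rng(0)
n, d = 4, 8
H = rng.normal(size=(n, d))
scores = biaffine_arc_scores(H, H, rng.normal(size=(d, d)), rng.normal(size=d))
# Greedy head prediction: each word picks its highest-scoring head
heads = scores.argmax(axis=1)
```

In practice the parser uses separate learned projections for dependent and head roles and decodes a well-formed tree (e.g. with a maximum spanning tree algorithm) rather than taking a greedy argmax.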
Related papers
- Sinhala-English Parallel Word Dictionary Dataset [0.554780083433538]
We introduce three parallel English-Sinhala word dictionaries (En-Si-dict-large, En-Si-dict-filtered, En-Si-dict-FastText) which help in multilingual Natural Language Processing (NLP) tasks related to English and Sinhala languages.
arXiv Detail & Related papers (2023-08-04T10:21:35Z)
- READIN: A Chinese Multi-Task Benchmark with Realistic and Diverse Input Noises [87.70001456418504]
We construct READIN: a Chinese multi-task benchmark with REalistic And Diverse Input Noises.
READIN contains four diverse tasks and requests annotators to re-enter the original test data with two commonly used Chinese input methods: Pinyin input and speech input.
We experiment with a series of strong pretrained language models as well as robust training methods, and find that these models often suffer significant performance drops on READIN.
arXiv Detail & Related papers (2023-02-14T20:14:39Z)
- Substructure Distribution Projection for Zero-Shot Cross-Lingual Dependency Parsing [55.69800855705232]
SubDP is a technique that projects a distribution over structures in one domain to another, by projecting substructure distributions separately.
We evaluate SubDP on zero-shot cross-lingual dependency parsing, taking dependency arcs as substructures.
arXiv Detail & Related papers (2021-10-16T10:12:28Z)
- Multilingual Compositional Wikidata Questions [9.602430657819564]
We propose a method for creating a multilingual, parallel dataset of question-query pairs grounded in Wikidata.
We use this data to train semantic parsers for Hebrew, Kannada, Chinese, and English to better understand the current strengths and weaknesses of multilingual semantic parsing.
arXiv Detail & Related papers (2021-08-07T19:40:38Z)
- DT-grams: Structured Dependency Grammar Stylometry for Cross-Language Authorship Attribution [0.20305676256390934]
We present a novel language-independent feature for authorship analysis based on dependency graphs and universal part of speech tags, called DT-grams.
We evaluate DT-grams by performing cross-language authorship attribution on untranslated datasets of bilingual authors.
arXiv Detail & Related papers (2021-06-10T11:50:07Z)
- GATE: Graph Attention Transformer Encoder for Cross-lingual Relation and Event Extraction [107.8262586956778]
We introduce graph convolutional networks (GCNs) with universal dependency parses to learn language-agnostic sentence representations.
GCNs struggle to model words with long-range dependencies or words that are not directly connected in the dependency tree.
We propose to utilize the self-attention mechanism to learn the dependencies between words with different syntactic distances.
arXiv Detail & Related papers (2020-10-06T20:30:35Z)
- N-LTP: An Open-source Neural Language Technology Platform for Chinese [68.58732970171747]
N-LTP is an open-source neural language technology platform supporting six fundamental Chinese NLP tasks.
N-LTP adopts a multi-task framework with a shared pre-trained model, which has the advantage of capturing shared knowledge across relevant Chinese tasks.
arXiv Detail & Related papers (2020-09-24T11:45:39Z)
- Neural Approaches for Data Driven Dependency Parsing in Sanskrit [19.844420181108177]
We evaluate four different data-driven machine learning models, originally proposed for different languages, and compare their performances on Sanskrit data.
We compare the performance of each of the models in a low-resource setting, with 1,500 sentences for training.
We also investigate the impact of word ordering in which the sentences are provided as input to these systems, by parsing verses and their corresponding prose order.
arXiv Detail & Related papers (2020-04-17T06:47:15Z)
- Towards Instance-Level Parser Selection for Cross-Lingual Transfer of Dependency Parsers [59.345145623931636]
We argue for a novel cross-lingual transfer paradigm: instance-level parser selection (ILPS).
We present a proof-of-concept study focused on instance-level selection in the framework of delexicalized transfer.
arXiv Detail & Related papers (2020-04-16T13:18:55Z)
- Bootstrapping a Crosslingual Semantic Parser [74.99223099702157]
We adapt a semantic parser trained on a single language, such as English, to new languages and multiple domains with minimal annotation.
We ask whether machine translation is an adequate substitute for training data, and extend this to investigate bootstrapping using joint training with English, paraphrasing, and multilingual pre-trained models.
arXiv Detail & Related papers (2020-04-06T12:05:02Z)
- Cross-Lingual Adaptation Using Universal Dependencies [1.027974860479791]
We show that models trained using UD parse trees for complex NLP tasks can characterize very different languages.
Based on UD parse trees, we develop several models using tree kernels and show that these models trained on the English dataset can correctly classify data of other languages.
arXiv Detail & Related papers (2020-03-24T13:04:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.