Span Labeling Approach for Vietnamese and Chinese Word Segmentation
- URL: http://arxiv.org/abs/2110.00156v1
- Date: Fri, 1 Oct 2021 01:25:50 GMT
- Title: Span Labeling Approach for Vietnamese and Chinese Word Segmentation
- Authors: Duc-Vu Nguyen, Linh-Bao Vo, Dang Van Thin, Ngan Luu-Thuy Nguyen
- Abstract summary: We propose a span labeling approach, namely SpanSeg, to model n-gram information for Vietnamese word segmentation.
We compare the span labeling approach with the conditional random field by using encoders with the same architecture.
Since Vietnamese and Chinese have similar linguistic phenomena, we evaluate the proposed method on the Vietnamese treebank benchmark dataset and five Chinese benchmark datasets.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a span labeling approach, namely SpanSeg, to model n-gram information for Vietnamese word segmentation. We compare the span labeling approach with the conditional random field by using encoders with the same architecture. Since Vietnamese and Chinese have similar linguistic phenomena, we evaluate the proposed method on the Vietnamese treebank benchmark dataset and five Chinese benchmark datasets. Our experimental results show that the proposed SpanSeg achieves higher performance than the sequence tagging approach, with a state-of-the-art F-score of 98.31% on the Vietnamese treebank benchmark, when both approaches use the contextual pre-trained language model XLM-RoBERTa and predicted word boundary information. In addition, we conduct fine-tuning experiments for the span labeling approach on the BERT and ZEN pre-trained language models for Chinese, achieving fewer parameters, faster inference time, and competitive or higher F-scores than the previous state-of-the-art approach, word segmentation with word-hood memory networks, on five Chinese benchmarks.
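To make the contrast with sequence tagging concrete, here is a minimal sketch of the general span-labeling idea for word segmentation: every candidate span of syllables is scored as a possible word, and the highest-scoring set of non-overlapping spans covering the sentence is recovered with dynamic programming. The scorer, the maximum word length, and the toy sentence below are illustrative placeholders, not the paper's actual encoder (e.g., XLM-RoBERTa span representations) or decoding procedure.

```python
# Minimal sketch of span labeling for word segmentation (not the paper's code):
# score every candidate span [i, j) as a word, then pick the best span cover.
from typing import Callable, List, Tuple

def segment(
    syllables: List[str],
    span_score: Callable[[int, int], float],  # score of span [i, j) being one word
    max_word_len: int = 7,                    # assumed cap on word length in syllables
) -> List[Tuple[int, int]]:
    """Return the non-overlapping span cover [i, j) maximizing the total span score."""
    n = len(syllables)
    best = [float("-inf")] * (n + 1)  # best[j]: best score of a cover of syllables[:j]
    best[0] = 0.0
    back = [0] * (n + 1)              # back[j]: start of the last span in that cover
    for j in range(1, n + 1):
        for i in range(max(0, j - max_word_len), j):
            score = best[i] + span_score(i, j)
            if score > best[j]:
                best[j], back[j] = score, i
    # Walk the backpointers to recover the segmentation.
    spans, j = [], n
    while j > 0:
        i = back[j]
        spans.append((i, j))
        j = i
    return spans[::-1]

if __name__ == "__main__":
    # Toy scorer standing in for a trained span classifier: it favors the
    # two-syllable word "học sinh" and single syllables elsewhere.
    sent = ["học", "sinh", "học", "sinh", "học"]
    def toy_score(i: int, j: int) -> float:
        if (i, j) in {(0, 2), (2, 4)}:
            return 2.0
        return 1.0 if j - i == 1 else -1.0
    print([" ".join(sent[i:j]) for i, j in segment(sent, toy_score)])
    # -> ['học sinh', 'học sinh', 'học']
```

In a sequence tagging formulation (e.g., with a CRF) the same decision is instead made per syllable with B/I-style boundary tags; the span view makes the whole-word, n-gram evidence explicit, which is the motivation behind SpanSeg.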
Related papers
- Optimal Transport Posterior Alignment for Cross-lingual Semantic Parsing [68.47787275021567]
Cross-lingual semantic parsing transfers parsing capability from a high-resource language (e.g., English) to low-resource languages with scarce training data.
We propose a new approach to cross-lingual semantic parsing by explicitly minimizing cross-lingual divergence between latent variables using Optimal Transport.
arXiv Detail & Related papers (2023-07-09T04:52:31Z)
- WYWEB: A NLP Evaluation Benchmark For Classical Chinese [10.138128038929237]
We introduce the WYWEB evaluation benchmark, which consists of nine NLP tasks in classical Chinese.
We evaluate existing pre-trained language models, all of which struggle with this benchmark.
arXiv Detail & Related papers (2023-05-23T15:15:11Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Joint Chinese Word Segmentation and Span-based Constituency Parsing [11.080040070201608]
This work proposes a method for joint Chinese word segmentation and span-based constituency parsing by adding extra labels to individual Chinese characters on the parse trees.
Through experiments, the proposed algorithm outperforms the recent models for joint segmentation and constituency parsing on CTB 5.1.
arXiv Detail & Related papers (2022-11-03T08:19:00Z)
- CROP: Zero-shot Cross-lingual Named Entity Recognition with Multilingual Labeled Sequence Translation [113.99145386490639]
Cross-lingual NER can transfer knowledge between languages via aligned cross-lingual representations or machine translation results.
We propose a Cross-lingual Entity Projection framework (CROP) to enable zero-shot cross-lingual NER.
We adopt a multilingual labeled sequence translation model to project the tagged sequence back to the target language and label the target raw sentence.
arXiv Detail & Related papers (2022-10-13T13:32:36Z)
- Bridging the Gap between Language Models and Cross-Lingual Sequence Labeling [101.74165219364264]
Large-scale cross-lingual pre-trained language models (xPLMs) have shown effectiveness in cross-lingual sequence labeling tasks.
Despite this success, we observe empirically that there is a training objective gap between the pre-training and fine-tuning stages.
In this paper, we first design a pre-training task tailored for cross-lingual sequence labeling (xSL), named Cross-lingual Language Informative Span Masking (CLISM), to eliminate the objective gap.
Second, we present ContrAstive-Consistency Regularization (CACR), which utilizes contrastive learning to encourage consistency between the representations of parallel input sequences.
arXiv Detail & Related papers (2022-04-11T15:55:20Z)
- More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens produce topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
- Multi-view Subword Regularization [111.04350390045705]
Multi-view Subword Regularization (MVR) is a method that enforces consistency between predictions made on inputs tokenized by the standard segmentation and by probabilistic segmentations.
Results on the XTREME multilingual benchmark show that MVR brings consistent improvements of up to 2.5 points over using standard segmentation algorithms.
arXiv Detail & Related papers (2021-03-15T16:07:42Z)
- Augmenting Part-of-speech Tagging with Syntactic Information for Vietnamese and Chinese [0.32228025627337864]
We implement the idea of improving word segmentation and part-of-speech tagging for Vietnamese by employing a simplified constituency parser.
Our neural model for joint word segmentation and part-of-speech tagging has the architecture of a syllable-based constituency parser.
This model can be augmented with predicted word boundaries and part-of-speech tags produced by other tools.
arXiv Detail & Related papers (2021-02-24T08:57:02Z)
- A Pilot Study of Text-to-SQL Semantic Parsing for Vietnamese [11.782566169354725]
We present the first public large-scale Text-to-SQL semantic parsing dataset for Vietnamese.
We find that automatic Vietnamese word segmentation improves the parsing results of both baselines.
PhoBERT for Vietnamese yields higher performance than the recent best multilingual language model XLM-R.
arXiv Detail & Related papers (2020-10-05T09:54:51Z)
- Vietnamese Word Segmentation with SVM: Ambiguity Reduction and Suffix Capture [2.7528170226206443]
We propose two novel approaches to feature extraction, one to reduce overlap ambiguity and the other to improve the prediction of unknown words containing suffixes.
Our proposed method obtains a better F1-score than the prior state-of-the-art methods UETsegmenter and RDRsegmenter.
arXiv Detail & Related papers (2020-06-14T05:19:46Z)