Joint Chinese Word Segmentation and Part-of-speech Tagging via Two-stage
Span Labeling
- URL: http://arxiv.org/abs/2112.09488v1
- Date: Fri, 17 Dec 2021 12:59:02 GMT
- Title: Joint Chinese Word Segmentation and Part-of-speech Tagging via Two-stage
Span Labeling
- Authors: Duc-Vu Nguyen, Linh-Bao Vo, Ngoc-Linh Tran, Kiet Van Nguyen, Ngan
Luu-Thuy Nguyen
- Abstract summary: We propose a neural model named SpanSegTag for joint Chinese word segmentation and part-of-speech tagging.
Our experiments show that our BERT-based model SpanSegTag achieved competitive performance on the CTB5, CTB6, and UD datasets.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Chinese word segmentation and part-of-speech tagging are necessary tasks in
terms of computational linguistics and application of natural language
processing. Many researchers still debate the demand for Chinese word
segmentation and part-of-speech tagging in the deep learning era. Nevertheless,
resolving ambiguities and detecting unknown words are challenging problems in
this field. Previous studies on joint Chinese word segmentation and
part-of-speech tagging mainly follow the character-based tagging model focusing
on modeling n-gram features. Unlike previous works, we propose a neural model
named SpanSegTag for joint Chinese word segmentation and part-of-speech tagging
following the span labeling approach, in which estimating the probability of
each n-gram being a word with a given part-of-speech tag is the main problem. We use the biaffine
operation over the left and right boundary representations of consecutive
characters to model the n-grams. Our experiments show that our BERT-based model
SpanSegTag achieved competitive performance on the CTB5, CTB6, and UD
datasets, and significant improvements on the CTB7 and CTB9 benchmarks compared with the
current state-of-the-art method using BERT or ZEN encoders.
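The biaffine scoring over boundary representations described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the dimensions, the three-label set, and the zero-padded boundary construction are assumptions made for the toy example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: a 4-character sentence, hidden size 8, and 3 illustrative
# word/POS labels. All sizes are assumptions for this sketch.
n_chars, hidden, n_labels = 4, 8, 3

# Contextual character representations (stand-ins for BERT outputs).
h = rng.standard_normal((n_chars, hidden))

# Boundary (fencepost) representations between characters: a span (i, j)
# covers characters i..j-1, so there are n_chars + 1 boundary positions.
left = np.vstack([np.zeros(hidden), h])   # left[i]: character just before boundary i
right = np.vstack([h, np.zeros(hidden)])  # right[j]: character just after boundary j

# Biaffine parameters: a bilinear form per label, a linear term, and a bias.
W = rng.standard_normal((n_labels, hidden, hidden))
U = rng.standard_normal((n_labels, 2 * hidden))
b = rng.standard_normal(n_labels)

def biaffine_scores(i: int, j: int) -> np.ndarray:
    """Score span (i, j) for every label: l^T W_k r + U_k [l; r] + b_k."""
    l, r = left[i], right[j]
    bilinear = np.einsum("d,kde,e->k", l, W, r)
    linear = U @ np.concatenate([l, r])
    return bilinear + linear + b

# Score every candidate n-gram (i < j) in the toy sentence.
scores = {(i, j): biaffine_scores(i, j)
          for i in range(n_chars) for j in range(i + 1, n_chars + 1)}
print(len(scores))  # 10 candidate spans for a 4-character sentence
```

In a trained model these per-span label scores would feed a decoder that selects a consistent segmentation; the sketch stops at span scoring, which is the part the abstract describes.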
Related papers
- A Tale of Two Languages: Large-Vocabulary Continuous Sign Language Recognition from Spoken Language Supervision
We introduce a multi-task Transformer model, CSLR2, that is able to ingest a signing sequence and output in a joint embedding space between signed language and spoken language text.
New dataset annotations provide continuous sign-level annotations for six hours of test videos, and will be made publicly available.
Our model significantly outperforms the previous state of the art on both tasks.
arXiv Detail & Related papers (2024-05-16T17:19:06Z)
- Linguistically Motivated Sign Language Segmentation
We consider two kinds of segmentation: segmentation into individual signs and segmentation into phrases.
Our method is motivated by linguistic cues observed in sign language corpora.
We replace the predominant IO tagging scheme with BIO tagging to account for continuous signing.
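As a toy illustration of why BIO tagging helps with continuous signing where IO tagging does not, consider two adjacent signs with no pause between them. The frame labels below are invented for the example:

```python
# Toy frame-level annotation: two adjacent signs, then two empty frames.
# The gloss labels are invented for this example.
frames = ["HELLO", "HELLO", "WORLD", "WORLD", "-", "-"]

# IO tagging cannot separate two adjacent signs: every signing frame is "I".
io_tags = ["I" if f != "-" else "O" for f in frames]

# BIO tagging marks the first frame of each sign with "B", so the boundary
# between HELLO and WORLD stays recoverable even with no gap between them.
bio_tags = []
prev = "-"
for f in frames:
    if f == "-":
        bio_tags.append("O")
    elif f != prev:
        bio_tags.append("B")
    else:
        bio_tags.append("I")
    prev = f

print(io_tags)   # ['I', 'I', 'I', 'I', 'O', 'O']
print(bio_tags)  # ['B', 'I', 'B', 'I', 'O', 'O']
```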
arXiv Detail & Related papers (2023-10-21T10:09:34Z)
- Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation
We focus on open vocabulary instance segmentation to expand a segmentation model to classify and segment instance-level novel categories.
Previous approaches have relied on massive caption datasets and complex pipelines to establish one-to-one mappings between image regions and captions in nouns.
We devise a joint Caption Grounding and Generation (CGG) framework, which incorporates a novel grounding loss that focuses only on matching objects to improve learning efficiency.
arXiv Detail & Related papers (2023-01-02T18:52:12Z)
- Joint Chinese Word Segmentation and Span-based Constituency Parsing
This work proposes a method for joint Chinese word segmentation and Span-based Constituency Parsing by adding extra labels to individual Chinese characters on the parse trees.
Experiments show that the proposed algorithm outperforms recent models for joint segmentation and constituency parsing on CTB 5.1.
arXiv Detail & Related papers (2022-11-03T08:19:00Z)
- Neural Token Segmentation for High Token-Internal Complexity
Tokenizing raw texts into word units is an essential pre-processing step for NLP pipelines.
We propose a novel neural segmentation model which combines contextualised token representation and char-level decoding.
Our model shows substantial improvements in segmentation accuracy on Hebrew and Arabic compared to the state-of-the-art.
arXiv Detail & Related papers (2022-03-21T10:07:17Z)
- MarkBERT: Marking Word Boundaries Improves Chinese BERT
MarkBERT keeps the vocabulary being Chinese characters and inserts boundary markers between contiguous words.
Compared to previous word-based BERT models, MarkBERT achieves better accuracy on text classification, keyword recognition, and semantic similarity tasks.
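The marker-insertion idea can be sketched in a few lines. Note that the marker token `[unused1]` and the helper function below are hypothetical choices for illustration, not MarkBERT's actual implementation:

```python
# Hypothetical sketch of boundary-marker insertion in the spirit of MarkBERT:
# the vocabulary stays character-level, but an explicit marker token is
# inserted between contiguous segmented words.
def insert_markers(words, marker="[unused1]"):
    chars = []
    for i, word in enumerate(words):
        if i > 0:
            chars.append(marker)  # boundary marker between adjacent words
        chars.extend(word)        # split each word into its characters
    return chars

segmented = ["自然", "语言", "处理"]
print(insert_markers(segmented))
# ['自', '然', '[unused1]', '语', '言', '[unused1]', '处', '理']
```

The resulting sequence would then be fed to a character-vocabulary BERT, making word boundaries visible to the model without enlarging the vocabulary.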
arXiv Detail & Related papers (2022-03-12T08:43:06Z)
- More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
- SHUOWEN-JIEZI: Linguistically Informed Tokenizers For Chinese Language Model Pretraining
We study the influences of three main factors on the Chinese tokenization for pretrained language models.
We propose linguistically informed tokenizers: SHUOWEN (meaning Talk Word), the pronunciation-based tokenizers, and JIEZI (meaning Solve Character), the glyph-based tokenizers.
We find that SHUOWEN and JIEZI tokenizers can generally outperform conventional single-character tokenizers.
arXiv Detail & Related papers (2021-06-01T11:20:02Z)
- Augmenting Part-of-speech Tagging with Syntactic Information for Vietnamese and Chinese
We improve word segmentation and part-of-speech tagging for Vietnamese by employing a simplified constituency parser.
Our neural model for joint word segmentation and part-of-speech tagging has the architecture of a syllable-based constituency parser.
This model can be augmented with predicted word boundary and part-of-speech tags by other tools.
arXiv Detail & Related papers (2021-02-24T08:57:02Z)
- Enhancing Sindhi Word Segmentation using Subword Representation Learning and Position-aware Self-attention
Sindhi word segmentation is a challenging task due to space omission and insertion issues.
Existing Sindhi word segmentation methods rely on designing and combining hand-crafted features.
We propose a Subword-Guided Neural Word Segmenter (SGNWS) that addresses word segmentation as a sequence labeling task.
arXiv Detail & Related papers (2020-12-30T08:31:31Z)
- Improving Chinese Segmentation-free Word Embedding With Unsupervised Association Measure
A segmentation-free word embedding model is proposed that collects an n-gram vocabulary via a novel unsupervised association measure called pointwise association with times information (PATI).
The proposed method leverages more latent information from the corpus and thus is able to collect more valid n-grams that have stronger cohesion as embedding targets in unsegmented language data, such as Chinese texts.
arXiv Detail & Related papers (2020-07-05T13:55:19Z)
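The paper's PATI measure is not reproduced here, but the n-gram collection idea it refines can be sketched with plain pointwise mutual information (PMI) as a stand-in association measure; the toy corpus and the zero threshold are assumptions for the example:

```python
import math
from collections import Counter

# Toy unsegmented Chinese text; a real corpus would be far larger.
text = "自然语言处理研究自然语言"

# Count characters and character bigrams directly, with no segmentation.
unigrams = Counter(text)
bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())

def pmi(bigram: str) -> float:
    """PMI of a bigram: log p(xy) / (p(x) * p(y))."""
    p_xy = bigrams[bigram] / n_bi
    p_x = unigrams[bigram[0]] / n_uni
    p_y = unigrams[bigram[1]] / n_uni
    return math.log(p_xy / (p_x * p_y))

# Keep bigrams whose association exceeds a threshold as embedding targets.
vocab = {bg for bg in bigrams if pmi(bg) > 0.0}
print(sorted(vocab))
```

Strongly cohesive n-grams such as 自然 and 语言 score highly and survive the cut, which is the role the paper's PATI measure plays when selecting embedding targets from unsegmented text.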
This list is automatically generated from the titles and abstracts of the papers in this site.