Augmenting Part-of-speech Tagging with Syntactic Information for
Vietnamese and Chinese
- URL: http://arxiv.org/abs/2102.12136v1
- Date: Wed, 24 Feb 2021 08:57:02 GMT
- Title: Augmenting Part-of-speech Tagging with Syntactic Information for
Vietnamese and Chinese
- Authors: Duc-Vu Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen
- Abstract summary: We implement the idea to improve word segmentation and part of speech tagging of the Vietnamese language by employing a simplified constituency.
Our neural model for joint word segmentation and part-of-speech tagging has the architecture of the syllable-based constituency.
This model can be augmented with predicted word boundary and part-of-speech tags by other tools.
- Score: 0.32228025627337864
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Word segmentation and part-of-speech tagging are two critical preliminary
steps for downstream tasks in Vietnamese natural language processing. In
reality, people tend to consider also the phrase boundary when performing word
segmentation and part of speech tagging rather than solely process word by word
from left to right. In this paper, we implement this idea to improve word
segmentation and part of speech tagging the Vietnamese language by employing a
simplified constituency parser. Our neural model for joint word segmentation
and part-of-speech tagging has the architecture of the syllable-based CRF
constituency parser. To reduce the complexity of parsing, we replace all
constituent labels with a single label indicating for phrases. This model can
be augmented with predicted word boundary and part-of-speech tags by other
tools. Because Vietnamese and Chinese have some similar linguistic phenomena,
we evaluated the proposed model and its augmented versions on three Vietnamese
benchmark datasets and six Chinese benchmark datasets. Our experimental results
show that the proposed model achieves higher performances than previous works
for both languages.
Related papers
- Wav2Gloss: Generating Interlinear Glossed Text from Speech [78.64412090339044]
We propose Wav2Gloss, a task in which four linguistic annotation components are extracted automatically from speech.
We provide various baselines to lay the groundwork for future research on Interlinear Glossed Text generation from speech.
arXiv Detail & Related papers (2024-03-19T21:45:29Z) - A General and Flexible Multi-concept Parsing Framework for Multilingual Semantic Matching [60.51839859852572]
We propose to resolve the text into multi concepts for multilingual semantic matching to liberate the model from the reliance on NER models.
We conduct comprehensive experiments on English datasets QQP and MRPC, and Chinese dataset Medical-SM.
arXiv Detail & Related papers (2024-03-05T13:55:16Z) - Joint Chinese Word Segmentation and Span-based Constituency Parsing [11.080040070201608]
This work proposes a method for joint Chinese word segmentation and Span-based Constituency Parsing by adding extra labels to individual Chinese characters on the parse trees.
Through experiments, the proposed algorithm outperforms the recent models for joint segmentation and constituency parsing on CTB 5.1.
arXiv Detail & Related papers (2022-11-03T08:19:00Z) - Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for
Text-to-Speech [88.22544315633687]
Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable Text-to-speech systems.
We propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary.
Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy.
arXiv Detail & Related papers (2022-06-05T10:50:34Z) - Joint Chinese Word Segmentation and Part-of-speech Tagging via Two-stage
Span Labeling [0.2624902795082451]
We propose a neural model named SpanSegTag for joint Chinese word segmentation and part-of-speech tagging.
Our experiments show that our BERT-based model SpanSegTag achieved competitive performances on the CTB5, CTB6, and UD datasets.
arXiv Detail & Related papers (2021-12-17T12:59:02Z) - More Than Words: Collocation Tokenization for Latent Dirichlet
Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z) - End-to-End Chinese Parsing Exploiting Lexicons [15.786281545363448]
We propose an end-to-end Chinese parsing model based on character inputs which jointly learns to output word segmentation, part-of-speech tags and dependency structures.
Our parsing model relies on word-char graph attention networks, which can enrich the character inputs with external word knowledge.
arXiv Detail & Related papers (2020-12-08T12:24:36Z) - Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language via an end-to-end way.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z) - Vietnamese Word Segmentation with SVM: Ambiguity Reduction and Suffix
Capture [2.7528170226206443]
We propose two novel ways to feature extraction, one to reduce the overlap ambiguity and the other to increase the ability to predict unknown words containing suffixes.
Our proposed method obtained a better F1-score than the prior state-of-the-art methods UETsegmenter, and RDRsegmenter.
arXiv Detail & Related papers (2020-06-14T05:19:46Z) - 2kenize: Tying Subword Sequences for Chinese Script Conversion [54.33749520569979]
We propose a model that can disambiguate between mappings and convert between the two scripts.
Our proposed method outperforms previous Chinese Character conversion approaches by 6 points in accuracy.
arXiv Detail & Related papers (2020-05-07T10:53:05Z) - Investigating Language Impact in Bilingual Approaches for Computational
Language Documentation [28.838960956506018]
This paper investigates how the choice of translation language affects the posterior documentation work.
We create 56 bilingual pairs that we apply to the task of low-resource unsupervised word segmentation and alignment.
Our results suggest that incorporating clues into the neural models' input representation increases their translation and alignment quality.
arXiv Detail & Related papers (2020-03-30T10:30:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.