Related papers: Parsing Through Boundaries in Chinese Word Segmentation

Parsing Through Boundaries in Chinese Word Segmentation

URL: http://arxiv.org/abs/2503.23091v1
Date: Sat, 29 Mar 2025 14:24:02 GMT
Title: Parsing Through Boundaries in Chinese Word Segmentation
Authors: Yige Chen, Zelong Li, Changbing Yang, Cindy Zhang, Amandisa Cady, Ai Ka Lee, Zejiao Zeng, Haihua Pan, Jungyeul Park,
Abstract summary: Unlike English, Chinese lacks explicit word boundaries, making segmentation both necessary and inherently ambiguous.<n>This study highlights the intricate relationship between word segmentation and syntactic parsing, providing a clearer understanding of how different segmentation strategies shape dependency structures in Chinese.
Score: 4.74872130711676
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Chinese word segmentation is a foundational task in natural language processing (NLP), with far-reaching effects on syntactic analysis. Unlike alphabetic languages like English, Chinese lacks explicit word boundaries, making segmentation both necessary and inherently ambiguous. This study highlights the intricate relationship between word segmentation and syntactic parsing, providing a clearer understanding of how different segmentation strategies shape dependency structures in Chinese. Focusing on the Chinese GSD treebank, we analyze multiple word boundary schemes, each reflecting distinct linguistic and computational assumptions, and examine how they influence the resulting syntactic structures. To support detailed comparison, we introduce an interactive web-based visualization tool that displays parsing outcomes across segmentation methods.

Related papers

Counting trees: A treebank-driven exploration of syntactic variation in speech and writing across languages [0.0]
We define syntactic structures as delexicalized dependency (sub)trees and extract them from spoken and written Universal Dependencies treebanks.<n>For each corpus, we analyze the size, diversity, and distribution of syntactic inventories, their overlap across modalities, and the structures most characteristic of speech.<n>Results show that, across both languages, spoken corpora contain fewer and less diverse syntactic structures than their written counterparts.
arXiv Detail & Related papers (2025-05-28T18:43:26Z)
Visual Prompt Selection for In-Context Learning Segmentation [77.15684360470152]
In this paper, we focus on rethinking and improving the example selection strategy. We first demonstrate that ICL-based segmentation models are sensitive to different contexts. Furthermore, empirical evidence indicates that the diversity of contextual prompts plays a crucial role in guiding segmentation.
arXiv Detail & Related papers (2024-07-14T15:02:54Z)
Character-Level Chinese Dependency Parsing via Modeling Latent Intra-Word Structure [11.184330703168893]
This paper proposes modeling latent internal structures within words in Chinese. A constrained Eisner algorithm is implemented to ensure the compatibility of character-level trees. A detailed analysis reveals that a coarse-to-fine parsing strategy empowers the model to predict more linguistically plausible intra-word structures.
arXiv Detail & Related papers (2024-06-06T06:23:02Z)
A General and Flexible Multi-concept Parsing Framework for Multilingual Semantic Matching [60.51839859852572]
We propose to resolve the text into multi concepts for multilingual semantic matching to liberate the model from the reliance on NER models. We conduct comprehensive experiments on English datasets QQP and MRPC, and Chinese dataset Medical-SM.
arXiv Detail & Related papers (2024-03-05T13:55:16Z)
mCL-NER: Cross-Lingual Named Entity Recognition via Multi-view Contrastive Learning [54.523172171533645]
Cross-lingual named entity recognition (CrossNER) faces challenges stemming from uneven performance due to the scarcity of multilingual corpora. We propose Multi-view Contrastive Learning for Cross-lingual Named Entity Recognition (mCL-NER) Our experiments on the XTREME benchmark, spanning 40 languages, demonstrate the superiority of mCL-NER over prior data-driven and model-based approaches.
arXiv Detail & Related papers (2023-08-17T16:02:29Z)
Hierarchical Open-vocabulary Universal Image Segmentation [48.008887320870244]
Open-vocabulary image segmentation aims to partition an image into semantic regions according to arbitrary text descriptions. We propose a decoupled text-image fusion mechanism and representation learning modules for both "things" and "stuff" Our resulting model, named HIPIE tackles, HIerarchical, oPen-vocabulary, and unIvErsal segmentation tasks within a unified framework.
arXiv Detail & Related papers (2023-07-03T06:02:15Z)
Joint Chinese Word Segmentation and Span-based Constituency Parsing [11.080040070201608]
This work proposes a method for joint Chinese word segmentation and Span-based Constituency Parsing by adding extra labels to individual Chinese characters on the parse trees. Through experiments, the proposed algorithm outperforms the recent models for joint segmentation and constituency parsing on CTB 5.1.
arXiv Detail & Related papers (2022-11-03T08:19:00Z)
Multilingual Extraction and Categorization of Lexical Collocations with Graph-aware Transformers [86.64972552583941]
We put forward a sequence tagging BERT-based model enhanced with a graph-aware transformer architecture, which we evaluate on the task of collocation recognition in context. Our results suggest that explicitly encoding syntactic dependencies in the model architecture is helpful, and provide insights on differences in collocation typification in English, Spanish and French.
arXiv Detail & Related papers (2022-05-23T16:47:37Z)
Joint Chinese Word Segmentation and Part-of-speech Tagging via Two-stage Span Labeling [0.2624902795082451]
We propose a neural model named SpanSegTag for joint Chinese word segmentation and part-of-speech tagging. Our experiments show that our BERT-based model SpanSegTag achieved competitive performances on the CTB5, CTB6, and UD datasets.
arXiv Detail & Related papers (2021-12-17T12:59:02Z)
A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space. We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance. We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ. We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
Augmenting Part-of-speech Tagging with Syntactic Information for Vietnamese and Chinese [0.32228025627337864]
We implement the idea to improve word segmentation and part of speech tagging of the Vietnamese language by employing a simplified constituency. Our neural model for joint word segmentation and part-of-speech tagging has the architecture of the syllable-based constituency. This model can be augmented with predicted word boundary and part-of-speech tags by other tools.
arXiv Detail & Related papers (2021-02-24T08:57:02Z)
End-to-End Chinese Parsing Exploiting Lexicons [15.786281545363448]
We propose an end-to-end Chinese parsing model based on character inputs which jointly learns to output word segmentation, part-of-speech tags and dependency structures. Our parsing model relies on word-char graph attention networks, which can enrich the character inputs with external word knowledge.
arXiv Detail & Related papers (2020-12-08T12:24:36Z)

This list is automatically generated from the titles and abstracts of the papers in this site.