Improving Chinese Segmentation-free Word Embedding With Unsupervised
Association Measure
- URL: http://arxiv.org/abs/2007.02342v1
- Date: Sun, 5 Jul 2020 13:55:19 GMT
- Title: Improving Chinese Segmentation-free Word Embedding With Unsupervised
Association Measure
- Authors: Yifan Zhang, Maohua Wang, Yongjian Huang, Qianrong Gu
- Abstract summary: A segmentation-free word embedding model is proposed that
collects its n-gram vocabulary via a novel unsupervised association measure
called pointwise association with times information (PATI).
The proposed method leverages more latent information from the corpus and is
thus able to collect more valid n-grams with stronger cohesion as embedding
targets in unsegmented language data, such as Chinese text.
- Score: 3.9435648520559177
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work on segmentation-free word embedding (sembei) developed
a new word embedding pipeline for unsegmented languages that avoids
segmentation as a preprocessing step. However, the many noisy n-grams in the
embedding vocabulary, whose characters have weak association strength with one
another, limit the quality of the learned word embeddings. To deal with this
problem, a new version of the segmentation-free word embedding model is
proposed that collects its n-gram vocabulary via a novel unsupervised
association measure called pointwise association with times information
(PATI). Compared with commonly used n-gram filtering criteria, such as the raw
frequency used in sembei and pointwise mutual information (PMI), the proposed
measure leverages more latent information from the corpus and is thus able to
collect more valid n-grams with stronger cohesion as embedding targets in
unsegmented language data, such as Chinese text. Further experiments on
Chinese SNS data show that the proposed model improves the performance of word
embeddings on downstream tasks.
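The abstract does not give the PATI formula itself, so as context the sketch below shows the kind of PMI-based n-gram filtering the paper compares against: candidate character bigrams are scored by pointwise mutual information and only strongly cohesive ones are kept as embedding targets. The function name, example corpus, and threshold are illustrative assumptions, not the paper's code.

```python
import math
from collections import Counter

def pmi_filter_bigrams(corpus: str, threshold: float = 1.0) -> dict[str, float]:
    """Score each character bigram xy with PMI = log p(xy) / (p(x) p(y))
    and keep those above `threshold` as embedding-vocabulary candidates."""
    unigrams = Counter(corpus)                                   # character counts
    bigrams = Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())

    kept = {}
    for xy, count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[xy[0]] / n_uni
        p_y = unigrams[xy[1]] / n_uni
        score = math.log(p_xy / (p_x * p_y))
        if score > threshold:                                    # strong cohesion only
            kept[xy] = score
    return kept

# Example: bigrams that recur as units score high; incidental pairs score low.
vocab = pmi_filter_bigrams("机器学习让机器从数据中学习规律" * 50, threshold=1.0)
```

High-PMI n-grams are the baseline notion of "cohesion" here; PATI, per the abstract, improves on this ranking by drawing on more latent corpus information.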
Related papers
- An Analysis of BPE Vocabulary Trimming in Neural Machine Translation [56.383793805299234]
Vocabulary trimming is a postprocessing step that replaces rare subwords with their component subwords.
We show that vocabulary trimming fails to improve performance and is even prone to incurring heavy degradation.
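As a rough illustration of the trimming step this entry describes (assumed data structures, not the paper's implementation): each merged BPE token records the pair it was built from, tokens rarer than a cutoff are dropped from the vocabulary, and their occurrences are expanded back into in-vocabulary components.

```python
# `merges` maps each merged BPE token to the pair it was built from;
# `freq` gives corpus frequencies. Names and the cutoff are illustrative.
def trim_vocab(vocab: set[str], freq: dict[str, int],
               merges: dict[str, tuple[str, str]], cutoff: int) -> set[str]:
    """Drop merged tokens rarer than `cutoff`; base symbols are always kept."""
    return {tok for tok in vocab
            if freq.get(tok, 0) >= cutoff or tok not in merges}

def retokenize(token: str, vocab: set[str],
               merges: dict[str, tuple[str, str]]) -> list[str]:
    """Recursively expand a trimmed token into in-vocabulary components."""
    if token in vocab:
        return [token]
    left, right = merges[token]          # the pair this token was merged from
    return retokenize(left, vocab, merges) + retokenize(right, vocab, merges)
```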
arXiv Detail & Related papers (2024-03-30T15:29:49Z)
- CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models [77.45934004406283]
We systematically study decompounding, the task of splitting compound words into their constituents.
We introduce a dataset of 255k compound and non-compound words across 56 diverse languages obtained from Wiktionary.
We introduce a novel methodology to train dedicated models for decompounding.
arXiv Detail & Related papers (2023-05-23T16:32:27Z)
- Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation [80.48979302400868]
We focus on open vocabulary instance segmentation to expand a segmentation model to classify and segment instance-level novel categories.
Previous approaches have relied on massive caption datasets and complex pipelines to establish one-to-one mappings between image regions and captions in nouns.
We devise a joint Caption Grounding and Generation (CGG) framework that incorporates a novel grounding loss focusing only on matching objects to improve learning efficiency.
arXiv Detail & Related papers (2023-01-02T18:52:12Z)
- Always Keep your Target in Mind: Studying Semantics and Improving Performance of Neural Lexical Substitution [124.99894592871385]
We present a large-scale comparative study of lexical substitution methods employing both old and most recent language models.
We show that already competitive results achieved by SOTA LMs/MLMs can be further substantially improved if information about the target word is injected properly.
arXiv Detail & Related papers (2022-06-07T16:16:19Z)
- NFLAT: Non-Flat-Lattice Transformer for Chinese Named Entity Recognition [39.308634515653914]
We advocate a novel lexical enhancement method, InterFormer, that effectively reduces computational and memory costs.
Compared with FLAT, it avoids unnecessary attention computation over "word-character" and "word-word" pairs.
This reduces memory usage by about 50% and permits more extensive lexicons or larger batches for network training.
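A rough sketch of that idea (shapes and names are assumptions, not the paper's exact InterFormer architecture): only character positions act as attention queries, while characters and matched lexicon words both serve as keys and values, so word-to-character and word-to-word attention is never computed.

```python
import numpy as np

def inter_attention(char_q, char_k, word_k, char_v, word_v):
    """char_q: (Lc, d) character queries; keys/values concatenate the
    character sequence (Lc, d) with matched lexicon words (Lw, d)."""
    keys = np.concatenate([char_k, word_k], axis=0)      # (Lc + Lw, d)
    values = np.concatenate([char_v, word_v], axis=0)    # (Lc + Lw, d)
    scores = char_q @ keys.T / np.sqrt(char_q.shape[-1]) # (Lc, Lc + Lw)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ values                              # (Lc, d): outputs for characters only
```

Because words never appear on the query side, the score matrix is (Lc, Lc + Lw) rather than the full (Lc + Lw, Lc + Lw), which is where the memory saving comes from.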
arXiv Detail & Related papers (2022-05-12T01:55:37Z)
- Joint Chinese Word Segmentation and Part-of-speech Tagging via Two-stage Span Labeling [0.2624902795082451]
We propose a neural model named SpanSegTag for joint Chinese word segmentation and part-of-speech tagging.
Our experiments show that our BERT-based SpanSegTag model achieves competitive performance on the CTB5, CTB6, and UD datasets.
arXiv Detail & Related papers (2021-12-17T12:59:02Z)
- Hierarchical Heterogeneous Graph Representation Learning for Short Text Classification [60.233529926965836]
We propose a new method called SHINE, based on graph neural networks (GNNs), for short text classification.
First, we model the short text dataset as a hierarchical heterogeneous graph consisting of word-level component graphs.
Then, we dynamically learn a short document graph that facilitates effective label propagation among similar short texts.
arXiv Detail & Related papers (2021-10-30T05:33:05Z)
- Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration [25.159601117722936]
We propose a contrastive fine-tuning objective that enables BERT to produce more powerful phrase embeddings.
Our approach relies on a dataset of diverse phrasal paraphrases, which is automatically generated using a paraphrase generation model.
As a case study, we show that Phrase-BERT embeddings can be easily integrated with a simple autoencoder to build a phrase-based neural topic model.
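A generic in-batch contrastive objective of the kind this entry describes (illustrative only, not Phrase-BERT's exact training recipe): each phrase and its generated paraphrase form a positive pair, and all other in-batch pairings serve as negatives.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchors: torch.Tensor, positives: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """anchors, positives: (B, d) phrase embeddings from the same encoder;
    row i of `positives` is the paraphrase of row i of `anchors`."""
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.T / temperature          # (B, B) scaled cosine similarities
    labels = torch.arange(a.size(0))        # the positive for row i is column i
    return F.cross_entropy(logits, labels)
```

Fine-tuning BERT's phrase representations under such a loss pulls paraphrases together and pushes unrelated phrases apart, which is what makes the embeddings useful for downstream clustering and topic modeling.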
arXiv Detail & Related papers (2021-09-13T20:31:57Z)
- More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens produce topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
- Enhancing Sindhi Word Segmentation using Subword Representation Learning and Position-aware Self-attention [19.520840812910357]
Sindhi word segmentation is a challenging task due to space omission and insertion issues.
Existing Sindhi word segmentation methods rely on designing and combining hand-crafted features.
We propose a Subword-Guided Neural Word Segmenter (SGNWS) that addresses word segmentation as a sequence labeling task.
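For context, the sequence-labeling formulation mentioned above typically tags each character B/I/E/S (begin/inside/end/single word), so a tagger's per-character predictions induce a segmentation. A minimal helper for producing such training labels is sketched below; the SGNWS model itself is not reproduced here.

```python
def words_to_bies(words: list[str]) -> list[tuple[str, str]]:
    """Map a segmented sentence to (character, tag) training pairs."""
    tagged = []
    for w in words:
        if len(w) == 1:
            tagged.append((w, "S"))                   # single-character word
        else:
            tagged.append((w[0], "B"))                # word-initial character
            tagged += [(c, "I") for c in w[1:-1]]     # word-internal characters
            tagged.append((w[-1], "E"))               # word-final character
    return tagged

# Example: words_to_bies(["word", "segmentation"]) tags every character.
```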
arXiv Detail & Related papers (2020-12-30T08:31:31Z)