Related papers: More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models

URL: http://arxiv.org/abs/2108.10755v1
Date: Tue, 24 Aug 2021 14:08:19 GMT
Title: More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models
Authors: Jin Cheevaprawatdomrong, Alexandra Schofield, Attapol T. Rutherford
Abstract summary: We propose a new metric for measuring the clustering quality in settings where the vocabularies of the models differ. We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
Score: 71.42030830910227
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Traditionally, Latent Dirichlet Allocation (LDA) ingests words in a collection of documents to discover their latent topics using word-document co-occurrences. However, it is unclear how to achieve the best results for languages without marked word boundaries such as Chinese and Thai. Here, we explore the use of Pearson's chi-squared test, t-statistics, and Word Pair Encoding (WPE) to produce tokens as input to the LDA model. The chi-squared, t, and WPE tokenizers are trained on Wikipedia text to look for words that should be grouped together, such as compound nouns, proper nouns, and complex event verbs. We propose a new metric for measuring the clustering quality in settings where the vocabularies of the models differ. Based on this metric and other established metrics, we show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
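
The chi-squared and t-statistic scoring mentioned in the abstract can be illustrated with a minimal sketch. This is the textbook formulation of both association measures over adjacent word pairs, not necessarily the authors' exact procedure; WPE is not shown. The function names `bigram_scores` and `merge_collocations`, the underscore joiner, and the default threshold (3.841, the 5% critical value of chi-squared with one degree of freedom) are assumptions made for illustration.

```python
import math
from collections import Counter


def bigram_scores(tokens):
    """Return {(w1, w2): (chi2, t)} for every adjacent pair in `tokens`."""
    n = len(tokens) - 1                      # total number of bigrams
    first = Counter(tokens[:-1])             # counts of w as left element
    second = Counter(tokens[1:])             # counts of w as right element
    pairs = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), o11 in pairs.items():
        c1, c2 = first[w1], second[w2]
        # 2x2 contingency table: w1 / not-w1 against w2 / not-w2
        o12 = c1 - o11
        o21 = c2 - o11
        o22 = n - c1 - c2 + o11
        denom = c1 * c2 * (n - c1) * (n - c2)
        chi2 = n * (o11 * o22 - o12 * o21) ** 2 / denom if denom else 0.0
        # t-statistic: observed bigram rate vs. rate expected under
        # independence, with the sample variance approximated by the rate
        x_bar = o11 / n
        mu = (c1 / n) * (c2 / n)
        t = (x_bar - mu) / math.sqrt(x_bar / n)
        scores[(w1, w2)] = (chi2, t)
    return scores


def merge_collocations(tokens, scores, chi2_min=3.841):
    """Greedy left-to-right merge of pairs that pass the chi-squared threshold.

    A positive t is also required so that only pairs occurring MORE often
    than expected are merged (chi-squared alone is two-sided).
    """
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        chi2, t = scores.get(pair, (0.0, 0.0))
        if len(pair) == 2 and chi2 >= chi2_min and t > 0:
            merged.append("_".join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged
```

In a pipeline like the one the abstract describes, the scores would be estimated on a large reference corpus (the paper uses Wikipedia text) and the merged token stream would then be passed to the LDA model in place of the original words.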

Related papers

Comparative analysis of subword tokenization approaches for Indian languages [5.012314384895538]
Tokenization is the act of breaking down text into smaller parts, or tokens, that are easier for machines to process. Subword tokenization enhances this process by breaking down words into smaller subword units. It is useful in capturing the intricate structure of words in Indian languages (ILs), such as prefixes, suffixes, and other morphological variations. This paper examines how different subword tokenization techniques, such as SentencePiece, Byte Pair, and WordPiece Tokenization, affect ILs.
arXiv Detail & Related papers (2025-05-22T16:24:37Z)
Tomato, Tomahto, Tomate: Measuring the Role of Shared Semantics among Subwords in Multilingual Language Models [88.07940818022468]
We take an initial step toward measuring the role of shared semantics among subwords in encoder-only multilingual language models (mLMs). We form "semantic tokens" by merging semantically similar subwords and their embeddings. Inspections of the grouped subwords show that they exhibit a wide range of semantic similarities.
arXiv Detail & Related papers (2024-11-07T08:38:32Z)
A General and Flexible Multi-concept Parsing Framework for Multilingual Semantic Matching [60.51839859852572]
We propose to resolve the text into multiple concepts for multilingual semantic matching to liberate the model from the reliance on NER models. We conduct comprehensive experiments on the English datasets QQP and MRPC, and the Chinese dataset Medical-SM.
arXiv Detail & Related papers (2024-03-05T13:55:16Z)
CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models [77.45934004406283]
We systematically study decompounding, the task of splitting compound words into their constituents. We introduce a dataset of 255k compound and non-compound words across 56 diverse languages obtained from Wiktionary. We introduce a novel methodology to train dedicated models for decompounding.
arXiv Detail & Related papers (2023-05-23T16:32:27Z)
Exploiting Word Semantics to Enrich Character Representations of Chinese Pre-trained Models [12.0190584907439]
We propose a new method to exploit word structure and integrate lexical semantics into character representations of pre-trained models. We show that our approach achieves superior performance over the basic pre-trained models BERT, BERT-wwm and ERNIE on different Chinese NLP tasks.
arXiv Detail & Related papers (2022-07-13T02:28:08Z)
Impact of Tokenization on Language Models: An Analysis for Turkish [2.4660652494309936]
We train tokenizers and pretrain medium-sized language models using the RoBERTa pretraining procedure on the Turkish split of the OSCAR corpus. Our experiments, supported by statistical tests, reveal that the Morphological-level tokenizer is competitive with de facto tokenizers. We find that increasing the vocabulary size improves the performance of Morphological and Word-level tokenizers more than that of de facto tokenizers.
arXiv Detail & Related papers (2022-04-19T12:01:46Z)
Multilingual Autoregressive Entity Linking [49.35994386221958]
mGENRE is a sequence-to-sequence system for the Multilingual Entity Linking problem. For a mention in a given language, mGENRE predicts the name of the target entity left-to-right, token-by-token. We show the efficacy of our approach through extensive evaluation including experiments on three popular MEL benchmarks.
arXiv Detail & Related papers (2021-03-23T13:25:55Z)
Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus. Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
Improving Chinese Segmentation-free Word Embedding With Unsupervised Association Measure [3.9435648520559177]
A segmentation-free word embedding model is proposed that collects an n-gram vocabulary via a novel unsupervised association measure called pointwise association with times information (PATI). The proposed method leverages more latent information from the corpus and is thus able to collect more valid n-grams with stronger cohesion as embedding targets in unsegmented language data, such as Chinese texts.
arXiv Detail & Related papers (2020-07-05T13:55:19Z)
Language-Independent Tokenisation Rivals Language-Specific Tokenisation for Word Similarity Prediction [12.376752724719005]
Language-independent tokenisation (LIT) methods do not require labelled language resources or lexicons. Language-specific tokenisation (LST) methods have a long and established history, and are developed using carefully created lexicons and training resources. We empirically compare the two approaches using semantic similarity measurement as an evaluation task across a diverse set of languages.
arXiv Detail & Related papers (2020-02-25T16:24:42Z)

This list is automatically generated from the titles and abstracts of the papers on this site.