A Masked Segmental Language Model for Unsupervised Natural Language
Segmentation
- URL: http://arxiv.org/abs/2104.07829v1
- Date: Fri, 16 Apr 2021 00:00:05 GMT
- Title: A Masked Segmental Language Model for Unsupervised Natural Language
Segmentation
- Authors: C.M. Downey, Fei Xia, Gina-Anne Levow, Shane Steinert-Threlkeld
- Abstract summary: We introduce a Masked Segmental Language Model (MSLM) built on a span-masking transformer architecture.
In a series of experiments, our model consistently outperforms Recurrent SLMs on Chinese.
We conclude by discussing the different challenges posed in segmenting phonemic-type writing systems.
- Score: 12.6839867674222
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Segmentation remains an important preprocessing step both in languages where
"words" or other important syntactic/semantic units (like morphemes) are not
clearly delineated by white space, and when dealing with continuous speech
data, where there is often no meaningful pause between words.
Near-perfect supervised methods have been developed for use in resource-rich
languages such as Chinese, but many of the world's languages are both
morphologically complex, and have no large dataset of "gold" segmentations into
meaningful units. To solve this problem, we propose a new type of Segmental
Language Model (Sun and Deng, 2018; Kawakami et al., 2019; Wang et al., 2021)
for use in both unsupervised and lightly supervised segmentation tasks. We
introduce a Masked Segmental Language Model (MSLM) built on a span-masking
transformer architecture, harnessing the power of a bi-directional masked
modeling context and attention. In a series of experiments, our model
consistently outperforms Recurrent SLMs on Chinese (PKU Corpus) in segmentation
quality, and performs similarly to the Recurrent model on English (PTB). We
conclude by discussing the different challenges posed in segmenting
phonemic-type writing systems.
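The core idea can be pictured with a minimal sketch. This is an illustration, not the authors' implementation: it assumes a small transformer encoder that reconstructs the characters of a masked span from bidirectional context, plus a Viterbi-style dynamic program (a stand-in for the model's actual segmentation objective, which marginalizes over segmentations) that picks the segmentation whose spans reconstruct with the highest total log-probability. All names, hyperparameters, and the decoding rule below are illustrative assumptions.

```python
import math

import torch
import torch.nn as nn


class SpanMaskingScorer(nn.Module):
    """Bidirectional encoder that scores the characters of a masked span."""

    def __init__(self, vocab_size, d_model=128, max_len=512, mask_id=0):
        super().__init__()
        self.mask_id = mask_id  # id 0 is reserved for [MASK] in this sketch
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, vocab_size)

    def span_log_prob(self, ids, start, end):
        """Log-probability of ids[start:end] when that span is masked out."""
        masked = ids.clone()
        masked[start:end] = self.mask_id
        positions = torch.arange(ids.size(0))
        h = self.encoder((self.embed(masked) + self.pos(positions)).unsqueeze(0))
        logp = torch.log_softmax(self.out(h[0, start:end]), dim=-1)
        return logp.gather(1, ids[start:end, None]).sum().item()


def segment(model, ids, max_seg_len=4):
    """Pick segment boundaries with a Viterbi-style DP over span scores."""
    model.eval()
    n = ids.size(0)
    best = [(-math.inf, -1)] * (n + 1)  # best[i] = (score, backpointer) for prefix i
    best[0] = (0.0, -1)
    with torch.no_grad():
        for end in range(1, n + 1):
            for start in range(max(0, end - max_seg_len), end):
                score = best[start][0] + model.span_log_prob(ids, start, end)
                if score > best[end][0]:
                    best[end] = (score, start)
    bounds, pos = [], n
    while pos > 0:  # trace back the end index of every chosen segment
        bounds.append(pos)
        pos = best[pos][1]
    return sorted(bounds)


# Hypothetical usage with made-up character ids (an untrained model will
# segment arbitrarily; this only demonstrates the interface):
# model = SpanMaskingScorer(vocab_size=100)
# print(segment(model, torch.tensor([5, 6, 7, 8, 9], dtype=torch.long)))
```

In the paper's setting, span scores come from the trained segmental model and the dynamic program also serves marginal-likelihood training; the Viterbi-style sketch above only shows how per-span scores compose into a segmentation.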
Related papers
- Evaluating Shortest Edit Script Methods for Contextual Lemmatization [6.0158981171030685]
Modern contextual lemmatizers often rely on automatically induced Shortest Edit Scripts (SES) to transform a word form into its lemma.
Previous work has not investigated the direct impact of SES on final lemmatization performance.
We show that computing the casing and edit operations separately is beneficial overall, but much more clearly so for languages with highly inflected morphology (a minimal edit-script sketch follows below).
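As a concrete illustration of the SES idea (a sketch under our own assumptions, with difflib standing in for whichever induction algorithm the paper actually evaluates):

```python
from difflib import SequenceMatcher


def induce_ses(form, lemma):
    """Edit script (op, start, end, replacement) turning `form` into `lemma`."""
    ops = SequenceMatcher(a=form, b=lemma).get_opcodes()
    return [(op, i1, i2, lemma[j1:j2]) for op, i1, i2, j1, j2 in ops if op != "equal"]


def apply_ses(form, script):
    """Replay a script; edits run right-to-left so earlier indices stay valid."""
    out = form
    for _op, i1, i2, repl in sorted(script, key=lambda e: e[1], reverse=True):
        out = out[:i1] + repl + out[i2:]
    return out


print(induce_ses("running", "run"))                        # [('delete', 3, 7, '')]
print(apply_ses("running", induce_ses("running", "run")))  # run
```

In a contextual lemmatizer, a classifier would predict such a script per token in context, and the script would then be replayed on the surface form; casing can be handled as a separate operation.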
arXiv Detail & Related papers (2024-03-25T17:28:24Z)
- Universal Segmentation at Arbitrary Granularity with Language Instruction [59.76130089644841]
We present UniLSeg, a universal segmentation model that can perform segmentation at any semantic level with the guidance of language instructions.
For training UniLSeg, we reorganize a group of tasks from their originally diverse distributions into a unified data format, where images paired with texts describing the segmentation targets serve as input and the corresponding masks as output.
arXiv Detail & Related papers (2023-12-04T04:47:48Z)
- LISA: Reasoning Segmentation via Large Language Model [68.24075852136761]
We propose a new segmentation task -- reasoning segmentation.
The task is designed to output a segmentation mask given a complex and implicit query text.
We present LISA: large Language Instructed Segmentation Assistant, which inherits the language generation capabilities of multimodal Large Language Models.
arXiv Detail & Related papers (2023-08-01T17:50:17Z)
- Mitigating Data Imbalance and Representation Degeneration in Multilingual Machine Translation [103.90963418039473]
Bi-ACL is a framework that uses only target-side monolingual data and a bilingual dictionary to improve the performance of the MNMT model.
We show that Bi-ACL is more effective both in long-tail languages and in high-resource languages.
arXiv Detail & Related papers (2023-05-22T07:31:08Z)
- Subword Segmental Language Modelling for Nguni Languages [7.252933737829635]
The subword segmental language model (SSLM) learns how to segment words while being trained for autoregressive language modelling.
We train our model on the 4 Nguni languages of South Africa.
Our results show that learning subword segmentation is an effective alternative to existing subword segmenters.
arXiv Detail & Related papers (2022-10-12T18:41:00Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM built on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Canonical and Surface Morphological Segmentation for Nguni Languages [6.805575417034369]
This paper investigates supervised and unsupervised models for morphological segmentation.
We train sequence-to-sequence models for canonical segmentation and Conditional Random Fields (CRFs) for surface segmentation (a rough sketch of the surface-tagging setup follows this entry).
Transformers outperform LSTMs with attention on canonical segmentation, obtaining an average F1 score of 72.5% across 4 languages.
We hope that the high performance of the supervised segmentation models will help to facilitate the development of better NLP tools for Nguni languages.
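A hedged sketch of what surface segmentation as character tagging can look like with a linear-chain CRF; sklearn-crfsuite, the feature templates, hyperparameters, and the toy Nguni-style examples are our illustrative choices, not the paper's configuration.

```python
import sklearn_crfsuite  # third-party: pip install sklearn-crfsuite


def char_features(word, i):
    """Simple character-context features for position i."""
    return {
        "char": word[i],
        "prev": word[i - 1] if i > 0 else "<s>",
        "next": word[i + 1] if i + 1 < len(word) else "</s>",
        "bigram": word[max(0, i - 1):i + 1],
    }


def to_bmes(segments):
    """Gold BMES labels from a surface-segmented word."""
    labels = []
    for seg in segments:
        labels += ["S"] if len(seg) == 1 else ["B"] + ["M"] * (len(seg) - 2) + ["E"]
    return labels


# Toy, made-up training pairs (surface word, surface segments); real experiments
# would use the annotated Nguni corpora described in the paper.
train = [("ziyakwazi", ["zi", "ya", "kwazi"]), ("ngiyabonga", ["ngi", "ya", "bonga"])]
X = [[char_features(w, i) for i in range(len(w))] for w, _ in train]
y = [to_bmes(segs) for _, segs in train]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X, y)
word = "ziyabonga"
print(crf.predict([[char_features(word, i) for i in range(len(word))]]))
```

Canonical segmentation, which may rewrite characters rather than just insert boundaries, is the part the paper handles with sequence-to-sequence models instead.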
arXiv Detail & Related papers (2021-04-01T21:06:51Z)
- Improving the Lexical Ability of Pretrained Language Models for Unsupervised Neural Machine Translation [127.81351683335143]
Cross-lingual pretraining requires models to align the lexical- and high-level representations of the two languages; previous research has shown that these representations are often not sufficiently aligned.
In this paper, we enhance bilingual masked language model pretraining with lexical-level information by using type-level cross-lingual subword embeddings (a minimal initialization sketch follows below).
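A minimal sketch of the initialization idea as we read it (an assumption about the mechanism, not the paper's code): copy type-level cross-lingual subword vectors into the masked language model's embedding table before pretraining, so the two languages start out lexically aligned. The vector source and vocabulary below are placeholders.

```python
import torch
import torch.nn as nn


def init_from_crosslingual(embedding, vocab, vectors):
    """Overwrite embedding rows for subwords that have a pretrained aligned vector.

    `vocab` maps subword -> row index; `vectors` maps subword -> tensor living in
    a shared cross-lingual space (both are placeholders in this sketch).
    """
    hits = 0
    with torch.no_grad():
        for subword, idx in vocab.items():
            vec = vectors.get(subword)
            if vec is not None and vec.numel() == embedding.embedding_dim:
                embedding.weight[idx] = vec
                hits += 1
    return hits  # number of rows warm-started from the aligned space


# Hypothetical usage: one shared subword vocabulary for both languages, with the
# MLM's embedding layer warm-started before pretraining begins.
vocab = {"##ing": 0, "##ung": 1, "haus": 2, "house": 3}
vectors = {"haus": torch.ones(16), "house": torch.ones(16)}
emb = nn.Embedding(len(vocab), 16)
print(init_from_crosslingual(emb, vocab, vectors), "rows initialized")
```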
arXiv Detail & Related papers (2021-03-18T21:17:58Z)
- BURT: BERT-inspired Universal Representation from Learning Meaningful Segment [46.51685959045527]
This work introduces and explores universal representation learning, i.e., embedding different levels of linguistic units in a uniform vector space.
We present a universal representation model, BURT, to encode different levels of linguistic units into the same vector space.
Specifically, we extract and mask meaningful segments based on point-wise mutual information (PMI) to incorporate objectives at different granularities into the pre-training stage (a rough PMI-masking sketch follows below).
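A rough sketch of PMI-driven segment masking as we understand it (an assumption, not BURT's released code): score adjacent-token pairs by point-wise mutual information and mask high-PMI pairs as whole units instead of masking single random tokens. The threshold, tokenization, and corpus below are illustrative.

```python
import math
from collections import Counter


def pmi_table(corpus, min_count=2):
    """PMI(x, y) = log [ p(x, y) / (p(x) p(y)) ] over adjacent-token pairs."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    return {
        (x, y): math.log((c / n_bi) / ((unigrams[x] / n_uni) * (unigrams[y] / n_uni)))
        for (x, y), c in bigrams.items()
        if c >= min_count
    }


def mask_segments(sent, pmi, threshold=1.0, mask_token="[MASK]"):
    """Mask both tokens of a high-PMI bigram as one unit; keep other tokens."""
    out, i = [], 0
    while i < len(sent):
        if i + 1 < len(sent) and pmi.get((sent[i], sent[i + 1]), float("-inf")) >= threshold:
            out += [mask_token, mask_token]
            i += 2
        else:
            out.append(sent[i])
            i += 1
    return out


# Toy corpus: "new york" comes out as the only high-PMI segment.
corpus = [["she", "visited", "new", "york"], ["new", "york", "is", "large"],
          ["she", "is", "new", "here"]]
pmi = pmi_table(corpus)
print(mask_segments(["she", "visited", "new", "york"], pmi))
# -> ['she', 'visited', '[MASK]', '[MASK]']
```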
arXiv Detail & Related papers (2020-12-28T16:02:28Z)
- UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training [152.63467944568094]
We propose to pre-train a unified language model for both autoencoding and partially autoregressive language modeling tasks.
Our experiments show that the unified language models pre-trained using PMLM achieve new state-of-the-art results on a wide range of natural language understanding and generation tasks.
arXiv Detail & Related papers (2020-02-28T15:28:49Z)