A Subword Guided Neural Word Segmentation Model for Sindhi
- URL: http://arxiv.org/abs/2012.15079v1
- Date: Wed, 30 Dec 2020 08:31:31 GMT
- Title: A Subword Guided Neural Word Segmentation Model for Sindhi
- Authors: Wazir Ali, Jay Kumar, Zenglin Xu, Congjian Luo, Junyu Lu, Junming
Shao, Rajesh Kumar, and Yazhou Ren
- Abstract summary: Deep neural networks employ multiple processing layers for learning text representations to alleviate the burden of manual feature engineering in Natural Language Processing (NLP).
In this paper, we investigate supervised Sindhi Word Segmentation (SWS) using unlabeled data with a Subword Guided Neural Word Segmenter (SGNWS) for Sindhi.
Our proposed SGNWS model achieves an F1 value of 98.51% without relying on feature engineering.
- Score: 24.489385463262188
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep neural networks employ multiple processing layers for learning text
representations to alleviate the burden of manual feature engineering in
Natural Language Processing (NLP). Such text representations are widely used to
extract features from unlabeled data. Word segmentation is a fundamental and
unavoidable prerequisite for many languages. Sindhi is an under-resourced
language whose segmentation is challenging: it exhibits both space-omission and
space-insertion issues and lacks a labeled corpus for segmentation. In this
paper, we investigate supervised Sindhi Word Segmentation (SWS) using unlabeled
data with a Subword Guided Neural Word Segmenter (SGNWS) for Sindhi. To learn
text representations, we incorporate subword representations into a recurrent
neural architecture that captures word information at the morphemic level,
taking advantage of Bidirectional Long Short-Term Memory (BiLSTM), a
self-attention mechanism, and a Conditional Random Field (CRF). Our proposed
SGNWS model achieves an F1 value of 98.51% without relying on feature
engineering. The empirical results demonstrate the benefits of the proposed
model over existing Sindhi word segmenters.
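
As a rough illustration of the pipeline the abstract describes, the following is a minimal PyTorch sketch, not the authors' released code, of a subword-level segmenter that stacks BiLSTM, self-attention, and a CRF. It assumes subwords are already mapped to integer ids, that segmentation is cast as B/I/E/S boundary tagging (an assumption on our part), and that the CRF layer comes from the third-party pytorch-crf package; all hyperparameters are placeholders.

```python
# A minimal sketch (not the authors' released code) of an SGNWS-style
# architecture: subword embeddings -> BiLSTM -> self-attention -> CRF.
# Assumptions: PyTorch, the third-party pytorch-crf package (torchcrf),
# and segmentation cast as B/I/E/S boundary tagging over subword units.
import torch
import torch.nn as nn
from torchcrf import CRF

class SubwordGuidedSegmenter(nn.Module):
    def __init__(self, subword_vocab=30_000, emb_dim=64, hidden=128, num_tags=4):
        super().__init__()
        self.embed = nn.Embedding(subword_vocab, emb_dim, padding_idx=0)
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                              bidirectional=True)
        # Single-head self-attention over the BiLSTM states (an assumption;
        # the summary does not specify these hyperparameters).
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=1,
                                          batch_first=True)
        self.proj = nn.Linear(2 * hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, subword_ids, tags=None, mask=None):
        x = self.embed(subword_ids)                    # (B, T, emb_dim)
        h, _ = self.bilstm(x)                          # (B, T, 2*hidden)
        pad = ~mask if mask is not None else None      # True marks padding
        a, _ = self.attn(h, h, h, key_padding_mask=pad)
        emissions = self.proj(a)                       # per-tag scores
        if tags is not None:                           # training: CRF NLL
            return -self.crf(emissions, tags, mask=mask)
        return self.crf.decode(emissions, mask=mask)   # inference: Viterbi path
```

The CRF on top is what makes the tag sequence coherent: it scores transitions between boundary tags jointly, so an illegal sequence (e.g. an end tag immediately after another end tag) is penalized across the whole sentence rather than token by token, which is helpful for the space-omission and space-insertion errors the abstract highlights.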
Related papers
- Segment and Caption Anything [126.20201216616137]
We propose a method to efficiently equip the Segment Anything Model with the ability to generate regional captions.
By introducing a lightweight query-based feature mixer, we align the region-specific features with the embedding space of language models for later caption generation.
We conduct extensive experiments to demonstrate the superiority of our method and validate each design choice.
arXiv Detail & Related papers (2023-12-01T19:00:17Z)
- LISA: Reasoning Segmentation via Large Language Model [68.24075852136761]
We propose a new segmentation task -- reasoning segmentation.
The task is designed to output a segmentation mask given a complex and implicit query text.
We present LISA: Large Language Instructed Assistant, which inherits the language generation capabilities of multimodal Large Language Models.
arXiv Detail & Related papers (2023-08-01T17:50:17Z)
- Subword Segmental Language Modelling for Nguni Languages [7.252933737829635]
A subword segmental language model (SSLM) learns how to segment words while being trained for autoregressive language modelling.
We train our model on the four Nguni languages of South Africa.
Our results show that learning subword segmentation is an effective alternative to existing subword segmenters.
arXiv Detail & Related papers (2022-10-12T18:41:00Z)
- Part-of-Speech Tagging of Odia Language Using Statistical and Deep Learning-Based Approaches [0.0]
This work presents conditional random field (CRF) and deep learning-based approaches (CNN and Bi-LSTM) to develop an Odia part-of-speech tagger.
The Bi-LSTM model with character-sequence features and pre-trained word vectors achieved state-of-the-art results.
arXiv Detail & Related papers (2022-07-07T12:15:23Z)
- Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP [22.772546707304766]
We show how hybrid approaches of words and characters as well as subword-based approaches based on learned segmentation have been proposed and evaluated.
We conclude that there is no silver-bullet singular solution for all applications, and likely never will be.
arXiv Detail & Related papers (2021-12-20T13:04:18Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Intrinsic Probing through Dimension Selection [69.52439198455438]
Most modern NLP systems make use of pre-trained contextual representations that attain astonishingly high performance on a variety of tasks.
Such high performance should not be possible unless some form of linguistic structure inheres in these representations, and a wealth of research has sprung up on probing for it.
In this paper, we draw a distinction between intrinsic probing, which examines how linguistic information is structured within a representation, and the extrinsic probing popular in prior work, which only argues for the presence of such information by showing that it can be successfully extracted.
arXiv Detail & Related papers (2020-10-06T15:21:08Z)
- Improving Chinese Segmentation-free Word Embedding With Unsupervised Association Measure [3.9435648520559177]
A segmentation-free word embedding model is proposed that collects an n-gram vocabulary via a novel unsupervised association measure called pointwise association with times information (PATI).
The proposed method leverages more latent information from the corpus and thus is able to collect more valid n-grams that have stronger cohesion as embedding targets in unsegmented language data, such as Chinese texts.
arXiv Detail & Related papers (2020-07-05T13:55:19Z)
- Neural Data-to-Text Generation via Jointly Learning the Segmentation and Correspondence [48.765579605145454]
We propose to explicitly segment target text into fragment units and align them with their data correspondences.
The resulting architecture maintains the same expressive power as neural attention models.
On both E2E and WebNLG benchmarks, we show the proposed model consistently outperforms its neural attention counterparts.
arXiv Detail & Related papers (2020-05-03T14:28:28Z)
- BURT: BERT-inspired Universal Representation from Twin Structure [89.82415322763475]
BURT (BERT inspired Universal Representation from Twin Structure) is capable of generating universal, fixed-size representations for input sequences of any granularity.
Our proposed BURT adopts a Siamese network, learning sentence-level representations from a natural language inference dataset and word/phrase-level representations from a paraphrasing dataset.
We evaluate BURT across different granularities of text similarity tasks, including STS tasks, SemEval2013 Task 5(a) and some commonly used word similarity tasks.
arXiv Detail & Related papers (2020-04-29T04:01:52Z)
- Incorporating Uncertain Segmentation Information into Chinese NER for Social Media Text [18.455836845989523]
Segmentation error propagation is a challenge for Chinese named entity recognition systems.
We propose a model (UIcwsNN) that specializes in identifying entities from Chinese social media text.
arXiv Detail & Related papers (2020-04-14T09:39:35Z)