CharSS: Character-Level Transformer Model for Sanskrit Word Segmentation
- URL: http://arxiv.org/abs/2407.06331v1
- Date: Mon, 8 Jul 2024 18:50:13 GMT
- Title: CharSS: Character-Level Transformer Model for Sanskrit Word Segmentation
- Authors: Krishnakant Bhatt, Karthika N J, Ganesh Ramakrishnan, Preethi Jyothi
- Abstract summary: Subword tokens in Indian languages inherently carry meaning, and isolating them can enhance NLP tasks.
We propose a new approach that utilizes a character-level Transformer model for Sanskrit Word Segmentation (CharSS).
We perform experiments on three benchmark datasets to compare the performance of our method against existing methods.
- Score: 39.08623113730563
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Subword tokens in Indian languages inherently carry meaning, and isolating them can enhance NLP tasks, making subword segmentation a crucial process. Segmenting Sanskrit and other Indian languages into subtokens is not straightforward, as it may involve sandhi, which can change word boundaries. We propose a new approach that utilizes a character-level Transformer model for Sanskrit Word Segmentation (CharSS). We perform experiments on three benchmark datasets to compare the performance of our method against existing methods. On the UoH+SandhiKosh dataset, our method outperforms the current state-of-the-art system by an absolute gain of 6.72 points in split prediction accuracy. On the hackathon dataset, our method achieves a gain of 2.27 points over the current SOTA system in terms of the perfect match metric. We also propose a use-case of Sanskrit-based segments for linguistically informed translation of technical terms to lexically similar low-resource Indian languages. In two separate experimental settings for this task, we achieve average improvements of 8.46 and 6.79 chrF++ scores, respectively.
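The abstract frames segmentation as character-level transduction: the model reads the sandhied character sequence and emits the segmented one. Below is a minimal, hedged sketch of only that data framing, usable with any character-level seq2seq model; the example pair, vocabulary handling, and all names are illustrative and not taken from the paper.

```python
# Sketch: framing Sanskrit word segmentation as character-level
# sequence-to-sequence transduction. Only the data framing is shown;
# any character-level Transformer can consume these id sequences.
# The example pair is illustrative: external sandhi turns
# "tat" + "api" into "tadapi" (final -t voices before a vowel), so
# the target restores the boundary that sandhi erased.
pairs = [
    ("tadapi", "tat api"),
]

# Character vocabulary with special tokens, built from both sides
# (the space character in targets marks a recovered word boundary).
specials = ["<pad>", "<bos>", "<eos>"]
chars = sorted({c for src, tgt in pairs for c in src + tgt})
vocab = {tok: i for i, tok in enumerate(specials + chars)}

def encode(text: str) -> list[int]:
    """Map a string to character ids wrapped in <bos>/<eos>."""
    return [vocab["<bos>"]] + [vocab[c] for c in text] + [vocab["<eos>"]]

for src, tgt in pairs:
    print("source ids:", encode(src))  # unsegmented surface form
    print("target ids:", encode(tgt))  # segmented form with spaces
```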
Related papers
- Lexically Grounded Subword Segmentation [0.0]
We present three innovations in tokenization and subword segmentation.
First, we propose to use unsupervised morphological analysis with Morfessor as pre-tokenization.
Second, we present a method for obtaining subword embeddings grounded in a word embedding space.
Third, we introduce an efficient segmentation algorithm based on a subword bigram model; a sketch of such a dynamic program follows this entry.
arXiv Detail & Related papers (2024-06-19T13:48:19Z)
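Segmentation with a subword bigram model, as in the third point above, is naturally a Viterbi-style dynamic program. The following is a generic sketch under assumed log-probability tables; the inventory and scores are toy values, not the paper's implementation.

```python
# Generic Viterbi-style segmentation with a subword bigram model.
# The subword inventory and all scores are toy values, not the
# paper's; state = (position in word, last emitted subword).
LOG_P1 = {"un": -2.0, "follow": -3.0, "able": -2.5}          # unigram log-probs
LOG_P2 = {("un", "follow"): -0.5, ("follow", "able"): -0.7}  # bigram log-probs

def bigram_logp(prev: str, cur: str) -> float:
    # Back off to the unigram score when the bigram is unseen.
    return LOG_P2.get((prev, cur), LOG_P1.get(cur, -20.0))

def segment(word: str) -> list[str]:
    """Highest-scoring segmentation under the bigram model."""
    # best[(end, last_subword)] = (score, backpointer)
    best = {(0, "<s>"): (0.0, None)}
    for i in range(len(word)):
        # All states ending at position i were created at earlier i.
        for (pos, prev), (score, _) in list(best.items()):
            if pos != i:
                continue
            for j in range(i + 1, len(word) + 1):
                sub = word[i:j]
                if sub not in LOG_P1:
                    continue
                cand = score + bigram_logp(prev, sub)
                if (j, sub) not in best or cand > best[(j, sub)][0]:
                    best[(j, sub)] = (cand, (pos, prev))
    # Best complete analysis, then follow backpointers.
    finals = {k: v for k, v in best.items() if k[0] == len(word)}
    key = max(finals, key=lambda k: finals[k][0])
    out = []
    while key[1] != "<s>":
        out.append(key[1])
        key = best[key][1]
    return out[::-1]

print(segment("unfollowable"))  # -> ['un', 'follow', 'able']
```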
- Character-level NMT and language similarity [1.90365714903665]
We explore the effectiveness of character-level neural machine translation for various levels of language similarity and size of the training dataset on translation between Czech and Croatian, German, Hungarian, Slovak, and Spanish.
We evaluate the models using automatic MT metrics and show that translation between similar languages benefits from character-level input segmentation.
We confirm previous findings that it is possible to close the gap by finetuning the already trained subword-level models to character-level.
arXiv Detail & Related papers (2023-08-08T17:01:42Z)
- Where's the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation [65.6736056006381]
We present a multilingual punctuation-agnostic sentence segmentation method covering 85 languages.
Our method outperforms all the prior best sentence-segmentation tools by an average of 6.1% F1 points.
By using our method to match sentence segmentation to the segmentation used during training of MT models, we achieve an average improvement of 2.3 BLEU points.
arXiv Detail & Related papers (2023-05-30T09:49:42Z)
- TransLIST: A Transformer-Based Linguistically Informed Sanskrit Tokenizer [11.608920658638976]
Sanskrit Word Segmentation (SWS) is essential in making digitized texts available and in deploying downstream tasks.
We propose a Transformer-based Linguistically Informed Sanskrit Tokenizer (TransLIST).
TransLIST encodes the character input along with latent-word information, which takes into account the sandhi phenomenon specific to SWS; a sketch of one way to surface such latent-word candidates follows this entry.
arXiv Detail & Related papers (2022-10-21T06:15:40Z)
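One plausible reading of "latent-word information" is a set of candidate word spans obtained by matching a lexicon against the sandhied input, which a character-level model can attend to alongside the raw characters. The sketch below illustrates only that candidate-extraction step with a toy lexicon; it is an assumption for illustration, not TransLIST's actual mechanism.

```python
# Illustrative only: derive candidate-word annotations for a character
# sequence by matching a lexicon, one way to expose "latent word"
# information alongside character input. The lexicon and the sandhied
# string are toy examples, not from the TransLIST paper.
LEXICON = {"tat", "tad", "api", "ta"}

def candidate_spans(text: str, lexicon: set[str]) -> list[tuple[int, int, str]]:
    """All (start, end, word) spans of text that match a lexicon entry."""
    spans = []
    for i in range(len(text)):
        for j in range(i + 1, len(text) + 1):
            if text[i:j] in lexicon:
                spans.append((i, j, text[i:j]))
    return spans

# Each character position can then be annotated with the candidate
# words starting there, giving the model word-level evidence while it
# still reads the input character by character. Note that "tat" does
# not match here because sandhi changed it to "tad" on the surface.
print(candidate_spans("tadapi", LEXICON))
# -> [(0, 2, 'ta'), (0, 3, 'tad'), (3, 6, 'api')]
```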
- BenchCLAMP: A Benchmark for Evaluating Language Models on Syntactic and Semantic Parsing [55.058258437125524]
We introduce BenchCLAMP, a Benchmark to evaluate Constrained LAnguage Model Parsing.
We benchmark eight language models, including two GPT-3 variants available only through an API.
Our experiments show that encoder-decoder pretrained language models can achieve performance similar to or surpassing state-of-the-art methods for syntactic and semantic parsing when the model output is constrained to be valid; a sketch of such constrained decoding follows this entry.
arXiv Detail & Related papers (2022-06-21T18:34:11Z)
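Constraining model output to be valid is commonly implemented by masking disallowed next tokens at each decoding step. The sketch below shows that generic mechanism with a toy validity table and stand-in scores; it is not BenchCLAMP's API.

```python
import math

# Toy "grammar": after each prefix, only certain tokens keep the
# output a valid parse. Real systems derive this from a CFG or a
# target-formalism parser; this table is purely illustrative.
VALID_NEXT = {
    (): {"("},
    ("(",): {"S"},
    ("(", "S"): {")"},
    ("(", "S", ")"): {"<eos>"},
}

def constrained_step(prefix: tuple, logits: dict) -> str:
    """Greedy decoding step that masks tokens the grammar disallows."""
    allowed = VALID_NEXT.get(prefix, set())
    masked = {t: (s if t in allowed else -math.inf) for t, s in logits.items()}
    return max(masked, key=masked.get)

# Stand-in scores a language model might assign at each step. Note
# that ")" scores highest unconstrained, yet is only emitted where
# the grammar permits it.
fake_logits = {"(": 0.1, ")": 0.3, "S": 0.2, "<eos>": 0.0}

prefix = ()
while "<eos>" not in prefix:
    prefix = prefix + (constrained_step(prefix, fake_logits),)
print(prefix)  # -> ('(', 'S', ')', '<eos>')
```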
- More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models; a sketch of a common collocation-merging recipe follows this entry.
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
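A common way to obtain merged collocation tokens is to score adjacent word pairs by pointwise mutual information (PMI) and join frequent, high-scoring pairs before topic modeling. The sketch below implements that generic recipe with illustrative thresholds and a simplified PMI estimate; the paper's exact merging procedure may differ.

```python
import math
from collections import Counter

def merge_collocations(docs, min_pmi=1.0, min_count=2):
    """Join adjacent word pairs with high PMI into single tokens."""
    unigrams = Counter(w for doc in docs for w in doc)
    bigrams = Counter(p for doc in docs for p in zip(doc, doc[1:]))
    total = sum(unigrams.values())

    def pmi(a, b):
        # Simplified estimate: all probabilities use the unigram total.
        p_ab = bigrams[(a, b)] / total
        return math.log(p_ab / ((unigrams[a] / total) * (unigrams[b] / total)))

    keep = {p for p, c in bigrams.items()
            if c >= min_count and pmi(*p) >= min_pmi}

    merged_docs = []
    for doc in docs:
        out, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in keep:
                out.append(doc[i] + "_" + doc[i + 1])  # merged token
                i += 2
            else:
                out.append(doc[i])
                i += 1
        merged_docs.append(out)
    return merged_docs

docs = [["new", "york", "is", "big"], ["i", "love", "new", "york"]]
print(merge_collocations(docs))
# -> [['new_york', 'is', 'big'], ['i', 'love', 'new_york']]
```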
- Multi-view Subword Regularization [111.04350390045705]
Multi-view Subword Regularization (MVR) enforces consistency between the predictions made on the same input tokenized by the standard deterministic segmentation and by probabilistically sampled segmentations; a sketch of such a consistency objective follows this entry.
Results on the XTREME multilingual benchmark show that MVR brings consistent improvements of up to 2.5 points over using standard segmentation algorithms.
arXiv Detail & Related papers (2021-03-15T16:07:42Z)
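The consistency MVR enforces can be written as a divergence between the model's predictive distributions for the two tokenizations of the same input. Below is a PyTorch sketch of one such objective, with random tensors standing in for model outputs; the paper's exact divergence and weighting may differ.

```python
import torch
import torch.nn.functional as F

# Stand-ins for model logits over the label space for the SAME input,
# tokenized (a) by the standard deterministic segmentation and
# (b) by a probabilistically sampled segmentation.
batch, num_labels = 4, 10
logits_standard = torch.randn(batch, num_labels)
logits_sampled = torch.randn(batch, num_labels)

def consistency_loss(logits_a, logits_b):
    """Symmetrized KL divergence between the two predictive
    distributions (one plausible choice of divergence)."""
    log_p = F.log_softmax(logits_a, dim=-1)
    log_q = F.log_softmax(logits_b, dim=-1)
    kl_pq = F.kl_div(log_q, log_p.exp(), reduction="batchmean")  # KL(p || q)
    kl_qp = F.kl_div(log_p, log_q.exp(), reduction="batchmean")  # KL(q || p)
    return 0.5 * (kl_pq + kl_qp)

# Total training loss = supervised task loss + weighted consistency.
task_loss = torch.tensor(0.0)  # placeholder for the supervised term
loss = task_loss + 1.0 * consistency_loss(logits_standard, logits_sampled)
print(loss.item())
```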
- 2kenize: Tying Subword Sequences for Chinese Script Conversion [54.33749520569979]
We propose a model that can disambiguate between candidate mappings and convert between the two Chinese scripts.
Our proposed method outperforms previous Chinese character conversion approaches by 6 points in accuracy.
arXiv Detail & Related papers (2020-05-07T10:53:05Z)