TransLIST: A Transformer-Based Linguistically Informed Sanskrit
Tokenizer
- URL: http://arxiv.org/abs/2210.11753v1
- Date: Fri, 21 Oct 2022 06:15:40 GMT
- Title: TransLIST: A Transformer-Based Linguistically Informed Sanskrit
Tokenizer
- Authors: Jivnesh Sandhan, Rathin Singha, Narein Rao, Suvendu Samanta, Laxmidhar
Behera and Pawan Goyal
- Abstract summary: Sanskrit Word Segmentation (SWS) is essential in making digitized texts available and in deploying downstream tasks.
We propose the Transformer-based Linguistically Informed Sanskrit Tokenizer (TransLIST).
TransLIST encodes the character input along with latent-word information, which takes into account the sandhi phenomenon specific to SWS.
- Score: 11.608920658638976
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sanskrit Word Segmentation (SWS) is essential in making digitized texts
available and in deploying downstream tasks. It is, however, non-trivial
because of the sandhi phenomenon that modifies the characters at the word
boundaries, and needs special treatment. Existing lexicon-driven approaches for
SWS make use of the Sanskrit Heritage Reader, a lexicon-driven shallow parser, to
generate the complete candidate solution space, over which various methods are
applied to produce the most valid solution. However, these approaches fail
while encountering out-of-vocabulary tokens. On the other hand, purely
engineering methods for SWS have made use of recent advances in deep learning,
but cannot make use of the latent word information when it is available.
To mitigate the shortcomings of both families of approaches, we propose
the Transformer-based Linguistically Informed Sanskrit Tokenizer (TransLIST),
consisting of (1) a module that encodes the character input along with
latent-word information, which takes into account the sandhi phenomenon
specific to SWS and is apt to work with partial or no candidate solutions, (2)
a novel soft-masked attention to prioritize potential candidate words and (3) a
novel path ranking algorithm to rectify the corrupted predictions. Experiments
on the benchmark datasets for SWS show that TransLIST outperforms the current
state-of-the-art system by an average 7.2 points absolute gain in terms of
perfect match (PM) metric. The codebase and datasets are publicly available at
https://github.com/rsingha108/TransLIST
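To make the sandhi difficulty concrete: adjacent words fuse at their boundary (for example, tava + icchā becomes tavecchā, where a + i merges into e), so segmentation cannot be recovered by splitting on surface characters alone. Below is a minimal, hypothetical sketch of how a soft-masked attention over character representations might prioritize lexicon candidate words; the additive-bias formulation and all module names are illustrative assumptions, not the authors' exact implementation.

```python
# A minimal sketch of soft-masked attention over characters, assuming a
# precomputed 0/1 "candidate" matrix that marks character pairs covered by
# the same latent candidate word (hypothetical formulation for illustration).
import math
import torch
import torch.nn as nn


class SoftMaskedAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        # Learnable weight for how strongly candidate words bias attention.
        self.lam = nn.Parameter(torch.tensor(1.0))

    def forward(self, x: torch.Tensor, cand: torch.Tensor) -> torch.Tensor:
        # x:    (batch, seq, d_model) character representations
        # cand: (batch, seq, seq), 1.0 where positions i and j fall inside
        #       the same lexicon candidate word, else 0.0
        d = x.size(-1)
        scores = self.q(x) @ self.k(x).transpose(-2, -1) / math.sqrt(d)
        scores = scores + self.lam * cand  # soft bias, not a hard mask
        return torch.softmax(scores, dim=-1) @ self.v(x)


if __name__ == "__main__":
    attn = SoftMaskedAttention(d_model=16)
    chars = torch.randn(1, 8, 16)   # 8 Sanskrit characters
    cand = torch.zeros(1, 8, 8)
    cand[0, 2:5, 2:5] = 1.0         # one candidate word spans chars 2..4
    print(attn(chars, cand).shape)  # torch.Size([1, 8, 16])
```

Because the candidate matrix only biases attention rather than hard-masking it, a module of this shape degrades gracefully when the lexicon offers partial or no candidate solutions, matching the behavior the abstract claims.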
Related papers
- CharSS: Character-Level Transformer Model for Sanskrit Word Segmentation [39.08623113730563]
Subword tokens in Indian languages inherently carry meaning, and isolating them can enhance NLP tasks.
We propose a new approach, CharSS, which utilizes a character-level Transformer model for Sanskrit Word Segmentation.
We perform experiments on three benchmark datasets to compare the performance of our method against existing methods.
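A minimal sketch of this character-level framing, assuming segmentation is cast as sequence-to-sequence transduction from an unsegmented character stream to one with separator symbols inserted; the architecture sizes and the separator convention are assumptions for illustration.

```python
# Toy sketch: word segmentation as character-level seq2seq, where the target
# is the source character stream with SEP tokens at recovered boundaries.
import torch
import torch.nn as nn

PAD, SEP = 0, 1  # SEP marks a recovered word boundary in the target


class CharSeq2Seq(nn.Module):
    def __init__(self, vocab=128, d_model=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, vocab)

    def forward(self, src_ids, tgt_ids):
        # src: unsegmented characters; tgt: characters interleaved with SEP
        h = self.transformer(self.emb(src_ids), self.emb(tgt_ids))
        return self.out(h)  # (batch, tgt_len, vocab) logits


model = CharSeq2Seq()
src = torch.randint(2, 128, (1, 10))
tgt = torch.randint(2, 128, (1, 12))
print(model(src, tgt).shape)  # torch.Size([1, 12, 128])
```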
arXiv Detail & Related papers (2024-07-08T18:50:13Z)
- TransMI: A Framework to Create Strong Baselines from Multilingual Pretrained Language Models for Transliterated Data [50.40191599304911]
We propose TransMI, which can create a strong baseline well-suited for data that is transliterated into a common script.
Results show a consistent improvement of 3% to 34%, varying across different models and tasks.
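A toy sketch of the transliterate-then-merge idea as summarized above; the transliteration table, the merge rule, and all names below are hypothetical stand-ins rather than the TransMI implementation.

```python
# Transliterate tokens into a common (Latin) script, then add any new forms
# to an existing subword vocabulary. The mapping table is a toy stand-in.
TRANSLIT = {"स": "sa", "र": "ra", "म": "ma"}  # toy Devanagari-to-Latin map


def transliterate(token: str) -> str:
    # Map each character into the common script, leaving unknowns as-is.
    return "".join(TRANSLIT.get(ch, ch) for ch in token)


def merge_vocab(vocab: dict, tokens: list) -> dict:
    # Add the transliterated form of each token if the vocabulary lacks it.
    merged = dict(vocab)
    for tok in tokens:
        latin = transliterate(tok)
        if latin not in merged:
            merged[latin] = len(merged)  # assign the next free id
    return merged


vocab = {"sa": 0, "ra": 1}
print(merge_vocab(vocab, ["सर", "रम"]))
# {'sa': 0, 'ra': 1, 'sara': 2, 'rama': 3}
```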
arXiv Detail & Related papers (2024-05-16T09:08:09Z)
- Provably Secure Disambiguating Neural Linguistic Steganography [66.30965740387047]
The segmentation ambiguity problem, which arises when using language models based on subwords, leads to occasional decoding failures.
We propose a novel secure disambiguation method named SyncPool, which effectively addresses the segmentation ambiguity problem.
SyncPool does not change the size of the candidate pool or the distribution of tokens and thus is applicable to provably secure language steganography methods.
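A toy reading of the synchronized-pool idea, under the assumption that mutually ambiguous candidates are grouped and the within-group choice is driven by randomness shared between sender and receiver; the grouping rule and function below are illustrative, not the paper's construction.

```python
# The secret selects a group; the shared PRNG makes the within-group choice,
# so that choice carries no hidden information and cannot desynchronize the
# receiver even when token strings are ambiguous at decode time.
import random


def sample_with_sync_pool(candidates, groups, shared_seed, secret_bits):
    # candidates: token strings; groups: lists of candidate indices that are
    # mutually ambiguous when decoding the surface text.
    rng = random.Random(shared_seed)
    group = groups[secret_bits % len(groups)]
    return candidates[rng.choice(group)]


cands = ["new york", "new", "york"]
groups = [[0], [1, 2]]  # "new" + "york" can collide with "new york"
print(sample_with_sync_pool(cands, groups, shared_seed=42, secret_bits=1))
```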
arXiv Detail & Related papers (2024-03-26T09:25:57Z)
- Integrating Bidirectional Long Short-Term Memory with Subword Embedding for Authorship Attribution [2.3429306644730854]
Various word-based stylistic markers have been successfully used in deep learning methods to address the intrinsic problem of authorship attribution.
The proposed method was experimentally evaluated against numerous state-of-the-art methods on the public corpora CCAT50, IMDb62, Blog50, and Twitter50.
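A minimal sketch of a BiLSTM over subword embeddings for authorship classification; the vocabulary size, layer widths, and mean-pooling choice are assumptions for illustration.

```python
# Embed subword ids, run a bidirectional LSTM, mean-pool over time, and
# classify into one of n_authors classes.
import torch
import torch.nn as nn


class SubwordBiLSTM(nn.Module):
    def __init__(self, vocab=8000, d_emb=128, d_hid=256, n_authors=50):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_emb)
        self.lstm = nn.LSTM(d_emb, d_hid, bidirectional=True, batch_first=True)
        self.cls = nn.Linear(2 * d_hid, n_authors)

    def forward(self, subword_ids):
        h, _ = self.lstm(self.emb(subword_ids))
        return self.cls(h.mean(dim=1))  # mean-pool over time, then classify


model = SubwordBiLSTM()
ids = torch.randint(0, 8000, (2, 100))  # 2 documents, 100 subwords each
print(model(ids).shape)                 # torch.Size([2, 50])
```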
arXiv Detail & Related papers (2023-06-26T11:35:47Z)
- Mining Word Boundaries in Speech as Naturally Annotated Word Segmentation Data [41.494096583913105]
Inspired by early research on exploring naturally annotated data for Chinese word segmentation (CWS), this work proposes to mine word boundaries from parallel speech/text data.
First, we collect parallel speech/text data from two Internet sources that are related to the CWS data used in our experiments.
We obtain character-level alignments and design simple rules for determining word boundaries according to pause duration between adjacent characters.
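The pause-duration rule lends itself to a short sketch: given per-character time spans from a speech/text alignment, a boundary is placed wherever the inter-character silence exceeds a threshold. The threshold value and data layout below are assumptions.

```python
# Insert a word boundary wherever the silence between adjacent aligned
# characters exceeds min_pause seconds.
def mine_boundaries(chars, spans, min_pause=0.10):
    # chars: list of characters; spans: (start_sec, end_sec) per character
    words, current = [], [chars[0]]
    for prev, cur, ch in zip(spans, spans[1:], chars[1:]):
        if cur[0] - prev[1] >= min_pause:  # long pause => boundary
            words.append("".join(current))
            current = []
        current.append(ch)
    words.append("".join(current))
    return words


chars = list("我爱北京")
spans = [(0.0, 0.2), (0.25, 0.45), (0.62, 0.80), (0.82, 1.0)]
print(mine_boundaries(chars, spans))  # ['我爱', '北京']
```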
arXiv Detail & Related papers (2022-10-31T08:02:21Z)
- A Vocabulary-Free Multilingual Neural Tokenizer for End-to-End Task Learning [8.052271364177988]
Subword tokenization is a commonly used input pre-processing step in most recent NLP models.
We propose a vocabulary-free neural tokenizer by distilling segmentation information from subword tokenization.
Our tokenizer consistently improves performance on multilingual (NLI) and code-switching (sentiment analysis) tasks.
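One plausible reading of the distillation step, sketched below: a teacher subword tokenizer's segment starts become per-character boundary labels for a small character-level tagger. The toy teacher and tiny model here are stand-ins, not the paper's architecture.

```python
# Turn a teacher tokenizer's segmentation into per-character boundary labels,
# then train a character-level tagger to predict them.
import torch
import torch.nn as nn


def boundary_labels(text, subwords):
    # Label 1 for the first character of each subword, 0 otherwise.
    labels, i = [0] * len(text), 0
    for sw in subwords:
        labels[i] = 1
        i += len(sw)
    return labels


text = "unbelievable"
teacher = ["un", "believ", "able"]  # stand-in for a BPE tokenizer's output
y = torch.tensor(boundary_labels(text, teacher), dtype=torch.float32)

tagger = nn.Sequential(nn.Embedding(128, 32), nn.Linear(32, 1))
x = torch.tensor([ord(c) for c in text])
logits = tagger(x).squeeze(-1)
loss = nn.functional.binary_cross_entropy_with_logits(logits, y)
print(float(loss))
```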
arXiv Detail & Related papers (2022-04-22T16:50:49Z)
- Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP [22.772546707304766]
We show how hybrid approaches of words and characters as well as subword-based approaches based on learned segmentation have been proposed and evaluated.
We conclude that there is no silver-bullet solution for all applications, and there likely never will be.
arXiv Detail & Related papers (2021-12-20T13:04:18Z)
- Semantic-Preserving Adversarial Text Attacks [85.32186121859321]
We propose a Bigram and Unigram based adaptive Semantic Preservation Optimization (BU-SPO) method to examine the vulnerability of deep models.
Our method achieves the highest attack success and semantic preservation rates while changing the smallest number of words compared with existing methods.
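The substitution-based attack family can be illustrated with a toy loop that tries synonym swaps one word at a time and keeps the first that flips a classifier, thereby minimizing changed words; the synonym table and classifier are hypothetical, and this is not the BU-SPO algorithm itself.

```python
# Try semantics-preserving one-word substitutions until the toy classifier's
# prediction flips.
SYNONYMS = {"great": ["fine", "grand"], "movie": ["film"]}


def toy_classifier(words):
    return 1 if "great" in words else 0  # trivially word-sensitive


def attack(words):
    for i, w in enumerate(words):
        for syn in SYNONYMS.get(w, []):
            cand = words[:i] + [syn] + words[i + 1:]
            if toy_classifier(cand) != toy_classifier(words):
                return cand  # a single-word change flipped the label
    return words


print(attack("a great movie".split()))  # ['a', 'fine', 'movie']
```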
arXiv Detail & Related papers (2021-08-23T09:05:18Z)
- Fake it Till You Make it: Self-Supervised Semantic Shifts for Monolingual Word Embedding Tasks [58.87961226278285]
We propose a self-supervised approach to model lexical semantic change.
We show that our method can be used for the detection of semantic change with any alignment method.
We illustrate the utility of our techniques using experimental results on three different datasets.
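As context for how such detection is typically scored, here is a sketch that aligns embeddings from two periods with orthogonal Procrustes and ranks words by displacement, with an artificially injected shift standing in for the paper's self-supervised signal; the random matrices are placeholders for trained embeddings.

```python
# Align two embedding spaces with orthogonal Procrustes, then rank words by
# how far they moved; a large displacement suggests semantic change.
import numpy as np


def procrustes_align(A, B):
    # Rotation R minimizing ||A @ R - B||_F (orthogonal Procrustes solution).
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt


rng = np.random.default_rng(0)
emb_t1 = rng.normal(size=(100, 50))  # 100 words, period 1
emb_t2 = emb_t1 + 0.01 * rng.normal(size=(100, 50))
emb_t2[7] = rng.normal(size=50)      # simulate one word that shifted

aligned = emb_t1 @ procrustes_align(emb_t1, emb_t2)
shift = np.linalg.norm(aligned - emb_t2, axis=1)
print(int(shift.argmax()))           # 7: the injected shift is recovered
```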
arXiv Detail & Related papers (2021-01-30T18:59:43Z)
- Depth-Adaptive Graph Recurrent Network for Text Classification [71.20237659479703]
The Sentence-State LSTM (S-LSTM) is a powerful and highly efficient graph recurrent network.
We propose a depth-adaptive mechanism for the S-LSTM, which allows the model to learn how many computational steps to conduct for different words as required.
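An ACT-style toy of the depth-adaptive idea, assuming each word's state is refined until a learned halting score fires or a step budget is exhausted; the GRU cell and fixed threshold are illustrative choices, not the S-LSTM internals.

```python
# Refine each word state for a variable number of steps, halting early when
# a learned score passes a threshold.
import torch
import torch.nn as nn


class AdaptiveSteps(nn.Module):
    def __init__(self, d=32, max_steps=4):
        super().__init__()
        self.cell = nn.GRUCell(d, d)
        self.halt = nn.Linear(d, 1)
        self.max_steps = max_steps

    def forward(self, word_states):
        out = []
        for h in word_states:  # one state vector per word
            for _ in range(self.max_steps):
                h = self.cell(h.unsqueeze(0), h.unsqueeze(0)).squeeze(0)
                if torch.sigmoid(self.halt(h)) > 0.5:  # learned halting
                    break
            out.append(h)
        return torch.stack(out)


m = AdaptiveSteps()
print(m(torch.randn(5, 32)).shape)  # torch.Size([5, 32])
```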
arXiv Detail & Related papers (2020-02-29T03:09:55Z)
- TextScanner: Reading Characters in Order for Robust Scene Text Recognition [60.04267660533966]
TextScanner is an alternative approach for scene text recognition.
It generates pixel-wise, multi-channel segmentation maps for character class, position and order.
It also adopts an RNN for context modeling and performs parallel prediction of character position and class.
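A shape-level sketch of the multi-branch output described above: from a shared feature map, separate 1x1 heads predict per-pixel character-class, position, and order maps. The backbone and channel counts are assumptions for illustration.

```python
# From a shared feature map, predict per-pixel class, position, and order maps.
import torch
import torch.nn as nn


class SegHeads(nn.Module):
    def __init__(self, feat=64, n_classes=37, max_order=16):
        super().__init__()
        self.backbone = nn.Conv2d(3, feat, 3, padding=1)  # stand-in backbone
        self.cls_map = nn.Conv2d(feat, n_classes, 1)      # which character
        self.pos_map = nn.Conv2d(feat, 1, 1)              # character centers
        self.ord_map = nn.Conv2d(feat, max_order, 1)      # reading order

    def forward(self, img):
        f = torch.relu(self.backbone(img))
        return self.cls_map(f), self.pos_map(f), self.ord_map(f)


heads = SegHeads()
img = torch.randn(1, 3, 32, 128)  # a text-line crop
for m in heads(img):
    print(m.shape)
```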
arXiv Detail & Related papers (2019-12-28T07:52:00Z)