Morphological Analysis of Japanese Hiragana Sentences using the BI-LSTM
CRF Model
- URL: http://arxiv.org/abs/2201.03366v1
- Date: Mon, 10 Jan 2022 14:36:06 GMT
- Title: Morphological Analysis of Japanese Hiragana Sentences using the BI-LSTM
CRF Model
- Authors: Jun Izutsu and Kanako Komiya
- Abstract summary: This study proposes a method to develop neural models of the morphological analyzer for Japanese Hiragana sentences.
Morphological analysis is a technique that divides text data into words and assigns information such as parts of speech.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study proposes a method to develop neural models of the morphological
analyzer for Japanese Hiragana sentences using the Bi-LSTM CRF model.
Morphological analysis is a technique that divides text data into words and
assigns information such as parts of speech. This technique plays an essential
role in downstream applications in Japanese natural language processing systems
because the Japanese language does not have word delimiters between words.
Hiragana is a Japanese phonographic script used in texts for children or for readers who cannot read Chinese characters. Morphological
analysis of Hiragana sentences is more difficult than that of ordinary Japanese
sentences because such text offers fewer cues for word segmentation. For morphological
analysis of Hiragana sentences, we demonstrated the effectiveness of
fine-tuning using a model based on ordinary Japanese text and examined the
influence of training data on texts of various genres.
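The paper does not reproduce its implementation here; as a minimal sketch of the decoding step in a Bi-LSTM CRF tagger, the snippet below runs Viterbi decoding over per-character emission scores (which the Bi-LSTM would produce) and tag-transition scores. The two-tag set (B = word-begin, I = word-inside) and all score values are illustrative assumptions, not taken from the paper.

```python
# Viterbi decoding for the CRF layer of a BiLSTM-CRF tagger (sketch).
# In the full model, `emissions` would come from a Bi-LSTM over the
# Hiragana characters; here both scores and tags are hypothetical.

def viterbi(emissions, transitions, tags):
    """emissions: list of {tag: score} per character;
    transitions: {(prev_tag, tag): score}; returns best tag path."""
    best = {t: emissions[0][t] for t in tags}   # scores of best paths ending in each tag
    back = []                                   # backpointers per step
    for emit in emissions[1:]:
        ptr, nxt = {}, {}
        for t in tags:
            prev, score = max(
                ((p, best[p] + transitions[(p, t)]) for p in tags),
                key=lambda x: x[1],
            )
            ptr[t] = prev
            nxt[t] = score + emit[t]
        back.append(ptr)
        best = nxt
    # Trace back from the highest-scoring final tag.
    last = max(best, key=best.get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

tags = ["B", "I"]
trans = {("B", "B"): 0.0, ("B", "I"): 0.5, ("I", "B"): 0.2, ("I", "I"): 0.1}
emits = [{"B": 1.0, "I": 0.0}, {"B": 0.2, "I": 0.9}, {"B": 0.8, "I": 0.3}]
print(viterbi(emits, trans, tags))  # → ['B', 'I', 'B']
```

In the actual model, the transition scores are learned jointly with the Bi-LSTM, so the CRF can penalize tag sequences (e.g. a word-inside tag at sentence start) that a per-character classifier would miss.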
Related papers
- Examining Language Modeling Assumptions Using an Annotated Literary Dialect Corpus [0.0]
We present a dataset of 19th century American literary orthovariant tokens with a novel layer of human-annotated dialect group tags.
We find indications that the "dialect effect" produced by intentional orthographic variation employs multiple linguistic channels.
arXiv Detail & Related papers (2024-10-03T16:58:21Z)
- Kanbun-LM: Reading and Translating Classical Chinese in Japanese Methods by Language Models [17.749113496737106]
We construct the first Classical-Chinese-to-Kanbun dataset in the world.
Character reordering and machine translation play a significant role in Kanbun comprehension.
We release our code and dataset on GitHub.
arXiv Detail & Related papers (2023-05-22T06:30:02Z)
- Detecting Text Formality: A Study of Text Classification Approaches [78.11745751651708]
This work proposes the first, to our knowledge, systematic study of formality detection methods based on statistical, neural, and Transformer-based machine learning approaches.
We conducted three types of experiments -- monolingual, multilingual, and cross-lingual.
The study shows that the Char BiLSTM model outperforms Transformer-based models on the monolingual and multilingual formality classification tasks.
arXiv Detail & Related papers (2022-04-19T16:23:07Z)
- Modeling Target-Side Morphology in Neural Machine Translation: A Comparison of Strategies [72.56158036639707]
Morphologically rich languages pose difficulties to machine translation.
A large amount of differently inflected word surface forms entails a larger vocabulary.
Some inflected forms of infrequent terms typically do not appear in the training corpus.
Linguistic agreement requires the system to correctly match the grammatical categories between inflected word forms in the output sentence.
arXiv Detail & Related papers (2022-03-25T10:13:20Z)
- Predicting the Ordering of Characters in Japanese Historical Documents [6.82324732276004]
The change in the Japanese writing system in 1900 made historical documents inaccessible to the general public.
We explore a few approaches to the task of predicting the sequential ordering of the characters.
Our best-performing system has an accuracy of 98.65% and has a perfect accuracy on 49% of the books in our dataset.
arXiv Detail & Related papers (2021-06-12T14:39:20Z)
- Leveraging Pre-trained Language Model for Speech Sentiment Analysis [58.78839114092951]
We explore the use of pre-trained language models to learn sentiment information of written texts for speech sentiment analysis.
We propose a pseudo label-based semi-supervised training strategy using a language model on an end-to-end speech sentiment approach.
arXiv Detail & Related papers (2021-06-11T20:15:21Z)
- Evaluating the Morphosyntactic Well-formedness of Generated Texts [88.20502652494521]
We propose L'AMBRE -- a metric to evaluate the morphosyntactic well-formedness of text.
We show the effectiveness of our metric on the task of machine translation through a diachronic study of systems translating into morphologically-rich languages.
arXiv Detail & Related papers (2021-03-30T18:02:58Z)
- LET: Linguistic Knowledge Enhanced Graph Transformer for Chinese Short Text Matching [29.318730227080675]
We introduce HowNet as an external knowledge base and propose a Linguistic knowledge Enhanced graph Transformer (LET) to deal with word ambiguity.
Experimental results on two Chinese datasets show that our models outperform various typical text matching approaches.
arXiv Detail & Related papers (2021-02-25T04:01:51Z)
- Generating Adversarial Examples in Chinese Texts Using Sentence-Pieces [60.58900627906269]
We propose a pre-trained language model as the substitute generator, using sentence-pieces to craft adversarial examples in Chinese.
The substitutions in the generated adversarial examples are not characters or words but 'pieces', which are more natural to Chinese readers.
arXiv Detail & Related papers (2020-12-29T14:28:07Z)
- Inference-only sub-character decomposition improves translation of unseen logographic characters [18.148675498274866]
Neural Machine Translation (NMT) on logographic source languages struggles when translating 'unseen' characters.
We investigate existing ideograph-based sub-character decomposition approaches for Chinese-to-English and Japanese-to-English NMT.
We find that complete sub-character decomposition often harms unseen character translation, and gives inconsistent results generally.
arXiv Detail & Related papers (2020-11-12T17:36:22Z)
- Evaluating Transformer-Based Multilingual Text Classification [55.53547556060537]
We argue that NLP tools perform unequally across languages with different syntactic and morphological structures.
We calculate word order and morphological similarity indices to aid our empirical study.
arXiv Detail & Related papers (2020-04-29T03:34:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.