Japanese Lexical Complexity for Non-Native Readers: A New Dataset
- URL: http://arxiv.org/abs/2306.17399v1
- Date: Fri, 30 Jun 2023 04:37:43 GMT
- Title: Japanese Lexical Complexity for Non-Native Readers: A New Dataset
- Authors: Yusuke Ide, Masato Mita, Adam Nohejl, Hiroki Ouchi, Taro Watanabe
- Abstract summary: We construct the first Japanese lexical complexity dataset.
Our dataset provides separate complexity scores for Chinese/Korean annotators and others to address the readers' L1-specific needs.
In the baseline experiment, we demonstrate the effectiveness of a BERT-based system for Japanese LCP.
- Score: 17.435354337164807
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Lexical complexity prediction (LCP) is the task of predicting the complexity
of words in a text on a continuous scale. It plays a vital role in simplifying
or annotating complex words to assist readers. To study lexical complexity in
Japanese, we construct the first Japanese LCP dataset. Our dataset provides
separate complexity scores for Chinese/Korean annotators and others to address
the readers' L1-specific needs. In the baseline experiment, we demonstrate the
effectiveness of a BERT-based system for Japanese LCP.
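As a rough illustration of what "separate complexity scores" per L1 group could look like, the sketch below averages annotator ratings within each group and rescales them to a continuous [0, 1] range. The abstract does not describe the aggregation procedure, so the rating scale (0 to 3), the group labels `CK` and `other`, and the function name `group_complexity` are all assumptions made for this example, not the authors' method.

```python
from statistics import mean

def group_complexity(ratings, max_rating=3):
    """Mean rating per L1 group, rescaled to [0, 1].

    `ratings` is a list of (group_label, rating) pairs for one word.
    The 0..max_rating scale is an assumption for illustration only.
    """
    by_group = {}
    for group, rating in ratings:
        by_group.setdefault(group, []).append(rating)
    return {group: mean(values) / max_rating for group, values in by_group.items()}

# Hypothetical ratings for one word: Chinese/Korean-L1 annotators ("CK")
# find it easy, while other annotators find it hard.
ratings = [("CK", 0), ("CK", 1), ("other", 2), ("other", 3)]
scores = group_complexity(ratings)
```

Keeping the two group means apart, rather than pooling all annotators, is what lets a downstream system target one reader group at a time.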
Related papers
- CDTP: A Large-Scale Chinese Data-Text Pair Dataset for Comprehensive Evaluation of Chinese LLMs [71.01843542502438]
We present a comprehensive benchmark for evaluating Chinese Large Language Models (CB-ECLLM).
CB-ECLLM is based on the newly constructed Chinese Data-Text Pair (CDTP) dataset.
CDTP comprises over 7 million aligned text pairs, each consisting of unstructured text coupled with one or more corresponding triples, alongside a total of 15 million triples spanning four critical domains.
arXiv Detail & Related papers (2025-10-07T15:33:52Z) - Interpretable Mnemonic Generation for Kanji Learning via Expectation-Maximization [51.26060172682443]
We propose a generative framework that explicitly models the mnemonic construction process as driven by a set of common rules.
Our method learns latent structures and compositional rules, enabling interpretable and systematic mnemonics generation.
arXiv Detail & Related papers (2025-07-07T15:49:23Z) - EXECUTE: A Multilingual Benchmark for LLM Token Understanding [54.70665106141121]
Tests across multiple languages reveal that challenges in other languages are not always on the character level as in English.
We also examine sub-character tasks in Chinese, Japanese, and Korean to assess LLMs' understanding of character components.
arXiv Detail & Related papers (2025-05-23T11:56:48Z) - Difficult for Whom? A Study of Japanese Lexical Complexity [12.038720850970213]
We show that a recent Japanese LCP dataset is representative of its target population by partially replicating the annotation.
By another reannotation we show that native Chinese speakers perceive the complexity differently due to Sino-Japanese vocabulary.
We show that the model trained on a group mean performs similarly to an individual model in the CWI task, while achieving good LCP performance for an individual is difficult.
arXiv Detail & Related papers (2024-10-24T09:18:53Z) - Cross-lingual Contextualized Phrase Retrieval [63.80154430930898]
We propose a new task formulation of dense retrieval, cross-lingual contextualized phrase retrieval.
We train our Cross-lingual Contextualized Phrase Retriever (CCPR) using contrastive learning.
On the phrase retrieval task, CCPR surpasses baselines by a significant margin, achieving a top-1 accuracy that is at least 13 points higher.
arXiv Detail & Related papers (2024-03-25T14:46:51Z) - Decomposed Prompting: Probing Multilingual Linguistic Structure Knowledge in Large Language Models [54.58989938395976]
We introduce a decomposed prompting approach for sequence labeling tasks.
We test our method on the Universal Dependencies part-of-speech tagging dataset for 38 languages.
arXiv Detail & Related papers (2024-02-28T15:15:39Z) - Can Large Language Models Understand Real-World Complex Instructions? [54.86632921036983]
Large language models (LLMs) can understand human instructions, but struggle with complex instructions.
Existing benchmarks are insufficient to assess LLMs' ability to understand complex instructions.
We propose CELLO, a benchmark for evaluating LLMs' ability to follow complex instructions systematically.
arXiv Detail & Related papers (2023-09-17T04:18:39Z) - A New Dataset and Empirical Study for Sentence Simplification in Chinese [50.0624778757462]
This paper introduces CSS, a new dataset for assessing sentence simplification in Chinese.
We collect manual simplifications from human annotators and perform data analysis to show the difference between English and Chinese sentence simplifications.
In the end, we explore whether Large Language Models can serve as high-quality Chinese sentence simplification systems by evaluating them on CSS.
arXiv Detail & Related papers (2023-06-07T06:47:34Z) - Lexical Complexity Prediction: An Overview [13.224233182417636]
The occurrence of unknown words in texts significantly hinders reading comprehension.
Computational modelling has been applied to identify complex words in texts and replace them with simpler alternatives.
We present an overview of computational approaches to lexical complexity prediction focusing on the work carried out on English data.
arXiv Detail & Related papers (2023-03-08T19:35:08Z) - Structured information extraction from complex scientific text with fine-tuned large language models [55.96705756327738]
We present a simple sequence-to-sequence approach to joint named entity recognition and relation extraction.
The approach leverages a pre-trained large language model (LLM), GPT-3, that is fine-tuned on approximately 500 pairs of prompts.
This approach represents a simple, accessible, and highly-flexible route to obtaining large databases of structured knowledge extracted from unstructured text.
arXiv Detail & Related papers (2022-12-10T07:51:52Z) - Lexical Complexity Controlled Sentence Generation [6.298911438929862]
We introduce a novel task of lexical complexity controlled sentence generation.
It has enormous potential in domains such as grade reading, language teaching and acquisition.
We propose a simple but effective approach for this task based on complexity embedding.
arXiv Detail & Related papers (2022-11-26T11:03:56Z) - Improving Sign Language Translation with Monolingual Data by Sign Back-Translation [105.83166521438463]
We propose a sign back-translation (SignBT) approach, which incorporates massive spoken language texts into sign training.
With a text-to-gloss translation model, we first back-translate the monolingual text to its gloss sequence.
Then, the paired sign sequence is generated by splicing pieces from an estimated gloss-to-sign bank at the feature level.
arXiv Detail & Related papers (2021-05-26T08:49:30Z) - LCP-RIT at SemEval-2021 Task 1: Exploring Linguistic Features for Lexical Complexity Prediction [4.86331990243181]
This paper describes team LCP-RIT's submission to SemEval-2021 Task 1: Lexical Complexity Prediction (LCP).
Our system uses logistic regression and a wide range of linguistic features to predict the complexity of single words in this dataset.
We evaluate the results in terms of mean absolute error, mean squared error, Pearson correlation, and Spearman correlation.
arXiv Detail & Related papers (2021-05-18T18:55:04Z) - Evaluating Multilingual Text Encoders for Unsupervised Cross-Lingual Retrieval [51.60862829942932]
We present a systematic empirical study focused on the suitability of the state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks.
For sentence-level CLIR, we demonstrate that state-of-the-art performance can be achieved.
However, peak performance is not reached by using general-purpose multilingual text encoders off the shelf, but rather by relying on variants that have been further specialized for sentence understanding tasks.
arXiv Detail & Related papers (2021-01-21T00:15:38Z) - CompLex: A New Corpus for Lexical Complexity Prediction from Likert Scale Data [13.224233182417636]
This paper presents the first English dataset for continuous lexical complexity prediction.
We use a 5-point Likert scale scheme to annotate complex words in texts from three sources/domains: the Bible, Europarl, and biomedical texts.
arXiv Detail & Related papers (2020-03-16T03:54:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.