A study of Vietnamese readability assessing through semantic and statistical features
- URL: http://arxiv.org/abs/2411.04756v1
- Date: Thu, 07 Nov 2024 14:54:42 GMT
- Title: A study of Vietnamese readability assessing through semantic and statistical features
- Authors: Hung Tuan Le, Long Truong To, Manh Trong Nguyen, Quyen Nguyen, Trong-Hop Do
- Abstract summary: This paper introduces a new approach that integrates statistical and semantic approaches to assessing text readability.
Our research utilized three distinct datasets: the Vietnamese Text Readability dataset (ViRead), OneStopEnglish, and RACE.
We conducted experiments using various machine learning models, including Support Vector Machine (SVM), Random Forest, and Extra Trees.
- Score: 0.0
- Abstract: Determining the difficulty of a text involves assessing various textual features that may affect the reader's comprehension, yet current research on Vietnamese has focused only on statistical features. This paper introduces a new approach that integrates statistical and semantic approaches to assessing text readability. Our research utilized three distinct datasets: the Vietnamese Text Readability Dataset (ViRead), OneStopEnglish, and RACE, with the latter two translated into Vietnamese. For the semantic aspect, advanced semantic analysis methods were employed using state-of-the-art language models such as PhoBERT, ViDeBERTa, and ViBERT. In addition, statistical methods were incorporated to extract syntactic and lexical features of the text. We conducted experiments using various machine learning models, including Support Vector Machine (SVM), Random Forest, and Extra Trees, and evaluated their performance using accuracy and F1 score metrics. Our results indicate that a joint approach that combines semantic and statistical features significantly enhances the accuracy of readability classification compared to using each method in isolation. The current study emphasizes the importance of considering both statistical and semantic aspects for a more accurate assessment of text difficulty in Vietnamese. This contribution provides insights into the adaptability of advanced language models in the context of Vietnamese text readability and lays the groundwork for future research in this area.
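The joint approach described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the three surface features, the helper names (`statistical_features`, `joint_features`, `accuracy`, `macro_f1`), the use of whitespace-level tokens instead of proper Vietnamese word segmentation, and macro averaging for F1 (the abstract only says "F1 score"). The semantic embedding is passed in as a plain list of floats; in practice it would come from a language model such as PhoBERT.

```python
import re
import unicodedata
from typing import List, Sequence


def statistical_features(text: str) -> List[float]:
    """Illustrative surface features (hypothetical, not the paper's feature set)."""
    text = unicodedata.normalize("NFC", text)  # keep Vietnamese diacritics precomposed
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"\w+", text.lower())  # syllable-level tokens, no word segmentation
    if not sentences or not tokens:
        return [0.0, 0.0, 0.0]
    avg_sentence_len = len(tokens) / len(sentences)
    avg_token_len = sum(len(t) for t in tokens) / len(tokens)
    type_token_ratio = len(set(tokens)) / len(tokens)
    return [avg_sentence_len, avg_token_len, type_token_ratio]


def joint_features(text: str, embedding: Sequence[float]) -> List[float]:
    """Concatenate a semantic embedding with the statistical features."""
    return list(embedding) + statistical_features(text)


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 (macro averaging is an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # The embedding values here are placeholders, not model output.
    vector = joint_features("Tôi đi học. Trời mưa.", [0.12, -0.05, 0.33])
    print(len(vector), vector)
```

The resulting vector is what a classifier such as an SVM, Random Forest, or Extra Trees model would consume; comparing the scores obtained from the embedding alone, the statistical features alone, and the concatenation mirrors the comparison the abstract reports.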
Related papers
- Beyond Coarse-Grained Matching in Video-Text Retrieval [50.799697216533914]
We introduce a new approach for fine-grained evaluation.
Our approach can be applied to existing datasets by automatically generating hard negative test captions.
Experiments on our fine-grained evaluations demonstrate that this approach enhances a model's ability to understand fine-grained differences.
arXiv Detail & Related papers (2024-10-16T09:42:29Z)
- Persian Homograph Disambiguation: Leveraging ParsBERT for Enhanced Sentence Understanding with a Novel Word Disambiguation Dataset [0.0]
We introduce a novel dataset tailored for Persian homograph disambiguation.
Our work encompasses a thorough exploration of various embeddings, evaluated through the cosine similarity method.
We scrutinize the models' performance in terms of Accuracy, Recall, and F1 Score.
arXiv Detail & Related papers (2024-05-24T14:56:36Z)
- How Well Do Text Embedding Models Understand Syntax? [50.440590035493074]
The ability of text embedding models to generalize across a wide range of syntactic contexts remains under-explored.
Our findings reveal that existing text embedding models have not sufficiently addressed these syntactic understanding challenges.
We propose strategies to augment the generalization ability of text embedding models in diverse syntactic scenarios.
arXiv Detail & Related papers (2023-11-14T08:51:00Z)
- Comprehending Lexical and Affective Ontologies in the Demographically Diverse Spatial Social Media Discourse [0.0]
This study aims to comprehend linguistic and socio-demographic features, encompassing English language styles, conveyed sentiments, and lexical diversity within social media data.
Our analysis entails the extraction and examination of various statistical, grammatical, and sentimental features from two groups.
Our investigation unveils substantial disparities in certain linguistic attributes between the two groups, yielding a macro F1 score of approximately 0.85.
arXiv Detail & Related papers (2023-11-12T04:23:33Z)
- Interpretable multimodal sentiment analysis based on textual modality descriptions by using large-scale language models [1.4213973379473654]
Multimodal sentiment analysis is an important area for understanding the user's internal states.
Previous works have attempted to use attention weights or vector distributions to provide interpretability.
This study proposed a novel approach to provide interpretability by converting nonverbal modalities into text descriptions.
arXiv Detail & Related papers (2023-05-07T06:48:06Z)
- An Inclusive Notion of Text [69.36678873492373]
We argue that clarity on the notion of text is crucial for reproducible and generalizable NLP.
We introduce a two-tier taxonomy of linguistic and non-linguistic elements that are available in textual sources and can be used in NLP modeling.
arXiv Detail & Related papers (2022-11-10T14:26:43Z)
- Textual Entailment Recognition with Semantic Features from Empirical Text Representation [60.31047947815282]
A text entails a hypothesis if and only if the true value of the hypothesis follows the text.
In this paper, we propose a novel approach to identifying the textual entailment relationship between text and hypothesis.
We employ an element-wise Manhattan distance vector-based feature that can identify the semantic entailment relationship between the text-hypothesis pair.
arXiv Detail & Related papers (2022-10-18T10:03:51Z)
- TFW2V: An Enhanced Document Similarity Method for the Morphologically Rich Finnish Language [0.5801044612920816]
This study focuses on experimenting with some of the current approaches to Finnish, which is a morphologically rich language.
We propose a simple method, TFW2V, which shows high efficiency in handling both long text documents and limited amounts of data.
arXiv Detail & Related papers (2021-12-23T12:27:45Z)
- Leveraging Pre-trained Language Model for Speech Sentiment Analysis [58.78839114092951]
We explore the use of pre-trained language models to learn sentiment information of written texts for speech sentiment analysis.
We propose a pseudo label-based semi-supervised training strategy using a language model on an end-to-end speech sentiment approach.
arXiv Detail & Related papers (2021-06-11T20:15:21Z)
- Interpretable Multi-dataset Evaluation for Named Entity Recognition [110.64368106131062]
We present a general methodology for interpretable evaluation for the named entity recognition (NER) task.
The proposed evaluation method enables us to interpret the differences in models and datasets, as well as the interplay between them.
By making our analysis tool available, we make it easy for future researchers to run similar analyses and drive progress in this area.
arXiv Detail & Related papers (2020-11-13T10:53:27Z)
- Enhancing lexical-based approach with external knowledge for Vietnamese multiple-choice machine reading comprehension [2.5199066832791535]
We construct a dataset which consists of 2,783 pairs of multiple-choice questions and answers based on 417 Vietnamese texts.
We propose a lexical-based MRC method that utilizes semantic similarity measures and external knowledge sources to analyze questions and extract answers from the given text.
Our proposed method achieves an accuracy of 61.81%, which is 5.51% higher than the best baseline model.
arXiv Detail & Related papers (2020-01-16T08:09:51Z)