TFW2V: An Enhanced Document Similarity Method for the Morphologically
Rich Finnish Language
- URL: http://arxiv.org/abs/2112.12489v1
- Date: Thu, 23 Dec 2021 12:27:45 GMT
- Authors: Quan Duong, Mika Hämäläinen, Khalid Alnajjar
- Abstract summary: This study focuses on experimenting with some of the current approaches to Finnish, which is a morphologically rich language.
We propose a simple method, TFW2V, which shows high efficiency in handling both long text documents and limited amounts of data.
- Score: 0.5801044612920816
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Measuring the semantic similarity of different texts has many important
applications in Digital Humanities research such as information retrieval,
document clustering and text summarization. The performance of different
methods depends on the length of the text, the domain and the language. This
study focuses on experimenting with some of the current approaches to Finnish,
which is a morphologically rich language. At the same time, we propose a simple
method, TFW2V, which shows high efficiency in handling both long text documents
and limited amounts of data. Furthermore, we design an objective evaluation
method which can be used as a framework for benchmarking text similarity
approaches.
Related papers
- A study of Vietnamese readability assessing through semantic and statistical features [0.0]
This paper introduces a new approach that integrates statistical and semantic approaches to assessing text readability.
Our research utilized three distinct datasets: the Vietnamese Text Readability dataset (ViRead), OneStopEnglish, and RACE.
We conducted experiments using various machine learning models, including Support Vector Machine (SVM), Random Forest, and Extra Trees.
arXiv Detail & Related papers (2024-11-07T14:54:42Z)
- Multi-Dimensional Evaluation of Text Summarization with In-Context Learning [79.02280189976562]
In this paper, we study the efficacy of large language models as multi-dimensional evaluators using in-context learning.
Our experiments show that in-context learning-based evaluators are competitive with learned evaluation frameworks for the task of text summarization.
We then analyze the effects of factors such as the selection and number of in-context examples on performance.
arXiv Detail & Related papers (2023-06-01T23:27:49Z)
- Uzbek text summarization based on TF-IDF [0.0]
This article presents an experiment on the summarization task for the Uzbek language.
The methodology relies on text abstracting with the TF-IDF algorithm.
We summarize the given text by applying the n-gram method to important parts of the whole text.
arXiv Detail & Related papers (2023-03-01T12:39:46Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- TRIE++: Towards End-to-End Information Extraction from Visually Rich Documents [51.744527199305445]
This paper proposes a unified end-to-end information extraction framework from visually rich documents.
Text reading and information extraction can reinforce each other via a well-designed multi-modal context block.
The framework can be trained end-to-end, achieving global optimization.
arXiv Detail & Related papers (2022-07-14T08:52:07Z)
- Detecting Text Formality: A Study of Text Classification Approaches [78.11745751651708]
This work presents what is, to our knowledge, the first systematic study of formality detection methods, covering statistical, neural, and Transformer-based machine learning methods.
We conducted three types of experiments -- monolingual, multilingual, and cross-lingual.
The study shows that the Char BiLSTM model outperforms Transformer-based ones on the monolingual and multilingual formality classification tasks.
arXiv Detail & Related papers (2022-04-19T16:23:07Z)
- Pre-training Language Model Incorporating Domain-specific Heterogeneous Knowledge into A Unified Representation [49.89831914386982]
We propose a unified pre-trained language model (PLM) for all forms of text, including unstructured text, semi-structured text, and well-structured text.
Our approach outperforms the pre-training of plain text using only 1/4 of the data.
arXiv Detail & Related papers (2021-09-02T16:05:24Z)
- A Topological Method for Comparing Document Semantics [0.0]
We propose a novel algorithm for comparing the semantic similarity of two documents.
Our experiments are conducted on a document dataset with human judges' results.
Our algorithm can produce highly human-consistent results and also beats most state-of-the-art methods, though it ties with NLTK.
arXiv Detail & Related papers (2020-12-08T04:21:40Z)
- Method of the coherence evaluation of Ukrainian text [0.0]
Methods for measuring text coherence in the Ukrainian language are analyzed.
Training and testing are carried out on a corpus of Ukrainian texts.
The test procedure consists of two typical tasks for text coherence assessment.
arXiv Detail & Related papers (2020-10-31T16:48:55Z)
- MultiGBS: A multi-layer graph approach to biomedical summarization [6.11737116137921]
We propose a domain-specific method that models a document as a multi-layer graph to enable multiple features of the text to be processed at the same time.
The unsupervised method selects sentences from the multi-layer graph based on the MultiRank algorithm and the number of concepts.
The proposed MultiGBS algorithm employs UMLS and extracts the concepts and relationships using different tools such as SemRep, MetaMap, and OGER.
arXiv Detail & Related papers (2020-08-27T04:22:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.