Evaluating Sentence Segmentation and Word Tokenization Systems on
Estonian Web Texts
- URL: http://arxiv.org/abs/2011.07868v1
- Date: Mon, 16 Nov 2020 11:13:41 GMT
- Authors: Kairit Sirts and Kairit Peekman
- Abstract summary: We first describe the manual annotation of sentence boundaries of an Estonian web dataset.
We then present the evaluation results of three existing sentence segmentation and word tokenization systems on this corpus.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Texts obtained from the web are noisy and do not necessarily follow
the orthographic sentence and word boundary rules. Thus, sentence segmentation
and word tokenization systems that have been developed on well-formed texts may
not perform as well on unedited web texts. In this paper, we first describe the
manual annotation of sentence boundaries of an Estonian web dataset and then
present the evaluation results of three existing sentence segmentation and word
tokenization systems on this corpus: EstNLTK, Stanza and UDPipe. While EstNLTK
obtains the highest sentence segmentation performance of the three systems on
this dataset, the sentence segmentation results of Stanza and UDPipe remain
well below those obtained on the more well-formed Estonian UD test set.
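The evaluation above compares each system's predicted sentence boundaries against the manually annotated gold boundaries. A minimal sketch of such a boundary-level precision/recall/F1 computation is below; representing boundaries as end-of-sentence character offsets is an illustrative assumption, not the paper's actual code, and the predicted set would in practice come from a segmenter such as EstNLTK, Stanza or UDPipe.

```python
# Sketch: precision/recall/F1 over sentence-boundary positions.
# Boundaries are modeled as character offsets where a sentence ends;
# this representation is an assumption for illustration only.

def boundary_f1(gold: set[int], predicted: set[int]) -> tuple[float, float, float]:
    """Compute precision, recall and F1 of predicted boundary offsets."""
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: gold has boundaries after offsets 17, 40, 62;
# the system found 17 and 40, plus a spurious boundary at 55.
p, r, f = boundary_f1({17, 40, 62}, {17, 40, 55})
```

The same function works for word tokenization if token end offsets are used instead of sentence end offsets.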
Related papers
- Grammar Induction from Visual, Speech and Text [91.98797120799227]
This work introduces a novel visual-audio-text grammar induction task (VAT-GI).
Inspired by the fact that language grammar exists beyond the texts, we argue that text need not be the predominant modality in grammar induction.
We propose a visual-audio-text inside-outside autoencoder (VaTiora) framework, which leverages rich modality-specific and complementary features for effective grammar parsing.
arXiv Detail & Related papers (2024-10-01T02:24:18Z) - An Evaluation of Sindhi Word Embedding in Semantic Analogies and Downstream Tasks [2.3624125155742064]
We propose a new corpus for word embeddings, consisting of more than 61 million words crawled from multiple web resources.
We design a preprocessing pipeline for the filtration of unwanted text from crawled data.
The cleaned vocabulary is fed to state-of-the-art continuous-bag-of-words, skip-gram, and GloVe word embedding algorithms.
arXiv Detail & Related papers (2024-08-28T11:36:29Z) - Fusion approaches for emotion recognition from speech using acoustic and text-based features [15.186937600119897]
We study different approaches for classifying emotions from speech using acoustic and text-based features.
We compare strategies to combine the audio and text modalities, evaluating them on IEMOCAP and MSP-PODCAST datasets.
For IEMOCAP, we show the large effect that the criteria used to define the cross-validation folds have on results.
arXiv Detail & Related papers (2024-03-27T14:40:25Z) - Identifying Context-Dependent Translations for Evaluation Set Production [11.543673351369183]
A major impediment to the transition to context-aware machine translation is the absence of good evaluation metrics and test sets.
We produce CTXPRO, a tool that identifies subsets of parallel documents containing sentences that require context to translate, covering five phenomena.
The input to the pipeline is a set of hand-crafted, per-language, linguistically-informed rules that select contextual sentence pairs.
arXiv Detail & Related papers (2023-11-04T04:29:08Z) - Sentiment-Aware Word and Sentence Level Pre-training for Sentiment
Analysis [64.70116276295609]
SentiWSP is a sentiment-aware pre-trained language model with combined word-level and sentence-level pre-training tasks.
SentiWSP achieves new state-of-the-art performance on various sentence-level and aspect-level sentiment classification benchmarks.
arXiv Detail & Related papers (2022-10-18T12:25:29Z) - InfoCSE: Information-aggregated Contrastive Learning of Sentence
Embeddings [61.77760317554826]
This paper proposes an information-aggregated contrastive learning framework for learning unsupervised sentence embeddings, termed InfoCSE.
We evaluate the proposed InfoCSE on several benchmark datasets with respect to the semantic textual similarity (STS) task.
Experimental results show that InfoCSE outperforms SimCSE by an average Spearman correlation of 2.60% on BERT-base, and 1.77% on BERT-large.
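STS evaluations like the one above report the Spearman rank correlation between a model's sentence-pair similarity scores and human ratings. A minimal sketch of that statistic is below; this tie-free version is an illustrative assumption, and real evaluations typically use `scipy.stats.spearmanr`, which handles ties.

```python
# Sketch: Spearman rank correlation between model similarity scores and
# human STS ratings. Assumes no tied values, which keeps the classic
# closed-form formula valid: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)).

def spearman(xs: list[float], ys: list[float]) -> float:
    n = len(xs)

    def ranks(vals: list[float]) -> list[int]:
        order = sorted(range(n), key=lambda i: vals[i])
        r = [0] * n
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Perfectly monotone model scores versus human ratings give 1.0.
rho = spearman([0.1, 0.4, 0.9], [1.0, 2.0, 5.0])
```

The reported "average Spearman correlation" gains are averages of this statistic across the STS benchmark subsets.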
arXiv Detail & Related papers (2022-10-08T15:53:19Z) - Non-Parametric Domain Adaptation for End-to-End Speech Translation [72.37869362559212]
End-to-End Speech Translation (E2E-ST) has received increasing attention due to its potential for less error propagation, lower latency, and fewer parameters.
We propose a novel non-parametric method that leverages domain-specific text translation corpus to achieve domain adaptation for the E2E-ST system.
arXiv Detail & Related papers (2022-05-23T11:41:02Z) - Example-Based Machine Translation from Text to a Hierarchical
Representation of Sign Language [1.3999481573773074]
This article presents an original method for Text-to-Sign Translation.
It compensates for data scarcity using a domain-specific parallel corpus of alignments between text and hierarchical formal descriptions of Sign Language videos in AZee.
Based on the detection of similarities present in the source text, the proposed algorithm exploits matches and substitutions of aligned segments to build multiple candidate translations.
The resulting translations are in the form of AZee expressions, designed to be used as input to avatar systems.
arXiv Detail & Related papers (2022-05-06T15:48:43Z) - BURT: BERT-inspired Universal Representation from Twin Structure [89.82415322763475]
BURT (BERT inspired Universal Representation from Twin Structure) is capable of generating universal, fixed-size representations for input sequences of any granularity.
Our proposed BURT adopts a Siamese network, learning sentence-level representations from a natural language inference dataset and word/phrase-level representations from a paraphrasing dataset.
We evaluate BURT across different granularities of text similarity tasks, including STS tasks, SemEval2013 Task 5(a) and some commonly used word similarity tasks.
arXiv Detail & Related papers (2020-04-29T04:01:52Z) - Extractive Summarization as Text Matching [123.09816729675838]
This paper creates a paradigm shift with regard to the way we build neural extractive summarization systems.
We formulate the extractive summarization task as a semantic text matching problem.
We have driven the state-of-the-art extractive result on CNN/DailyMail to a new level (44.41 in ROUGE-1).
arXiv Detail & Related papers (2020-04-19T08:27:57Z)
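Framing extractive summarization as text matching means scoring candidate summaries against the source document and selecting the best match. The sketch below illustrates that selection step with a bag-of-words cosine similarity standing in for the learned semantic matcher; the scoring function and example texts are illustrative assumptions, not the paper's model.

```python
from collections import Counter
from math import sqrt

# Sketch of summarization as text matching: score each candidate summary
# against the full document and return the best-matching one. Bag-of-words
# cosine similarity is a stand-in for the learned semantic matcher.

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_candidate(document: str, candidates: list[str]) -> str:
    doc_vec = Counter(document.lower().split())
    return max(candidates,
               key=lambda c: cosine(Counter(c.lower().split()), doc_vec))

doc = "the cat sat on the mat while the dog slept"
summaries = ["a dog ran fast", "the cat sat on the mat"]
best = best_candidate(doc, summaries)
```

In the matching framing, candidates are whole summaries (sets of extracted sentences) ranked jointly, rather than sentences scored one by one.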
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.