Advancing Hungarian Text Processing with HuSpaCy: Efficient and Accurate NLP Pipelines
- URL: http://arxiv.org/abs/2308.12635v1
- Date: Thu, 24 Aug 2023 08:19:51 GMT
- Title: Advancing Hungarian Text Processing with HuSpaCy: Efficient and Accurate NLP Pipelines
- Authors: György Orosz, Gergő Szabó, Péter Berkecz, Zsolt Szántó and Richárd Farkas
- Abstract summary: This paper presents a set of industrial-grade text processing models for Hungarian.
Models have been implemented in the spaCy framework, extending the HuSpaCy toolkit.
All experiments are reproducible and the pipelines are freely available under a permissive license.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This paper presents a set of industrial-grade text processing models for
Hungarian that achieve near state-of-the-art performance while balancing
resource efficiency and accuracy. Models have been implemented in the spaCy
framework, extending the HuSpaCy toolkit with several improvements to its
architecture. Compared to existing NLP tools for Hungarian, all of our
pipelines feature all basic text processing steps including tokenization,
sentence-boundary detection, part-of-speech tagging, morphological feature
tagging, lemmatization, dependency parsing and named entity recognition with
high accuracy and throughput. We thoroughly evaluated the proposed
enhancements, compared the pipelines with state-of-the-art tools and
demonstrated the competitive performance of the new models in all text
preprocessing steps. All experiments are reproducible and the pipelines are
freely available under a permissive license.
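As an illustration of how such a spaCy-based pipeline is typically driven, here is a minimal sketch; the model name hu_core_news_lg and the example sentence are illustrative assumptions rather than details taken from the paper, and the model must be installed separately (e.g. via the huspacy package).

```python
# Minimal sketch of using a spaCy-based Hungarian pipeline.
# "hu_core_news_lg" and the example sentence are illustrative assumptions.
import spacy

nlp = spacy.load("hu_core_news_lg")
doc = nlp("Budapesten élek és a Szegedi Tudományegyetemen dolgozom.")

for sent in doc.sents:                      # sentence-boundary detection
    for token in sent:
        print(token.text,                   # tokenization
              token.lemma_,                 # lemmatization
              token.pos_,                   # part-of-speech tag
              str(token.morph),             # morphological features
              token.dep_, token.head.text)  # dependency parse

print([(ent.text, ent.label_) for ent in doc.ents])  # named entity recognition
```

All of the accessed attributes (`lemma_`, `pos_`, `morph`, `dep_`, `ents`, `sents`) are standard spaCy annotations, which is what makes the pipelines directly usable from existing spaCy code.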
Related papers
- CELA: Cost-Efficient Language Model Alignment for CTR Prediction [71.85120354973073]
Click-Through Rate (CTR) prediction holds a paramount position in recommender systems.
Recent efforts have sought to mitigate these challenges by integrating Pre-trained Language Models (PLMs).
We propose Cost-Efficient Language Model Alignment (CELA) for CTR prediction.
arXiv Detail & Related papers (2024-05-17T07:43:25Z) - Pipeline and Dataset Generation for Automated Fact-checking in Almost
Any Language [0.0]
This article presents a pipeline for automated fact-checking leveraging publicly available Language Models and data.
The pipeline consists of two main modules -- the evidence retrieval and the claim veracity evaluation.
We provide open access to all data and fine-tuned models for Czech, English, Polish, and Slovak pipelines.
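The two-module structure (evidence retrieval followed by claim veracity evaluation) can be sketched as below; the overlap-based retrieval scoring and the verdict labels are simplified placeholders, not the authors' fine-tuned models.

```python
# Illustrative two-stage fact-checking skeleton: retrieve evidence, then judge
# claim veracity. Both stages are simplified stand-ins for the paper's models.
from collections import Counter

def retrieve_evidence(claim, corpus, k=3):
    """Rank documents by naive word overlap with the claim."""
    claim_words = Counter(claim.lower().split())
    scored = [(sum((Counter(doc.lower().split()) & claim_words).values()), doc)
              for doc in corpus]
    return [doc for score, doc in sorted(scored, reverse=True)[:k] if score > 0]

def evaluate_veracity(claim, evidence):
    """Placeholder verdict; a real system would use a fine-tuned NLI-style model."""
    if not evidence:
        return "NOT ENOUGH INFO"
    return "SUPPORTED" if any(claim.lower() in doc.lower() for doc in evidence) else "UNVERIFIED"

corpus = ["Budapest is the capital of Hungary.", "The Danube flows through Budapest."]
claim = "Budapest is the capital of Hungary."
print(evaluate_veracity(claim, retrieve_evidence(claim, corpus)))
```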
arXiv Detail & Related papers (2023-12-15T19:43:41Z) - Neural Token Segmentation for High Token-Internal Complexity [7.569526565230962]
Tokenizing raw texts into word units is an essential pre-processing step for NLP pipelines.
We propose a novel neural segmentation model which combines contextualised token representation and char-level decoding.
Our model shows substantial improvements in segmentation accuracy on Hebrew and Arabic compared to the state-of-the-art.
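The core idea of character-level decoding on top of contextual representations can be illustrated with a toy decoder; the boundary probabilities below are hard-coded stand-ins for what a trained encoder would produce, and this is not the paper's architecture.

```python
# Toy character-level segmentation decoder: given a per-character probability
# that a token boundary follows that character (in a real model these come
# from a contextualised encoder), cut the text wherever the probability
# exceeds a threshold. Purely illustrative.
def segment(text, boundary_probs, threshold=0.5):
    tokens, current = [], []
    for ch, p in zip(text, boundary_probs):
        current.append(ch)
        if p >= threshold:                 # end the current token here
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens

text = "ababc"
probs = [0.1, 0.9, 0.1, 0.2, 0.8]          # boundaries after "ab" and "abc"
print(segment(text, probs))                 # ['ab', 'abc']
```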
arXiv Detail & Related papers (2022-03-21T10:07:17Z) - Real-Time Scene Text Detection with Differentiable Binarization and
Adaptive Scale Fusion [62.269219152425556]
Segmentation-based methods have drawn extensive attention in the scene text detection field.
We propose a Differentiable Binarization (DB) module that integrates the binarization process into a segmentation network.
An efficient Adaptive Scale Fusion (ASF) module is proposed to improve the scale robustness by fusing features of different scales adaptively.
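Differentiable binarization replaces hard thresholding with a steep sigmoid so gradients can flow through the binarization step. A minimal sketch of the approximate binarization function follows; the amplification factor k = 50 is the value reported in the DB paper, used here only as an illustrative constant.

```python
# Sketch of approximate (differentiable) binarization used in DB-style detectors:
#   B = 1 / (1 + exp(-k * (P - T)))
# where P is the probability map, T the learned threshold map, and k an
# amplification factor (k = 50 in the original paper).
import numpy as np

def differentiable_binarization(prob_map, threshold_map, k=50.0):
    return 1.0 / (1.0 + np.exp(-k * (prob_map - threshold_map)))

P = np.array([[0.2, 0.6], [0.8, 0.4]])    # predicted text probability map
T = np.array([[0.5, 0.5], [0.5, 0.5]])    # predicted threshold map
print(differentiable_binarization(P, T))  # nearly binary, but differentiable
```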
arXiv Detail & Related papers (2022-02-21T15:30:14Z) - HuSpaCy: an industrial-strength Hungarian natural language processing
toolkit [0.0]
A language processing pipeline should provide near state-of-the-art lemmatization, morphosyntactic analysis, entity recognition and word embeddings.
This paper introduces HuSpaCy, an industry-ready Hungarian language processing pipeline.
arXiv Detail & Related papers (2022-01-06T07:49:45Z) - DaCy: A Unified Framework for Danish NLP [1.2891210250935146]
We present DaCy: a unified framework for Danish NLP built on SpaCy.
DaCy uses efficient models which obtain state-of-the-art performance on named entity recognition, part-of-speech tagging, and dependency parsing.
We conduct a series of tests for biases and robustness of Danish NLP pipelines through augmentation of the test set of DaNE.
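Robustness testing via test-set augmentation can be sketched generically: perturb the evaluation sentences (e.g. substitute person names) and compare predictions before and after. The perturbation, the name mapping, and the English model below are toy placeholders, not DaCy's actual augmenters or the DaNE setup.

```python
# Generic sketch of augmentation-based robustness testing for an NER pipeline:
# swap person names in test sentences and check whether entity predictions
# survive the change. All inputs here are toy placeholders.
import spacy

def augment_names(text, mapping):
    for old, new in mapping.items():
        text = text.replace(old, new)
    return text

nlp = spacy.load("en_core_web_sm")          # any spaCy NER pipeline works here
test_sentences = ["Anna Hansen lives in Copenhagen."]
mapping = {"Anna Hansen": "Fatima Al-Masri"}

for sent in test_sentences:
    original = {(e.text, e.label_) for e in nlp(sent).ents}
    perturbed = {(e.text, e.label_) for e in nlp(augment_names(sent, mapping)).ents}
    print("original: ", original)
    print("augmented:", perturbed)          # a robust model still tags the new name
```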
arXiv Detail & Related papers (2021-07-12T10:14:31Z) - Data Augmentation in Natural Language Processing: A Novel Text
Generation Approach for Long and Short Text Classifiers [8.19984844136462]
We present and evaluate a text generation method suitable to increase the performance of classifiers for long and short texts.
In a simulated low-data regime, additive accuracy gains of up to 15.53% are achieved.
We discuss implications and patterns for the successful application of our approach on different types of datasets.
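The general recipe of augmenting a classifier's training set with generated text can be sketched with an off-the-shelf generator; the GPT-2 model and the seed example below are illustrative stand-ins, not the authors' setup.

```python
# Illustrative data augmentation by text generation: seed the generator with a
# labelled example and keep the continuations as new, same-label training texts.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

seed_examples = [("The battery life of this phone is excellent.", "positive")]
augmented = []
for text, label in seed_examples:
    outputs = generator(text, max_new_tokens=30, num_return_sequences=2,
                        do_sample=True)
    for out in outputs:
        augmented.append((out["generated_text"], label))

print(augmented)  # synthetic texts inherit the seed's class label
```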
arXiv Detail & Related papers (2021-03-26T13:16:07Z) - TextFlint: Unified Multilingual Robustness Evaluation Toolkit for
Natural Language Processing [73.16475763422446]
We propose a multilingual robustness evaluation platform for NLP tasks (TextFlint).
It incorporates universal text transformation, task-specific transformation, adversarial attack, subpopulation, and their combinations to provide comprehensive robustness analysis.
TextFlint generates complete analytical reports as well as targeted augmented data to address the shortcomings of the model's robustness.
arXiv Detail & Related papers (2021-03-21T17:20:38Z) - POINTER: Constrained Progressive Text Generation via Insertion-based
Generative Pre-training [93.79766670391618]
We present POINTER, a novel insertion-based approach for hard-constrained text generation.
The proposed method operates by progressively inserting new tokens between existing tokens in a parallel manner.
The resulting coarse-to-fine hierarchy makes the generation process intuitive and interpretable.
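The progressive insertion idea can be illustrated with a toy loop: start from the hard constraint keywords and repeatedly insert tokens between existing neighbours until nothing more is proposed. The insertion "model" here is a hard-coded lookup table standing in for the pre-trained insertion transformer.

```python
# Toy illustration of insertion-based, coarse-to-fine generation: begin with
# constraint keywords and insert tokens between neighbours, pass by pass,
# until no insertion is proposed. The lookup table is a stand-in model.
def propose_insertion(left, right):
    table = {("cat", "mat"): "sat", ("sat", "mat"): "on", ("on", "mat"): "the"}
    return table.get((left, right))

tokens = ["cat", "mat"]                       # hard lexical constraints
while True:
    inserted = False
    new_tokens = [tokens[0]]
    for left, right in zip(tokens, tokens[1:]):
        filler = propose_insertion(left, right)
        if filler is not None:                # insert between this pair
            new_tokens.append(filler)
            inserted = True
        new_tokens.append(right)
    tokens = new_tokens
    if not inserted:
        break

print(" ".join(tokens))                       # cat sat on the mat
```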
arXiv Detail & Related papers (2020-05-01T18:11:54Z) - ESPnet-ST: All-in-One Speech Translation Toolkit [57.76342114226599]
ESPnet-ST is a new project within the end-to-end speech processing toolkit ESPnet.
It implements automatic speech recognition, machine translation, and text-to-speech functions for speech translation.
We provide all-in-one recipes including data pre-processing, feature extraction, training, and decoding pipelines.
arXiv Detail & Related papers (2020-04-21T18:38:38Z) - Coreferential Reasoning Learning for Language Representation [88.14248323659267]
We present CorefBERT, a novel language representation model that can capture the coreferential relations in context.
The experimental results show that, compared with existing baseline models, CorefBERT can achieve significant improvements consistently on various downstream NLP tasks.
arXiv Detail & Related papers (2020-04-15T03:57:45Z)