Related papers: HuSpaCy: an industrial-strength Hungarian natural language processing toolkit

URL: http://arxiv.org/abs/2201.01956v1
Date: Thu, 6 Jan 2022 07:49:45 GMT
Title: HuSpaCy: an industrial-strength Hungarian natural language processing toolkit
Authors: György Orosz, Zsolt Szántó, Péter Berkecz, Gergő Szabó, Richárd Farkas
Abstract summary: A language processing pipeline should consist of close to state-of-the-art lemmatization, morphosyntactic analysis, entity recognition and word embeddings. This paper introduces HuSpaCy, an industry-ready Hungarian language processing pipeline.
Score: 0.0
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Although there are a couple of open-source language processing pipelines available for Hungarian, none of them satisfies the requirements of today's NLP applications. A language processing pipeline should consist of close to state-of-the-art lemmatization, morphosyntactic analysis, entity recognition and word embeddings. Industrial text processing applications have to satisfy non-functional software quality requirements; moreover, frameworks supporting multiple languages are increasingly favored. This paper introduces HuSpaCy, an industry-ready Hungarian language processing pipeline. The presented tool provides components for the most important basic linguistic analysis tasks. It is open-source and is available under a permissive license. Our system is built upon spaCy's NLP components, which means that it is fast, has a rich ecosystem of NLP applications and extensions, and comes with extensive documentation and a well-known API. Besides the overview of the underlying models, we also present rigorous evaluation on common benchmark datasets. Our experiments confirm that HuSpaCy has high accuracy in all subtasks while maintaining resource-efficient prediction capabilities.

Related papers

Exploring NLP Benchmarks in an Extremely Low-Resource Setting [21.656551146954587]
This paper focuses on Ladin, an endangered Romance language, specifically targeting the Val Badia variant. We create synthetic datasets for sentiment analysis and multiple-choice question answering (MCQA) by translating monolingual Italian data.
arXiv Detail & Related papers (2025-09-04T07:41:23Z)
Semi-automated Fact-checking in Portuguese: Corpora Enrichment using Retrieval with Claim extraction [0.0]
This dissertation develops, applies, and analyzes a methodology to enrich Portuguese news corpora with external evidence. The approach simulates a user's verification process, employing Large Language Models (LLMs) to extract the main claim from texts. A data validation and preprocessing framework, including near-duplicate detection, is introduced to enhance the quality of the base corpora.
arXiv Detail & Related papers (2025-07-19T23:46:40Z)
Langformers: Unified NLP Pipelines for Language Models [3.690904966341072]
Langformers is an open-source Python library designed to streamline NLP pipelines. It integrates conversational AI, pretraining, text classification, sentence embedding/reranking, data labelling, semantic search, and knowledge distillation into a cohesive API.
arXiv Detail & Related papers (2025-04-12T10:17:49Z)
CMULAB: An Open-Source Framework for Training and Deployment of Natural Language Processing Models [59.91221728187576]
This paper introduces the CMU Linguistic Annotation Backend, an open-source framework that simplifies model deployment and continuous human-in-the-loop fine-tuning of NLP models. CMULAB enables users to leverage the power of multilingual models to quickly adapt and extend existing tools for speech recognition, OCR, translation, and syntactic analysis to new languages.
arXiv Detail & Related papers (2024-04-03T02:21:46Z)
Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora and report superlative performance on evaluation datasets. This survey delves into an important attribute of these datasets: the dialect of a language. Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
Advancing Hungarian Text Processing with HuSpaCy: Efficient and Accurate NLP Pipelines [0.0]
This paper presents a set of industrial-grade text processing models for Hungarian. Models have been implemented in the spaCy framework, extending the HuSpaCy toolkit. All experiments are reproducible and the pipelines are freely available under a permissive license.
arXiv Detail & Related papers (2023-08-24T08:19:51Z)
Evaluating Embedding APIs for Information Retrieval [51.24236853841468]
We evaluate the capabilities of existing semantic embedding APIs on domain generalization and multilingual retrieval. We find that re-ranking BM25 results using the APIs is a budget-friendly approach and is most effective in English. For non-English retrieval, re-ranking still improves the results, but a hybrid model with BM25 works best, albeit at a higher cost.
arXiv Detail & Related papers (2023-05-10T16:40:52Z)
ANGLEr: A Next-Generation Natural Language Exploratory Framework [0.0]
The proposed design is being used for implementation of a new natural language processing framework, called ANGLEr. The main parts of the proposed framework consist of (a) a pluggable Docker-based architecture, (b) a general data model, and (c) APIs description along with the graphical user interface.
arXiv Detail & Related papers (2022-05-10T13:32:13Z)
Dim Wihl Gat Tun: The Case for Linguistic Expertise in NLP for Underdocumented Languages [6.8708103492634836]
Hundreds of underserved languages have available data sources in the form of interlinear glossed text (IGT) from language documentation efforts. We make the case that IGT data can be leveraged successfully provided that target language expertise is available. We illustrate each step through a case study on developing a morphological reinflection system for the Tsimchianic language Gitksan.
arXiv Detail & Related papers (2022-03-17T22:02:25Z)
Leveraging Language to Learn Program Abstractions and Search Heuristics [66.28391181268645]
We introduce LAPS (Language for Abstraction and Program Search), a technique for using natural language annotations to guide joint learning of libraries and neurally-guided search models for synthesis. When integrated into a state-of-the-art library learning system (DreamCoder), LAPS produces higher-quality libraries and improves search efficiency and generalization.
arXiv Detail & Related papers (2021-06-18T15:08:47Z)
Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from rich-resource language to languages with low resources. Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages. We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z)
How Low is Too Low? A Computational Perspective on Extremely Low-Resource Languages [1.7625363344837164]
We introduce the first cross-lingual information extraction pipeline for Sumerian. We also curate InterpretLR, an interpretability toolkit for low-resource NLP. Most components of our pipeline can be generalised to any other language to obtain an interpretable execution.
arXiv Detail & Related papers (2021-05-30T12:09:59Z)
A Data-Centric Framework for Composable NLP Workflows [109.51144493023533]
Empirical natural language processing systems in application domains (e.g., healthcare, finance, education) involve interoperation among multiple components. We establish a unified open-source framework to support fast development of such sophisticated NLP in a composable manner.
arXiv Detail & Related papers (2021-03-02T16:19:44Z)
N-LTP: An Open-source Neural Language Technology Platform for Chinese [68.58732970171747]
N-LTP is an open-source neural language technology platform supporting six fundamental Chinese NLP tasks. N-LTP adopts the multi-task framework by using a shared pre-trained model, which has the advantage of capturing the shared knowledge across relevant Chinese tasks.
arXiv Detail & Related papers (2020-09-24T11:45:39Z)

This list is automatically generated from the titles and abstracts of the papers on this site.