HuSpaCy: an industrial-strength Hungarian natural language processing
toolkit
- URL: http://arxiv.org/abs/2201.01956v1
- Date: Thu, 6 Jan 2022 07:49:45 GMT
- Title: HuSpaCy: an industrial-strength Hungarian natural language processing
toolkit
- Authors: György Orosz, Zsolt Szántó, Péter Berkecz, Gergő Szabó,
Richárd Farkas
- Abstract summary: A language processing pipeline should consist of close to state-of-the-art lemmatization, morphosyntactic analysis, entity recognition and word embeddings.
This paper introduces HuSpaCy, an industry-ready Hungarian language processing pipeline.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Although there are a couple of open-source language processing pipelines
available for Hungarian, none of them satisfies the requirements of today's NLP
applications. A language processing pipeline should consist of close to
state-of-the-art lemmatization, morphosyntactic analysis, entity recognition
and word embeddings. Industrial text processing applications must also satisfy
non-functional software quality requirements, and frameworks
supporting multiple languages are increasingly favored. This paper introduces
HuSpaCy, an industry-ready Hungarian language processing pipeline. The presented
tool provides components for the most important basic linguistic analysis
tasks. It is open-source and is available under a permissive license. Our
system is built upon spaCy's NLP components which means that it is fast, has a
rich ecosystem of NLP applications and extensions, comes with extensive
documentation and a well-known API. Besides the overview of the underlying
models, we also present rigorous evaluation on common benchmark datasets. Our
experiments confirm that HuSpaCy has high accuracy in all subtasks while
maintaining resource-efficient prediction capabilities.
Related papers
- CMULAB: An Open-Source Framework for Training and Deployment of Natural Language Processing Models [59.91221728187576]
This paper introduces the CMU Linguistic Annotation Backend, an open-source framework that simplifies model deployment and continuous human-in-the-loop fine-tuning of NLP models.
CMULAB enables users to leverage the power of multilingual models to quickly adapt and extend existing tools for speech recognition, OCR, translation, and syntactic analysis to new languages.
arXiv Detail & Related papers (2024-04-03T02:21:46Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models for dialectic datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- Advancing Hungarian Text Processing with HuSpaCy: Efficient and Accurate NLP Pipelines [0.0]
This paper presents a set of industrial-grade text processing models for Hungarian.
Models have been implemented in the spaCy framework, extending the HuSpaCy toolkit.
All experiments are reproducible and the pipelines are freely available under a permissive license.
arXiv Detail & Related papers (2023-08-24T08:19:51Z)
- Evaluating Embedding APIs for Information Retrieval [51.24236853841468]
We evaluate the capabilities of existing semantic embedding APIs on domain generalization and multilingual retrieval.
We find that re-ranking BM25 results using the APIs is a budget-friendly approach and is most effective in English.
For non-English retrieval, re-ranking still improves the results, but a hybrid model with BM25 works best, albeit at a higher cost.
arXiv Detail & Related papers (2023-05-10T16:40:52Z)
- ANGLEr: A Next-Generation Natural Language Exploratory Framework [0.0]
The proposed design is being used for implementation of a new natural language processing framework, called ANGLEr.
The main parts of the proposed framework are (a) a pluggable Docker-based architecture, (b) a general data model, and (c) API descriptions along with the graphical user interface.
arXiv Detail & Related papers (2022-05-10T13:32:13Z)
- Dim Wihl Gat Tun: The Case for Linguistic Expertise in NLP for Underdocumented Languages [6.8708103492634836]
Hundreds of underserved languages have available data sources in the form of interlinear glossed text (IGT) from language documentation efforts.
We make the case that IGT data can be leveraged successfully provided that target language expertise is available.
We illustrate each step through a case study on developing a morphological reinflection system for the Tsimchianic language Gitksan.
arXiv Detail & Related papers (2022-03-17T22:02:25Z)
- Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from rich-resource languages to languages with low resources.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z)
- How Low is Too Low? A Computational Perspective on Extremely Low-Resource Languages [1.7625363344837164]
We introduce the first cross-lingual information extraction pipeline for Sumerian.
We also curate InterpretLR, an interpretability toolkit for low-resource NLP.
Most components of our pipeline can be generalised to any other language to obtain an interpretable execution.
arXiv Detail & Related papers (2021-05-30T12:09:59Z)
- A Data-Centric Framework for Composable NLP Workflows [109.51144493023533]
Empirical natural language processing systems in application domains (e.g., healthcare, finance, education) involve interoperation among multiple components.
We establish a unified open-source framework to support fast development of such sophisticated NLP in a composable manner.
arXiv Detail & Related papers (2021-03-02T16:19:44Z)
- N-LTP: An Open-source Neural Language Technology Platform for Chinese [68.58732970171747]
N-LTP is an open-source neural language technology platform supporting six fundamental Chinese NLP tasks.
N-LTP adopts the multi-task framework by using a shared pre-trained model, which has the advantage of capturing the shared knowledge across relevant Chinese tasks.
arXiv Detail & Related papers (2020-09-24T11:45:39Z)