Stanza: A Python Natural Language Processing Toolkit for Many Human Languages
- URL: http://arxiv.org/abs/2003.07082v2
- Date: Thu, 23 Apr 2020 04:22:16 GMT
- Title: Stanza: A Python Natural Language Processing Toolkit for Many Human Languages
- Authors: Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, Christopher D. Manning
- Abstract summary: We introduce Stanza, an open-source Python natural language processing toolkit supporting 66 human languages.
Stanza features a language-agnostic, fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition.
We have trained Stanza on a total of 112 datasets, including the Universal Dependencies treebanks and other multilingual corpora.
- Score: 44.8226642800919
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Stanza, an open-source Python natural language processing
toolkit supporting 66 human languages. Compared to existing widely used
toolkits, Stanza features a language-agnostic fully neural pipeline for text
analysis, including tokenization, multi-word token expansion, lemmatization,
part-of-speech and morphological feature tagging, dependency parsing, and named
entity recognition. We have trained Stanza on a total of 112 datasets,
including the Universal Dependencies treebanks and other multilingual corpora,
and show that the same neural architecture generalizes well and achieves
competitive performance on all languages tested. Additionally, Stanza includes
a native Python interface to the widely used Java Stanford CoreNLP software,
which further extends its functionality to cover other tasks such as
coreference resolution and relation extraction. Source code, documentation, and
pretrained models for 66 languages are available at
https://stanfordnlp.github.io/stanza.
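As a concrete illustration of the pipeline described above, here is a minimal usage sketch (the example sentence and printed fields are illustrative; processor availability varies by language and Stanza version):

    import stanza

    # Download the English models (one-time).
    stanza.download("en")

    # Build a pipeline covering the tasks named in the abstract: tokenization,
    # multi-word token expansion, POS and morphological feature tagging,
    # lemmatization, dependency parsing, and named entity recognition.
    nlp = stanza.Pipeline(lang="en", processors="tokenize,mwt,pos,lemma,depparse,ner")

    doc = nlp("Stanza was built at Stanford University.")

    # Each word carries its lemma, universal POS tag, morphological features,
    # and the dependency relation to its syntactic head.
    for sentence in doc.sentences:
        for word in sentence.words:
            print(word.text, word.lemma, word.upos, word.feats, word.deprel)

    # Named entities are exposed at the document level.
    for entity in doc.entities:
        print(entity.text, entity.type)

The CoreNLP bridge mentioned above lives in stanza.server; a minimal sketch, assuming a local CoreNLP installation is configured (for example via the CORENLP_HOME environment variable):

    from stanza.server import CoreNLPClient

    # Start a CoreNLP server in the background and request coreference
    # resolution, one of the tasks the abstract notes is covered via CoreNLP.
    with CoreNLPClient(annotators="tokenize,ssplit,pos,lemma,ner,depparse,coref") as client:
        ann = client.annotate("Stanza wraps CoreNLP. It adds coreference resolution.")
        # ann is a protobuf Document; corefChain holds the resolved chains.
        for chain in ann.corefChain:
            print(chain)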
Related papers
- CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution [50.7413285637879]
The CRUXEval-X code reasoning benchmark covers 19 programming languages.
It comprises at least 600 subjects for each language, along with 19K content-consistent tests in total.
Even a model trained solely on Python can achieve at most 34.4% Pass@1 in other languages.
arXiv Detail & Related papers (2024-08-23T11:43:00Z)
- CMULAB: An Open-Source Framework for Training and Deployment of Natural Language Processing Models [59.91221728187576]
This paper introduces the CMU Linguistic Annotation Backend (CMULAB), an open-source framework that simplifies model deployment and continuous human-in-the-loop fine-tuning of NLP models.
CMULAB enables users to leverage the power of multilingual models to quickly adapt and extend existing tools for speech recognition, OCR, translation, and syntactic analysis to new languages.
arXiv Detail & Related papers (2024-04-03T02:21:46Z)
- Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for conditionally encoding each instance.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
- GlobalBench: A Benchmark for Global Progress in Natural Language Processing [114.24519009839142]
GlobalBench aims to track progress on all NLP datasets in all languages.
It tracks the estimated per-speaker utility and equity of language technology across all languages.
Currently, GlobalBench covers 966 datasets in 190 languages, and has 1,128 system submissions spanning 62 languages.
arXiv Detail & Related papers (2023-05-24T04:36:32Z)
- Examining Cross-lingual Contextual Embeddings with Orthogonal Structural Probes [0.2538209532048867]
A novel Orthogonal Structural Probe (Limisiewicz and Mareček, 2021) makes it possible to probe for specific linguistic features in cross-lingual contextual embeddings.
We evaluate syntactic (UD) and lexical (WordNet) structural information encoded in mBERT's contextual representations for nine diverse languages.
We successfully apply our findings to zero-shot and few-shot cross-lingual parsing.
arXiv Detail & Related papers (2021-09-10T15:03:11Z)
- pysentimiento: A Python Toolkit for Opinion Mining and Social NLP tasks [0.2826977330147589]
pysentimiento is a Python toolkit designed for opinion mining and other Social NLP tasks.
This open-source library provides state-of-the-art models for Spanish, English, Italian, and Portuguese behind an easy-to-use Python API.
We present a comprehensive assessment of performance for several pre-trained language models across a variety of tasks, languages, and datasets.
arXiv Detail & Related papers (2021-06-17T13:15:07Z)
- When Word Embeddings Become Endangered [0.685316573653194]
We present a method for constructing word embeddings for endangered languages using existing word embeddings of different resource-rich languages and translation dictionaries of resource-poor languages.
All our cross-lingual word embeddings and the sentiment analysis model have been released openly via an easy-to-use Python library.
arXiv Detail & Related papers (2021-03-24T15:42:53Z)
- Revisiting Language Encoding in Learning Multilingual Representations [70.01772581545103]
We propose a new approach called Cross-lingual Language Projection (XLP) to replace language embedding.
XLP projects the word embeddings into a language-specific semantic space, and the projected embeddings are then fed into the Transformer model.
Experiments show that XLP significantly boosts model performance on extensive multilingual benchmark datasets.
arXiv Detail & Related papers (2021-02-16T18:47:10Z)
- Improved acoustic word embeddings for zero-resource languages using multilingual transfer [37.78342106714364]
We train a single supervised embedding model on labelled data from multiple well-resourced languages and apply it to unseen zero-resource languages.
We consider three multilingual recurrent neural network (RNN) models: a classifier trained on the joint vocabularies of all training languages; a Siamese RNN trained to discriminate between same and different words from multiple languages; and a correspondence autoencoder (CAE) RNN trained to reconstruct word pairs.
All of these models outperform state-of-the-art unsupervised models trained on the zero-resource languages themselves, giving relative improvements of more than 30% in average precision.
arXiv Detail & Related papers (2020-06-02T12:28:34Z)