Spanish Legalese Language Model and Corpora
- URL: http://arxiv.org/abs/2110.12201v1
- Date: Sat, 23 Oct 2021 12:06:51 GMT
- Title: Spanish Legalese Language Model and Corpora
- Authors: Asier Gutiérrez-Fandiño, Jordi Armengol-Estapé, Aitor
Gonzalez-Agirre, Marta Villegas
- Abstract summary: Legal Spanish could be thought of as a variant of Spanish in its own right, as it is highly complicated in vocabulary, semantics and phrase understanding.
For this work we gathered legal-domain corpora from different sources, generated a model, and evaluated it against Spanish general-domain tasks.
- Score: 0.0629976670819788
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There are many Language Models for the English language, owing to its
worldwide relevance. However, for the Spanish language, even though it is widely
spoken, there are very few Spanish Language Models, and they tend to be small
and too general. Legal Spanish could be thought of as a variant of Spanish in
its own right, as it is highly complicated in vocabulary, semantics and phrase
understanding. For this work we gathered legal-domain corpora from different
sources, generated a model, and evaluated it against Spanish general-domain tasks.
The model provides reasonable results in those tasks.
Related papers
- MEL: Legal Spanish Language Model [0.3651422140724638]
This paper presents the development and evaluation of MEL, a legal language model based on XLM-RoBERTa-large.
Evaluation benchmarks show a significant improvement over baseline models in understanding the legal Spanish language.
arXiv Detail & Related papers (2025-01-27T12:50:10Z)
- MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling [70.34758460372629]
We introduce a new paradigm that encodes the same information with segments of consistent size across diverse languages.
MYTE produces shorter encodings for all 99 analyzed languages.
This, in turn, improves multilingual LM performance and diminishes the perplexity gap throughout diverse languages.
arXiv Detail & Related papers (2024-03-15T21:21:11Z)
- Evaluating Large Language Models on Controlled Generation Tasks [92.64781370921486]
We present an extensive analysis of various benchmarks including a sentence planning benchmark with different granularities.
After comparing large language models against state-of-the-art finetuned smaller models, we present a spectrum showing where large language models fall behind, are comparable to, or exceed the ability of smaller models.
arXiv Detail & Related papers (2023-10-23T03:48:24Z)
- The Less the Merrier? Investigating Language Representation in Multilingual Models [8.632506864465501]
We investigate the linguistic representation of different languages in multilingual models.
We observe from our experiments that community-centered models perform better at distinguishing between languages in the same family for low-resource languages.
arXiv Detail & Related papers (2023-10-20T02:26:34Z)
- Lost in Translation: Large Language Models in Non-English Content Analysis [0.0]
Large language models have become the dominant approach for building AI systems to analyze and generate language online.
Recently, researchers and technology companies have attempted to extend the capabilities of large language models into languages other than English.
arXiv Detail & Related papers (2023-06-12T19:10:47Z)
- Lessons learned from the evaluation of Spanish Language Models [27.653133576469276]
We present a head-to-head comparison of language models for Spanish.
We argue that more research is needed to understand the factors underlying the observed results.
The recent activity in the development of language technology for Spanish is to be welcomed, but our results show that building language models remains an open, resource-heavy problem.
arXiv Detail & Related papers (2022-12-16T10:33:38Z)
- Language Models are Multilingual Chain-of-Thought Reasoners [83.37148309771378]
We introduce the Multilingual Grade School Math (MGSM) benchmark, built by manually translating 250 grade-school math problems into ten typologically diverse languages.
We find that the ability to solve MGSM problems via chain-of-thought prompting emerges with increasing model scale.
We show that the multilingual reasoning abilities of language models extend to other tasks such as commonsense reasoning and word-in-context semantic judgment.
arXiv Detail & Related papers (2022-10-06T17:03:34Z)
- Analyzing the Mono- and Cross-Lingual Pretraining Dynamics of Multilingual Language Models [73.11488464916668]
This study investigates the dynamics of the multilingual pretraining process.
We probe checkpoints taken from throughout XLM-R pretraining, using a suite of linguistic tasks.
Our analysis shows that the model achieves high in-language performance early on, with lower-level linguistic skills acquired before more complex ones.
arXiv Detail & Related papers (2022-05-24T03:35:00Z)
- Evaluation Benchmarks for Spanish Sentence Representations [24.162683655834847]
We introduce Spanish SentEval and Spanish DiscoEval, aiming to assess the capabilities of stand-alone and discourse-aware sentence representations.
In addition, we evaluate and analyze the most recent pre-trained Spanish language models to exhibit their capabilities and limitations.
arXiv Detail & Related papers (2022-04-15T17:53:05Z)
- Do Multilingual Language Models Capture Differing Moral Norms? [71.52261949766101]
Massively multilingual sentence representations are trained on large corpora of uncurated data.
This may cause the models to grasp cultural values including moral judgments from the high-resource languages.
The lack of data in certain languages can also lead to the development of random and thus potentially harmful beliefs.
arXiv Detail & Related papers (2022-03-18T12:26:37Z)
- A large scale lexical and semantic analysis of Spanish language variations in Twitter [2.3511629321667096]
This manuscript presents a broad analysis describing lexical and semantic relationships among 26 Spanish-speaking countries around the globe.
We analyze four years of the geotagged Twitter public stream to provide an extensive survey of the Spanish language vocabularies of different countries.
arXiv Detail & Related papers (2021-10-12T16:21:03Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.