Since the Scientific Literature Is Multilingual, Our Models Should Be Too
- URL: http://arxiv.org/abs/2403.18251v1
- Date: Wed, 27 Mar 2024 04:47:10 GMT
- Title: Since the Scientific Literature Is Multilingual, Our Models Should Be Too
- Authors: Abteen Ebrahimi, Kenneth Church
- Abstract summary: We show that the literature is largely multilingual and argue that current models and benchmarks should reflect this linguistic diversity.
We provide evidence that text-based models fail to create meaningful representations for non-English papers and highlight the negative user-facing impacts of using English-only models indiscriminately across a multilingual domain.
- Score: 8.039428445336364
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: English has long been assumed the *lingua franca* of scientific research, and this notion is reflected in the natural language processing (NLP) research involving scientific document representation. In this position piece, we quantitatively show that the literature is largely multilingual and argue that current models and benchmarks should reflect this linguistic diversity. We provide evidence that text-based models fail to create meaningful representations for non-English papers and highlight the negative user-facing impacts of using English-only models indiscriminately across a multilingual domain. We end with suggestions for the NLP community on how to improve performance on non-English documents.
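The failure mode described in the abstract is easy to probe. Below is a minimal sketch, assuming the sentence-transformers library and two illustrative public checkpoints (neither is taken from the paper): it scores an abstract against its German translation with an English-focused encoder and with a multilingual one; only the multilingual model should rate the translation pair as highly similar.

```python
# Minimal probe of the claim above; the checkpoints are illustrative
# assumptions, not the models evaluated in the paper.
from sentence_transformers import SentenceTransformer, util

english_abstract = "We study multilingual representations of scientific papers."
german_abstract = "Wir untersuchen mehrsprachige Repräsentationen wissenschaftlicher Artikel."

for name in ["all-MiniLM-L6-v2",                       # English-focused encoder
             "paraphrase-multilingual-MiniLM-L12-v2"]: # multilingual encoder
    model = SentenceTransformer(name)
    emb = model.encode([english_abstract, german_abstract], convert_to_tensor=True)
    # A translation pair should score high; an English-only encoder
    # typically scores it much lower than a multilingual one.
    print(name, float(util.cos_sim(emb[0], emb[1])))
```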
Related papers
- Dictionary Insertion Prompting for Multilingual Reasoning on Multilingual Large Language Models [52.00446751692225]
We present a novel and simple yet effective method called Dictionary Insertion Prompting (DIP).
When given a non-English prompt, DIP looks up a word dictionary and inserts the words' English counterparts into the prompt for the LLM.
This enables better translation into English and better English reasoning steps, leading to noticeably better results.
arXiv Detail & Related papers (2024-11-02T05:10:50Z)
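A minimal sketch of the DIP idea as summarized above; the toy dictionary and the annotation format are illustrative stand-ins, not the paper's actual resources:

```python
# Sketch of Dictionary Insertion Prompting (DIP): annotate each source-language
# word with its English counterpart before prompting the LLM.
toy_dictionary = {"was": "what", "ist": "is", "die": "the",
                  "hauptstadt": "capital", "von": "of"}

def insert_dictionary(prompt: str) -> str:
    annotated = []
    for word in prompt.split():
        english = toy_dictionary.get(word.lower().strip("?"))
        annotated.append(f"{word} ({english})" if english else word)
    return " ".join(annotated)

print(insert_dictionary("Was ist die Hauptstadt von Frankreich?"))
# -> Was (what) ist (is) die (the) Hauptstadt (capital) von (of) Frankreich?
```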
- Breaking Boundaries: Investigating the Effects of Model Editing on Cross-linguistic Performance [6.907734681124986]
This paper identifies the need for linguistic equity by examining several knowledge editing techniques in multilingual contexts.
We evaluate the performance of models such as Mistral, TowerInstruct, OpenHathi, Tamil-Llama, and Kan-Llama across languages including English, German, French, Italian, Spanish, Hindi, Tamil, and Kannada.
arXiv Detail & Related papers (2024-06-17T01:54:27Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
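The lexical-diversity signal mentioned above can be approximated cheaply. A minimal sketch using type-token ratio, an illustrative metric choice rather than necessarily the paper's:

```python
def type_token_ratio(text: str) -> float:
    """Crude lexical-diversity estimate: unique tokens / total tokens."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

scraped = "the market the market sells fish fish and fish"
written = "pasar pagi itu menjual ikan segar serta sayuran lokal"  # native-written sample
print(type_token_ratio(scraped), type_token_ratio(written))
```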
- OpenMSD: Towards Multilingual Scientific Documents Similarity Measurement [11.602151258188862]
In this work, we develop and evaluate multilingual scientific document similarity measurement models.
We propose the first multilingual scientific documents dataset, Open-access Multilingual Scientific Documents (OpenMSD), which has 74M papers in 103 languages and 778M citation pairs.
arXiv Detail & Related papers (2023-09-19T11:38:39Z)
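A minimal sketch of the similarity-measurement task OpenMSD targets, assuming the sentence-transformers library and an illustrative multilingual checkpoint (not the models trained in the paper): rank candidate abstracts in other languages against a query abstract by cosine similarity.

```python
# Illustrative cross-lingual similarity ranking; checkpoint is an assumption.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
query = "Neural machine translation for low-resource languages."
candidates = [
    "Traduction automatique neuronale pour les langues peu dotées.",  # French, related
    "Ein Katalog seltener Schmetterlinge in Mitteleuropa.",           # German, unrelated
]
scores = util.cos_sim(model.encode([query]), model.encode(candidates))[0]
for text, score in sorted(zip(candidates, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {text}")
```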
- Multilingual Text Representation [3.4447129363520337]
Modern NLP breakthroughs include large multilingual models capable of performing tasks across more than 100 languages.
State-of-the-art language models have come a long way, starting from the simple one-hot representation of words.
We discuss how the full potential of language democratization could be realized, reaching beyond the known limits.
arXiv Detail & Related papers (2023-09-02T14:21:22Z)
- Testing the Ability of Language Models to Interpret Figurative Language [69.59943454934799]
Figurative and metaphorical language are commonplace in discourse.
It remains an open question to what extent modern language models can interpret nonliteral phrases.
We introduce Fig-QA, a Winograd-style nonliteral language understanding task.
arXiv Detail & Related papers (2022-04-26T23:42:22Z)
- Language Models are Few-shot Multilingual Learners [66.11011385895195]
We evaluate the multilingual skills of the GPT and T5 models in conducting multi-class classification on non-English languages.
We show that, given a few English examples as context, pre-trained language models can predict not only English test samples but also non-English ones.
arXiv Detail & Related papers (2021-09-16T03:08:22Z)
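A minimal sketch of the few-shot setup described above: English in-context examples followed by a non-English test sample. The template, labels, and examples are illustrative, not the paper's exact prompts.

```python
# Build a few-shot prompt with English demonstrations and a Spanish test input.
english_examples = [
    ("The movie was wonderful.", "positive"),
    ("The food was terrible.", "negative"),
]
test_input = "La película fue excelente."  # Spanish test sample

prompt = "\n".join(f"Review: {text}\nSentiment: {label}"
                   for text, label in english_examples)
prompt += f"\nReview: {test_input}\nSentiment:"
print(prompt)  # feed this to a pre-trained LM and read off the predicted label
```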
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representations from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
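A minimal sketch of the clustering step described above, with random placeholder vectors standing in for the language representations the paper derives from a multilingual pre-trained model:

```python
# Group per-language vectors into "representation sprachbunds" with k-means.
import numpy as np
from sklearn.cluster import KMeans

languages = ["en", "de", "fr", "hi", "ta", "kn"]
rng = np.random.default_rng(0)
language_vectors = rng.normal(size=(len(languages), 768))  # placeholder embeddings

groups = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(language_vectors)
for lang, group in zip(languages, groups):
    print(lang, "-> sprachbund", group)
```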
- Specializing Multilingual Language Models: An Empirical Study [50.7526245872855]
Contextualized word representations from pretrained multilingual language models have become the de facto standard for addressing natural language tasks.
For languages rarely or never seen by these models, directly using such models often results in suboptimal representation or use of data.
arXiv Detail & Related papers (2021-06-16T18:13:55Z)
- The Rediscovery Hypothesis: Language Models Need to Meet Linguistics [8.293055016429863]
We study whether linguistic knowledge is a necessary condition for good performance of modern language models.
We show that language models that are significantly compressed but perform well on their pretraining objectives retain good scores when probed for linguistic structures.
This result supports the rediscovery hypothesis and leads to the second contribution of our paper: an information-theoretic framework that relates the language modeling objective to linguistic information.
arXiv Detail & Related papers (2021-03-02T15:57:39Z)
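A minimal sketch of a probing setup in the spirit described above: a linear classifier trained on frozen features to predict a linguistic label. Both the features and the labels here are random placeholders, not the paper's data.

```python
# Linear probe: can a frozen representation predict a linguistic label?
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 64))  # stand-in for frozen LM features
pos_tags = rng.integers(0, 3, size=200)  # stand-in part-of-speech labels

probe = LogisticRegression(max_iter=1000).fit(embeddings, pos_tags)
print("probe accuracy:", probe.score(embeddings, pos_tags))
```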
This list is automatically generated from the titles and abstracts of the papers on this site.