OpenMSD: Towards Multilingual Scientific Documents Similarity
Measurement
- URL: http://arxiv.org/abs/2309.10539v1
- Date: Tue, 19 Sep 2023 11:38:39 GMT
- Title: OpenMSD: Towards Multilingual Scientific Documents Similarity
Measurement
- Authors: Yang Gao, Ji Ma, Ivan Korotkov, Keith Hall, Dana Alon, Don Metzler
- Abstract summary: We develop and evaluate multilingual scientific documents similarity measurement models in this work.
We propose the first multilingual scientific documents dataset, Open-access Multilingual Scientific Documents (OpenMSD), which has 74M papers in 103 languages and 778M citation pairs.
- Score: 11.602151258188862
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We develop and evaluate multilingual scientific documents similarity
measurement models in this work. Such models can be used to find related works
in different languages, which can help multilingual researchers find and
explore papers more efficiently. We propose the first multilingual scientific
documents dataset, Open-access Multilingual Scientific Documents (OpenMSD),
which has 74M papers in 103 languages and 778M citation pairs. With OpenMSD, we
pretrain science-specialized language models, and explore different strategies
to derive "related" paper pairs to fine-tune the models, including using a
mixture of citation, co-citation, and bibliographic-coupling pairs. To further
improve the models' performance for non-English papers, we explore the use of
generative language models to enrich the non-English papers with English
summaries. This allows us to leverage the models' English capabilities to
create better representations for non-English papers. Our best model
significantly outperforms strong baselines by 7-16% (in mean average
precision).
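The pairing strategies above are simple operations on the citation graph. As an illustration only (this is not the authors' released code; the paper IDs and edge list below are hypothetical), the following Python sketch derives citation, co-citation, and bibliographic-coupling pairs from an edge list in which (a, b) means paper a cites paper b:

from collections import defaultdict
from itertools import combinations

# Hypothetical toy edge list; OpenMSD itself provides 778M citation pairs.
citations = [
    ("p1", "p3"), ("p1", "p4"),
    ("p2", "p3"), ("p2", "p4"),
    ("p5", "p1"), ("p5", "p2"),
]

cites = defaultdict(set)     # paper -> papers it cites (its reference list)
cited_by = defaultdict(set)  # paper -> papers that cite it
for src, dst in citations:
    cites[src].add(dst)
    cited_by[dst].add(src)

# Citation pairs: a cites b directly.
citation_pairs = set(citations)

# Co-citation pairs: two papers that appear together in some reference list.
co_citation = {pair for refs in cites.values()
               for pair in combinations(sorted(refs), 2)}

# Bibliographic-coupling pairs: two papers that cite a common third paper.
coupling = {pair for citers in cited_by.values()
            for pair in combinations(sorted(citers), 2)}

print(sorted(co_citation))  # [('p1', 'p2'), ('p3', 'p4')]
print(sorted(coupling))     # [('p1', 'p2')]

For the summary-enrichment step, the abstract indicates that a generative model produces an English summary of each non-English paper, which is then used alongside the original text (for example, appended to it before encoding, though the exact mechanism is not specified here) so that the model's stronger English representations can be exploited.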
Related papers
- Since the Scientific Literature Is Multilingual, Our Models Should Be Too [8.039428445336364]
We show that the literature is largely multilingual and argue that current models and benchmarks should reflect this linguistic diversity.
We provide evidence that text-based models fail to create meaningful representations for non-English papers and highlight the negative user-facing impacts of using English-only models non-discriminately across a multilingual domain.
arXiv Detail & Related papers (2024-03-27T04:47:10Z)
- GlossLM: Multilingual Pretraining for Low-Resource Interlinear Glossing [39.846419973203744]
We compile the largest existing corpus of interlinear glossed text (IGT) data from a variety of sources, covering over 450k examples across 1.8k languages.
We pretrain a large multilingual model on our corpus, outperforming SOTA models by up to 6.6%.
We will make our pretrained model and dataset available through Hugging Face, as well as provide access through a web interface for use in language documentation efforts.
arXiv Detail & Related papers (2024-03-11T03:21:15Z)
- Towards Better Monolingual Japanese Retrievers with Multi-Vector Models [0.0]
For Japanese, the best-performing deep-learning-based retrieval approaches rely on multilingual dense embedders.
We introduce JaColBERT, a family of multi-vector retrievers trained on two orders of magnitude less data than their multilingual counterparts.
arXiv Detail & Related papers (2023-12-26T18:07:05Z)
- Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages [76.35234803589412]
MPM is an effective training paradigm for training large multimodal models in non-English languages.
We build VisCPM, large multimodal models for image-to-text and text-to-image generation, which achieve state-of-the-art open-source performance in Chinese.
arXiv Detail & Related papers (2023-08-23T09:55:41Z)
- PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLM) trained on 640 billion tokens, available in two model sizes: 1.7B and 13B.
To enhance its multilingual capabilities, we 1) integrate bilingual data into the training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage of pre-training.
Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
arXiv Detail & Related papers (2023-07-12T09:00:37Z)
- MIReAD: Simple Method for Learning High-quality Representations from Scientific Documents [77.34726150561087]
We propose MIReAD, a simple method that learns high-quality representations of scientific papers.
We train MIReAD on more than 500,000 PubMed and arXiv abstracts across over 2,000 journal classes.
arXiv Detail & Related papers (2023-05-07T03:29:55Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-centric models brings gains of more than 10 BLEU when translating directly between non-English directions, while performing competitively with the best single systems of WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
- A Bayesian Multilingual Document Model for Zero-shot Topic Identification and Discovery [1.9215779751499527]
The model is an extension of BaySMM [Kesiraju et al. 2020] to the multilingual scenario.
We propagate the learned uncertainties through linear classifiers that benefit zero-shot cross-lingual topic identification.
We revisit cross-lingual topic identification in zero-shot settings by taking a deeper dive into current datasets.
arXiv Detail & Related papers (2020-07-02T19:55:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.