Multilingual Text Representation
- URL: http://arxiv.org/abs/2309.00949v1
- Date: Sat, 2 Sep 2023 14:21:22 GMT
- Title: Multilingual Text Representation
- Authors: Fahim Faisal
- Abstract summary: Modern NLP breakthroughs include large multilingual models capable of performing tasks across more than 100 languages.
State-of-the-art language models have come a long way from the simple one-hot representation of words.
We discuss how the full potential of language democratization could be realized, reaching beyond the known limits.
- Score: 3.4447129363520337
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern NLP breakthroughs include large multilingual models capable of
performing tasks across more than 100 languages. State-of-the-art language
models have come a long way from the simple one-hot representation of words to
models capable of natural language understanding, common-sense reasoning, and
question answering, capturing both the syntax and semantics of text. At the
same time, language models are expanding beyond our known language boundaries,
performing competitively even on very low-resource dialects of endangered
languages. However, problems remain to be solved to ensure an equitable
representation of text through a unified modeling space across languages and
speakers. In this survey, we shed light on this iterative progression of
multilingual text representation and discuss the driving factors that
ultimately led to the current state of the art. Subsequently, we discuss how
the full potential of language democratization could be realized beyond the
known limits, and what the scope for improvement in that space is.
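To ground the starting point the abstract mentions, here is a minimal sketch of one-hot word representation; the toy vocabulary is invented for illustration:

```python
import numpy as np

# Toy vocabulary; real systems index tens of thousands of (sub)words.
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word: str) -> np.ndarray:
    """Return a |V|-dimensional vector with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[word_to_id[word]] = 1.0
    return vec

# Distinct words are always orthogonal: one-hot vectors encode identity but
# no similarity, the limitation that dense, contextual models later addressed.
print(one_hot("cat") @ one_hot("mat"))  # 0.0
```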
Related papers
- Teaching a Multilingual Large Language Model to Understand Multilingual Speech via Multi-Instructional Training [29.47243668154796]
BLOOMZMMS is a novel model that integrates a multilingual LLM with a multilingual speech encoder.
We demonstrate the transferability of linguistic knowledge from the text to the speech modality.
Our zero-shot evaluation results confirm the robustness of our approach across multiple tasks.
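The abstract does not spell out how the two components are coupled; a common pattern is to project speech-encoder states into the LLM's embedding space. A hypothetical PyTorch sketch (the class name and dimensions are assumptions, not taken from the paper):

```python
import torch
import torch.nn as nn

class SpeechToLLMAdapter(nn.Module):
    """Hypothetical bridge: maps speech-encoder states into an LLM's
    embedding space so text-trained instruction skills can transfer."""
    def __init__(self, speech_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(speech_dim, llm_dim)

    def forward(self, speech_states: torch.Tensor) -> torch.Tensor:
        # (batch, frames, speech_dim) -> (batch, frames, llm_dim);
        # the result would be prepended to text token embeddings.
        return self.proj(speech_states)

adapter = SpeechToLLMAdapter()
frames = torch.randn(2, 50, 1024)  # fake speech-encoder output
print(adapter(frames).shape)       # torch.Size([2, 50, 4096])
```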
arXiv Detail & Related papers (2024-04-16T21:45:59Z)
- Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
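A hedged reading of "contextually retrieves prompts": keep a pool of learnable prompt vectors and mix them per instance by similarity. All names and sizes below are assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class PromptRetriever(nn.Module):
    """Illustrative prompt pool: each prompt block has a key; an instance
    representation selects a soft mixture of blocks to prepend."""
    def __init__(self, pool_size: int = 8, hidden: int = 768, prompt_len: int = 4):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(pool_size, hidden))
        self.prompts = nn.Parameter(torch.randn(pool_size, prompt_len, hidden))

    def forward(self, instance_repr: torch.Tensor) -> torch.Tensor:
        # instance_repr: (batch, hidden), e.g. mean-pooled token embeddings.
        weights = torch.softmax(instance_repr @ self.keys.T, dim=-1)  # (batch, pool)
        # Weighted mixture of prompt blocks, prepended to the encoder input.
        return torch.einsum("bp,plh->blh", weights, self.prompts)

retriever = PromptRetriever()
mixed = retriever(torch.randn(2, 768))
print(mixed.shape)  # torch.Size([2, 4, 768])
```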
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
- Multilingual Multi-Figurative Language Detection [14.799109368073548]
Figurative language understanding is highly understudied in a multilingual setting.
We introduce multilingual multi-figurative language modelling, and provide a benchmark for sentence-level figurative language detection.
We develop a framework for figurative language detection based on template-based prompt learning.
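A minimal sketch of template-based prompting for this detection task, using the Hugging Face fill-mask pipeline; the template and label words are illustrative assumptions, not the paper's design:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="xlm-roberta-base")

def detect_figurative(sentence: str) -> str:
    # Score verbalizer words at the masked slot of a hand-written template.
    template = f"{sentence} This sentence uses <mask> language."
    scores = {}
    for label in ("figurative", "literal"):
        # targets= restricts scoring to the label words; multi-subword
        # targets trigger a warning and fall back to the first subword.
        scores[label] = fill(template, targets=[label])[0]["score"]
    return max(scores, key=scores.get)

print(detect_figurative("He has a heart of stone."))
```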
arXiv Detail & Related papers (2023-05-31T18:52:41Z)
- Testing the Ability of Language Models to Interpret Figurative Language [69.59943454934799]
Figurative and metaphorical language are commonplace in discourse.
It remains an open question to what extent modern language models can interpret nonliteral phrases.
We introduce Fig-QA, a Winograd-style nonliteral language understanding task.
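The task pairs a metaphor with two candidate readings. A common zero-shot baseline (not necessarily the paper's) scores each ending with a causal LM; the example item is invented, and gpt2 is used only for brevity:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def lm_logprob(text: str) -> float:
    """Total log-likelihood of a sequence under the LM."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)       # loss = mean NLL per token
    return -out.loss.item() * (ids.shape[1] - 1)

# Both endings have the same length, so total log-probs are comparable.
phrase = "Her room was a pigsty, meaning it was"
better = max(["very messy.", "very clean."],
             key=lambda end: lm_logprob(f"{phrase} {end}"))
print(better)  # expected: "very messy."
```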
arXiv Detail & Related papers (2022-04-26T23:42:22Z)
- Expanding Pretrained Models to Thousands More Languages via Lexicon-based Adaptation [133.7313847857935]
Our study highlights how NLP methods can be adapted to thousands more languages that are under-served by current technology.
For 19 under-represented languages across 3 tasks, our methods lead to consistent improvements of up to 5 and 15 points with and without extra monolingual text, respectively.
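The core idea, sketched under assumptions (the toy lexicon and fallback rule are illustrative): translate labeled high-resource text word by word through a bilingual lexicon to synthesize silver training data for an under-served language:

```python
# Hypothetical bilingual lexicon entries (English -> Spanish).
lexicon = {"the": "la", "cat": "gato", "sleeps": "duerme"}

def synthesize(sentence: str, label: str) -> tuple[str, str]:
    # Words missing from the lexicon are kept as-is (a common fallback).
    translated = " ".join(lexicon.get(w, w) for w in sentence.lower().split())
    return translated, label  # the task label transfers unchanged

print(synthesize("The cat sleeps", "PRESENT_TENSE"))
# ('la gato duerme', 'PRESENT_TENSE')
```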
arXiv Detail & Related papers (2022-03-17T16:48:22Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
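A minimal sketch of the clustering step, assuming one vector per language (e.g. the mean of its sentence representations from a multilingual encoder); the vectors and cluster count below are placeholders:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
languages = ["en", "de", "fr", "hi", "ur", "zh"]
lang_vecs = rng.normal(size=(len(languages), 768))  # placeholder embeddings

# Group languages into "representation sprachbunds" by embedding proximity.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(lang_vecs)
for lang, group in zip(languages, kmeans.labels_):
    print(lang, "-> sprachbund", group)
# Each group would then receive its own pre-trained model.
```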
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
- Specializing Multilingual Language Models: An Empirical Study [50.7526245872855]
Contextualized word representations from pretrained multilingual language models have become the de facto standard for addressing natural language tasks.
For languages rarely or never seen by these models, directly using such models often results in suboptimal representation or use of data.
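One common specialization recipe, given here as an assumption rather than the paper's exact method, is continued masked-language-model training on target-language text before fine-tuning:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")
opt = torch.optim.AdamW(model.parameters(), lr=5e-5)

corpus = ["placeholder target-language sentence one.",
          "placeholder target-language sentence two."]
for text in corpus:  # toy loop; real runs stream a monolingual corpus
    batch = tok(text, return_tensors="pt")
    labels = batch.input_ids.clone()
    # Mask ~15% of positions (simplified vs. the usual 80/10/10 scheme).
    mask = torch.rand(labels.shape) < 0.15
    mask[0, 1] = True  # guarantee at least one masked token
    inputs = batch.input_ids.masked_fill(mask, tok.mask_token_id)
    out = model(input_ids=inputs, attention_mask=batch.attention_mask,
                labels=labels.masked_fill(~mask, -100))  # loss on masks only
    out.loss.backward(); opt.step(); opt.zero_grad()
```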
arXiv Detail & Related papers (2021-06-16T18:13:55Z)
- Generalising Multilingual Concept-to-Text NLG with Language Agnostic Delexicalisation [0.40611352512781856]
Concept-to-text Natural Language Generation is the task of expressing an input meaning representation in natural language.
We propose Language Agnostic Delexicalisation, a novel delexicalisation method that uses multilingual pretrained embeddings.
Our experiments across five datasets and five languages show that multilingual models outperform monolingual models in concept-to-text.
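A sketch of the delexicalisation step: values from the input meaning representation are replaced with placeholders, matched by embedding similarity so that inflected or translated variants can still align. The embed() stub below is a stand-in, not a real multilingual embedding:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder vector, NOT a real embedding: identical strings get
    # identical vectors, so exact matches score highest in this demo.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=8)

def delexicalise(sentence: str, values: dict[str, str]) -> str:
    tokens = sentence.split()
    for slot, value in values.items():
        sims = [float(embed(t) @ embed(value)) for t in tokens]
        tokens[int(np.argmax(sims))] = f"<{slot}>"  # most similar token wins
    return " ".join(tokens)

mr = {"name": "Aromi", "food": "sushi"}  # invented meaning representation
print(delexicalise("Aromi serves sushi downtown", mr))
# <name> serves <food> downtown
```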
arXiv Detail & Related papers (2021-05-07T17:48:53Z)
- AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
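Reconstructing the task format from the abstract (the items below are invented): given a target word in two contexts, possibly in different languages, predict whether it keeps the same meaning. A majority-class baseline shows the evaluation loop a real cross-lingual encoder would plug into:

```python
examples = [  # invented items for illustration
    {"word": "bank", "ctx1": "She sat on the river bank.",
     "ctx2": "Er arbeitet bei einer Bank.", "same_meaning": False},
    {"word": "cold", "ctx1": "The water was cold.",
     "ctx2": "Das Wasser war kalt.", "same_meaning": True},
]

def predict(example: dict) -> bool:
    # Majority-class baseline; replace with encoder-based similarity.
    return True

accuracy = sum(predict(e) == e["same_meaning"] for e in examples) / len(examples)
print(f"baseline accuracy: {accuracy:.2f}")
```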
arXiv Detail & Related papers (2021-04-17T20:23:45Z)