Overcoming Language Disparity in Online Content Classification with
Multimodal Learning
- URL: http://arxiv.org/abs/2205.09744v1
- Date: Thu, 19 May 2022 17:56:02 GMT
- Title: Overcoming Language Disparity in Online Content Classification with
Multimodal Learning
- Authors: Gaurav Verma, Rohit Mujumdar, Zijie J. Wang, Munmun De Choudhury,
Srijan Kumar
- Abstract summary: Large language models are now the standard to develop state-of-the-art solutions for text detection and classification tasks.
The development of advanced computational techniques and resources is disproportionately focused on the English language.
We explore the promise of incorporating the information contained in images via multimodal machine learning.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advances in Natural Language Processing (NLP) have revolutionized the way
researchers and practitioners address crucial societal problems. Large language
models are now the standard to develop state-of-the-art solutions for text
detection and classification tasks. However, the development of advanced
computational techniques and resources is disproportionately focused on the
English language, sidelining a majority of the languages spoken globally. While
existing research has developed better multilingual and monolingual language
models to bridge this language disparity between English and non-English
languages, we explore the promise of incorporating the information contained in
images via multimodal machine learning. Our comparative analyses on three
detection tasks focusing on crisis information, fake news, and emotion
recognition, as well as five high-resource non-English languages, demonstrate
that: (a) detection frameworks based on pre-trained large language models like
BERT and multilingual-BERT systematically perform better on the English
language compared against non-English languages, and (b) including images via
multimodal learning bridges this performance gap. We situate our findings with
respect to existing work on the pitfalls of large language models, and discuss
their theoretical and practical implications. Resources for this paper are
available at https://multimodality-language-disparity.github.io/.
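As a rough illustration of what "including images via multimodal learning" can look like in code, the following is a minimal sketch of late fusion by concatenation, followed by a binary logistic scorer. It is not the authors' framework: the function names `fuse` and `predict_proba`, the toy feature vectors, and the zero weights are all hypothetical and chosen only to keep the example self-contained. In practice the text features might come from a model such as BERT or multilingual-BERT and the image features from a vision encoder.

```python
import math
from typing import List, Sequence


def fuse(text_vec: Sequence[float], image_vec: Sequence[float]) -> List[float]:
    """Late fusion by concatenation: [text features | image features]."""
    return list(text_vec) + list(image_vec)


def predict_proba(features: Sequence[float],
                  weights: Sequence[float],
                  bias: float = 0.0) -> float:
    """Binary logistic scorer over the fused feature vector."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    z = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Toy, made-up embeddings (assumption: real ones would come from pre-trained encoders).
text_vec = [0.2, -0.1]
image_vec = [0.5, 0.3, -0.4]

fused = fuse(text_vec, image_vec)

# Untrained (all-zero) weights, so the scorer is maximally uncertain.
weights = [0.0] * len(fused)
p = predict_proba(fused, weights)
```

Concatenation is only one of several fusion strategies; it is used here because it is the simplest to show, not because the paper is known to use it.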
Related papers
- Lens: Rethinking Multilingual Enhancement for Large Language Models [70.85065197789639]
Lens is a novel approach to enhancing the multilingual capabilities of large language models (LLMs).
It operates by manipulating the hidden representations within the language-agnostic and language-specific subspaces from top layers of LLMs.
It achieves superior results with far fewer computational resources than existing post-training approaches.
arXiv Detail & Related papers (2024-10-06T08:51:30Z)
- ColBERT-XM: A Modular Multi-Vector Representation Model for Zero-Shot
Multilingual Information Retrieval [10.664434993386523]
Current approaches circumvent the lack of high-quality labeled data in non-English languages.
We present a novel modular dense retrieval model that learns from the rich data of a single high-resource language.
arXiv Detail & Related papers (2024-02-23T02:21:24Z)
- xCoT: Cross-lingual Instruction Tuning for Cross-lingual
Chain-of-Thought Reasoning [36.34986831526529]
Chain-of-thought (CoT) has emerged as a powerful technique to elicit reasoning in large language models.
We propose a cross-lingual instruction fine-tuning framework (xCOT) to transfer knowledge from high-resource languages to low-resource languages.
arXiv Detail & Related papers (2024-01-13T10:53:53Z)
- Towards Bridging the Digital Language Divide [4.234367850767171]
Multilingual language processing systems often exhibit a hardwired, yet usually involuntary and hidden, representational preference towards certain languages.
We show that biased technology is often the result of research and development methodologies that do not do justice to the complexity of the languages being represented.
We present a new initiative that aims at reducing linguistic bias through both technological design and methodology.
arXiv Detail & Related papers (2023-07-25T10:53:20Z)
- On the cross-lingual transferability of multilingual prototypical models
across NLU tasks [2.44288434255221]
Supervised deep learning-based approaches have been applied to task-oriented dialog and have proven to be effective for limited domain and language applications.
In practice, these approaches suffer from the drawbacks of domain-driven design and under-resourced languages.
This article investigates the cross-lingual transferability of combining few-shot learning with prototypical neural networks and multilingual Transformer-based models.
arXiv Detail & Related papers (2022-07-19T09:55:04Z)
- A Survey of Multilingual Models for Automatic Speech Recognition [6.657361001202456]
Cross-lingual transfer is an attractive solution to the problem of multilingual Automatic Speech Recognition.
Recent advances in Self Supervised Learning are opening up avenues for unlabeled speech data to be used in multilingual ASR models.
We present best practices for building multilingual models from research across diverse languages and techniques.
arXiv Detail & Related papers (2022-02-25T09:31:40Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
- Reinforced Iterative Knowledge Distillation for Cross-Lingual Named
Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from a rich-resource language to low-resource languages.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z)
- AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages
with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
- Crossing the Conversational Chasm: A Primer on Multilingual
Task-Oriented Dialogue Systems [51.328224222640614]
Current state-of-the-art ToD models based on large pretrained neural language models are data hungry.
Data acquisition for ToD use cases is expensive and tedious.
arXiv Detail & Related papers (2021-04-17T15:19:56Z)
- XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating
Cross-lingual Generalization [128.37244072182506]
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark evaluates the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks.
We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models.
arXiv Detail & Related papers (2020-03-24T19:09:37Z)