Semantically Enriched Cross-Lingual Sentence Embeddings for Crisis-related Social Media Texts
- URL: http://arxiv.org/abs/2403.16614v1
- Date: Mon, 25 Mar 2024 10:44:38 GMT
- Title: Semantically Enriched Cross-Lingual Sentence Embeddings for Crisis-related Social Media Texts
- Authors: Rabindra Lamsal, Maria Rodriguez Read, Shanika Karunasekera
- Abstract summary: Tasks such as semantic search and clustering on crisis-related social media texts enhance our comprehension of crisis discourse.
Pre-trained language models have advanced performance in crisis informatics, but their contextual embeddings lack semantic meaningfulness.
We propose multi-lingual sentence encoders that embed crisis-related social media texts for over 50 languages.
- Score: 3.690904966341072
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Tasks such as semantic search and clustering on crisis-related social media texts enhance our comprehension of crisis discourse, aiding decision-making and targeted interventions. Pre-trained language models have advanced performance in crisis informatics, but their contextual embeddings lack semantic meaningfulness. Although the CrisisTransformers family includes a sentence encoder to address the semanticity issue, it remains monolingual, processing only English texts. Furthermore, employing separate models for different languages leads to embeddings in distinct vector spaces, introducing challenges when comparing semantic similarities between multi-lingual texts. Therefore, we propose multi-lingual sentence encoders (CT-XLMR-SE and CT-mBERT-SE) that embed crisis-related social media texts for over 50 languages, such that texts with similar meanings are in close proximity within the same vector space, irrespective of language diversity. Results in sentence encoding and sentence matching tasks are promising, suggesting these models could serve as robust baselines when embedding multi-lingual crisis-related social media texts. The models are publicly available at: https://huggingface.co/crisistransformers.
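The abstract describes embedding multilingual texts into a single shared vector space so that texts with similar meanings sit close together regardless of language. A minimal sketch of how that proximity is typically measured, using cosine similarity over synthetic embedding vectors (the toy vectors below are illustrative stand-ins; a real encoder such as CT-XLMR-SE would produce much higher-dimensional outputs):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic 4-d embeddings standing in for sentence-encoder outputs.
emb_en = np.array([0.90, 0.10, 0.00, 0.20])   # e.g. an English crisis tweet
emb_es = np.array([0.85, 0.15, 0.05, 0.25])   # a paraphrase in another language
emb_off = np.array([0.00, 0.10, 0.90, 0.00])  # an unrelated text

# In a well-aligned cross-lingual space, the paraphrase scores far
# higher than the unrelated text, irrespective of source language.
print(cosine_similarity(emb_en, emb_es))   # high: same meaning
print(cosine_similarity(emb_en, emb_off))  # low: different meaning
```

This is the basis of the semantic search and matching tasks the paper evaluates: rank candidate texts by cosine similarity to a query embedding.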
Related papers
- CReMa: Crisis Response through Computational Identification and Matching of Cross-Lingual Requests and Offers Shared on Social Media [5.384787836425144]
This study addresses the challenge of efficiently identifying and matching assistance requests and offers on social media platforms during emergencies.
We propose CReMa, a systematic approach that integrates textual, temporal, and spatial features for multi-lingual request-offer matching.
We introduce a novel multi-lingual dataset that simulates scenarios of help-seeking and offering assistance on social media across the 16 most commonly used languages in Australia.
arXiv Detail & Related papers (2024-05-20T09:30:03Z)
- Spatial Semantic Recurrent Mining for Referring Image Segmentation [63.34997546393106]
We propose S²RM to achieve high-quality cross-modality fusion.
It follows a three-stage working strategy: language feature distribution, spatial semantic recurrent co-parsing, and parsed-semantic balancing.
Our proposed method performs favorably against other state-of-the-art algorithms.
arXiv Detail & Related papers (2024-05-15T00:17:48Z)
- A General and Flexible Multi-concept Parsing Framework for Multilingual Semantic Matching [60.51839859852572]
We propose to decompose the text into multiple concepts for multilingual semantic matching, freeing the model from its reliance on NER models.
We conduct comprehensive experiments on the English datasets QQP and MRPC, and the Chinese dataset Medical-SM.
arXiv Detail & Related papers (2024-03-05T13:55:16Z)
- CrisisTransformers: Pre-trained language models and sentence encoders for crisis-related social media texts [3.690904966341072]
Social media platforms play an essential role in crisis communication, but analyzing crisis-related social media texts is challenging due to their informal nature.
This study introduces CrisisTransformers, an ensemble of pre-trained language models and sentence encoders trained on an extensive corpus of over 15 billion word tokens from tweets.
arXiv Detail & Related papers (2023-09-11T14:36:16Z)
- mCL-NER: Cross-Lingual Named Entity Recognition via Multi-view Contrastive Learning [54.523172171533645]
Cross-lingual named entity recognition (CrossNER) faces challenges stemming from uneven performance due to the scarcity of multilingual corpora.
We propose Multi-view Contrastive Learning for Cross-lingual Named Entity Recognition (mCL-NER)
Our experiments on the XTREME benchmark, spanning 40 languages, demonstrate the superiority of mCL-NER over prior data-driven and model-based approaches.
arXiv Detail & Related papers (2023-08-17T16:02:29Z)
- VECO 2.0: Cross-lingual Language Model Pre-training with Multi-granularity Contrastive Learning [56.47303426167584]
We propose a cross-lingual pre-trained model VECO2.0 based on contrastive learning with multi-granularity alignments.
Specifically, sequence-to-sequence alignment is induced to maximize the similarity of parallel pairs and minimize that of non-parallel pairs.
Token-to-token alignment is integrated to bridge the gap between synonymous tokens, excavated via a thesaurus dictionary, and the other unpaired tokens in a bilingual instance.
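The sequence-to-sequence alignment described above is a contrastive objective: within a batch, each source sentence should score highest against its own parallel translation and lower against every non-parallel sentence. A minimal InfoNCE-style sketch (the batch size, temperature, and toy embeddings are illustrative assumptions, not details from the paper):

```python
import numpy as np

def info_nce_loss(src: np.ndarray, tgt: np.ndarray, temperature: float = 0.1) -> float:
    """Contrastive loss over a batch: src[i] should match tgt[i] (the parallel
    pair) and be pushed away from every tgt[j], j != i (non-parallel pairs)."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    logits = src @ tgt.T / temperature               # pairwise cosine similarities
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))       # minimized when diagonals dominate

rng = np.random.default_rng(0)
src = rng.normal(size=(4, 8))
# Well-aligned targets (small perturbations of src) vs. random targets.
loss_aligned = info_nce_loss(src, src + 0.01 * rng.normal(size=(4, 8)))
loss_random = info_nce_loss(src, rng.normal(size=(4, 8)))
print(loss_aligned < loss_random)  # aligned parallel pairs yield lower loss
```

Minimizing this loss pulls parallel pairs together and pushes non-parallel pairs apart, which is the behavior the VECO 2.0 summary describes.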
arXiv Detail & Related papers (2023-04-17T12:23:41Z)
- InfoCTM: A Mutual Information Maximization Perspective of Cross-Lingual Topic Modeling [40.54497836775837]
Cross-lingual topic models have been prevalent for cross-lingual text analysis by revealing aligned latent topics.
Most existing methods suffer from producing repetitive topics that hinder further analysis, and from performance declines caused by low-coverage dictionaries.
We propose Cross-lingual Topic Modeling with Mutual Information (InfoCTM) to produce more coherent, diverse, and well-aligned topics.
arXiv Detail & Related papers (2023-04-07T08:49:43Z)
- VECO: Variable and Flexible Cross-lingual Pre-training for Language Understanding and Generation [77.82373082024934]
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages.
It can effectively avoid the degeneration of predicting masked words conditioned only on the context in their own language.
The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark.
arXiv Detail & Related papers (2020-10-30T03:41:38Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed and is not responsible for any consequences arising from its use.