Characterising User Content on a Multi-lingual Social Network
- URL: http://arxiv.org/abs/2004.11480v1
- Date: Thu, 23 Apr 2020 22:25:48 GMT
- Title: Characterising User Content on a Multi-lingual Social Network
- Authors: Pushkal Agarwal, Kiran Garimella, Sagar Joglekar, Nishanth Sastry, Gareth Tyson
- Abstract summary: We present our characterisation of a multilingual social network in India called ShareChat.
We collect an exhaustive dataset across 72 weeks before and during the Indian general elections of 2019 across 14 languages.
We find that Telugu, Malayalam, Tamil and Kannada languages tend to be dominant in soliciting political images.
- Score: 9.13241181020543
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Social media has been on the vanguard of political information diffusion in
the 21st century. Most studies that look into disinformation, political
influence and fake-news focus on mainstream social media platforms. This has
inevitably made English an important factor in our current understanding of
political activity on social media. As a result, there have been only a limited
number of studies into a large portion of the world, including the largest
multilingual and multi-cultural democracy: India. In this paper we present our
characterisation of a multilingual social network in India called ShareChat. We
collect an exhaustive dataset across 72 weeks before and during the Indian
general elections of 2019, across 14 languages. We investigate the cross-lingual
dynamics by clustering visually similar images together, and exploring
how they move across language barriers. We find that Telugu, Malayalam, Tamil
and Kannada languages tend to be dominant in soliciting political images (often
referred to as memes), and posts from Hindi have the largest cross-lingual
diffusion across ShareChat (as well as images containing text in English). In
the case of images containing text that cross language barriers, we see that
language translation is used to widen the accessibility. That said, we find
cases where the same image is associated with very different text (and
therefore meanings). This initial characterisation paves the way for more
advanced pipelines to understand the dynamics of fake and political content in
a multi-lingual and non-textual setting.
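The cross-lingual analysis described in the abstract (cluster visually similar images, then check which languages each cluster appears in) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' pipeline: the abstract does not say which similarity measure was used, so a simple average hash with a Hamming-distance threshold stands in for it, and the post fields (`id`, `lang`, `pixels`), the toy pixel grids, and the `max_distance` value are all hypothetical.

```python
def average_hash(pixels):
    """Bit-tuple hash: 1 where a pixel is brighter than the image mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p > mean else 0 for p in flat)


def hamming(a, b):
    """Number of positions at which two equal-length hashes differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_posts(posts, max_distance=1):
    """Greedy clustering: a post joins the first cluster whose
    representative hash is within max_distance bits, else starts a new one."""
    clusters = []  # list of (representative_hash, member_posts)
    for post in posts:
        h = average_hash(post["pixels"])
        for rep, members in clusters:
            if hamming(rep, h) <= max_distance:
                members.append(post)
                break
        else:
            clusters.append((h, [post]))
    return [members for _, members in clusters]


def languages_per_cluster(clusters):
    """Sorted set of languages in which each image cluster was posted."""
    return [sorted({p["lang"] for p in members}) for members in clusters]


# Hypothetical toy data: 2x2 grayscale grids standing in for real images.
posts = [
    {"id": 1, "lang": "Hindi", "pixels": [[10, 200], [220, 15]]},
    {"id": 2, "lang": "Telugu", "pixels": [[12, 198], [225, 11]]},
    {"id": 3, "lang": "Tamil", "pixels": [[240, 5], [8, 230]]},
]

clusters = cluster_posts(posts)
spread = languages_per_cluster(clusters)
print(spread)
```

A cluster whose language list has more than one entry marks an image that crossed a language barrier. On real data one would likely swap the toy hash for a perceptual hash or learned image embedding computed on resized images.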
Related papers
- CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark [68.21939124278065]
Culturally-diverse multilingual Visual Question Answering benchmark designed to cover a rich set of languages and cultures.
CVQA includes culturally-driven images and questions from across 30 countries on four continents, covering 31 languages with 13 scripts, providing a total of 10k questions.
We benchmark several Multimodal Large Language Models (MLLMs) on CVQA, and show that the dataset is challenging for the current state-of-the-art models.
arXiv Detail & Related papers (2024-06-10T01:59:00Z)
- Multilingual Diversity Improves Vision-Language Representations [66.41030381363244]
Pre-training on this dataset outperforms using English-only or English-dominated datasets on ImageNet.
On a geographically diverse task like GeoDE, we also observe improvements across all regions, with the biggest gain coming from Africa.
arXiv Detail & Related papers (2024-05-27T08:08:51Z)
- An image speaks a thousand words, but can everyone listen? On image transcreation for cultural relevance [53.974497865647336]
We take a first step towards translating images to make them culturally relevant.
We build three pipelines comprising state-of-the-art generative models to do the task.
We conduct a human evaluation of translated images to assess for cultural relevance and meaning preservation.
arXiv Detail & Related papers (2024-04-01T17:08:50Z)
- Evolving linguistic divergence on polarizing social media [0.0]
We quantify divergence in topics of conversation and word frequencies, messaging sentiment, and lexical semantics of words and emoji.
While US American English remains largely intelligible within its large speech community, our findings point at areas where miscommunication may arise.
arXiv Detail & Related papers (2023-09-04T15:21:55Z)
- Multi-lingual and Multi-cultural Figurative Language Understanding [69.47641938200817]
Figurative language permeates human communication, but is relatively understudied in NLP.
We create a dataset for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba.
Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region.
All languages exhibit a significant deficiency compared to English, with variations in performance reflecting the availability of pre-training and fine-tuning data.
arXiv Detail & Related papers (2023-05-25T15:30:31Z)
- Bridging Nations: Quantifying the Role of Multilinguals in Communication on Social Media [14.646734380673648]
We quantify multilingual users' structural role and communication influence in cross-lingual information exchange.
Having a multilingual network neighbor increases monolinguals' odds of sharing domains and hashtags from another language 16-fold and 4-fold, respectively.
By highlighting information exchange across borders, this work sheds light on a crucial component of how information and ideas spread around the world.
arXiv Detail & Related papers (2023-04-07T18:01:25Z)
- BERTuit: Understanding Spanish language in Twitter through a native transformer [70.77033762320572]
We present BERTuit, the largest transformer proposed so far for the Spanish language, pre-trained on a massive dataset of 230M Spanish tweets.
Our motivation is to provide a powerful resource to better understand Spanish Twitter and to be used on applications focused on this social network.
arXiv Detail & Related papers (2022-04-07T14:28:51Z)
- Multilingual Abusiveness Identification on Code-Mixed Social Media Text [1.8275108630751844]
We propose an approach for abusiveness identification on the multilingual Moj dataset, which comprises Indic languages.
Our approach tackles the common challenges of non-English social media content and can be extended to other languages as well.
arXiv Detail & Related papers (2022-03-01T12:23:25Z)
- M2H2: A Multimodal Multiparty Hindi Dataset For Humor Recognition in Conversations [72.81164101048181]
We propose a dataset for Multimodal Multiparty Hindi Humor (M2H2) recognition in conversations, containing 6,191 utterances from 13 episodes of a very popular TV series, "Shrimaan Shrimati Phir Se".
Each utterance is annotated with humor/non-humor labels and encompasses acoustic, visual, and textual modalities.
The empirical results on the M2H2 dataset demonstrate that multimodal information complements unimodal information for humor recognition.
arXiv Detail & Related papers (2021-08-03T02:54:09Z)
- Sentiment Analysis for Roman Urdu Text over Social Media, a Comparative Study [0.0]
Roman Urdu is one of the most dominant languages on social networks in Pakistan and India.
In this article, we address the prior concepts and strategies used to examine the sentiment of Roman Urdu text.
arXiv Detail & Related papers (2020-10-05T16:19:00Z)
- Images and Misinformation in Political Groups: Evidence from WhatsApp in India [6.421670116083633]
We study a large collection of politically-oriented WhatsApp groups in India, focusing on the period leading up to the 2019 Indian national elections.
By labeling samples of random and popular images, we find that around 13% of shared images are known misinformation.
Machine learning methods can be used to predict whether a viral image is misinformation, but are brittle to shifts in content over time.
arXiv Detail & Related papers (2020-05-19T23:00:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.