Characterizing English Variation across Social Media Communities with BERT
- URL: http://arxiv.org/abs/2102.06820v1
- Date: Fri, 12 Feb 2021 23:50:57 GMT
- Title: Characterizing English Variation across Social Media Communities with BERT
- Authors: Li Lucy and David Bamman
- Abstract summary: We analyze two months of English comments in 474 Reddit communities.
The specificity of different sense clusters to a community, combined with the specificity of a community's unique word types, is used to identify cases where a social group's language deviates from the norm.
We find that communities with highly distinctive language are medium-sized, and their loyal and highly engaged users interact in dense networks.
- Score: 9.98785450861229
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Much previous work characterizing language variation across Internet social
groups has focused on the types of words used by these groups. We extend this
type of study by employing BERT to characterize variation in the senses of
words as well, analyzing two months of English comments in 474 Reddit
communities. The specificity of different sense clusters to a community,
combined with the specificity of a community's unique word types, is used to
identify cases where a social group's language deviates from the norm. We
validate our metrics using user-created glossaries and draw on sociolinguistic
theories to connect language variation with trends in community behavior. We
find that communities with highly distinctive language are medium-sized, and
their loyal and highly engaged users interact in dense networks.
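
The abstract does not say how "specificity to a community" is computed. As a rough illustration only, the sketch below assumes normalized pointwise mutual information (NPMI) between an item and a community, a common choice for this kind of measure. The function name `npmi_specificity` and the toy counts are hypothetical. The items can be word types or, assuming a separate step has already assigned BERT-based sense cluster IDs to tokens, those cluster IDs.

```python
import math
from collections import Counter


def npmi_specificity(community_counts):
    """Score how specific each item is to each community with NPMI.

    community_counts maps a community name to a Counter of item counts
    (word types, or sense-cluster IDs from an assumed earlier step).
    Returns a dict keyed by (community, item) with values in [-1, 1].
    """
    total = sum(sum(c.values()) for c in community_counts.values())
    item_totals = Counter()
    for counts in community_counts.values():
        item_totals.update(counts)

    scores = {}
    for community, counts in community_counts.items():
        p_comm = sum(counts.values()) / total
        for item, n in counts.items():
            p_joint = n / total
            p_item = item_totals[item] / total
            if p_joint == 1.0:
                # Degenerate corpus (one item, one community): NPMI undefined.
                scores[(community, item)] = 0.0
                continue
            pmi = math.log(p_joint / (p_item * p_comm))
            # Dividing by -log p(item, community) bounds the score to [-1, 1].
            scores[(community, item)] = pmi / -math.log(p_joint)
    return scores


# Toy counts, invented for illustration; not data from the paper.
toy = {
    "gaming": Counter({"gg": 8, "the": 2}),
    "news": Counter({"the": 10}),
}
scores = npmi_specificity(toy)
print(scores[("gaming", "gg")] > scores[("gaming", "the")])
```

In this toy corpus "gg" appears only in the hypothetical "gaming" community, so it scores higher there than the shared word "the". A community-level measure could then aggregate these scores, but how the paper aggregates them is not stated in the abstract.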
Related papers
- ValueScope: Unveiling Implicit Norms and Values via Return Potential Model of Social Interactions [47.85181608392683]
We employ ValueScope to dissect and analyze linguistic and stylistic expressions across 13 Reddit communities.
Our analysis provides a quantitative foundation showing that even closely related communities exhibit remarkably diverse norms.
arXiv Detail & Related papers (2024-07-02T17:51:27Z)
- Unraveling Code-Mixing Patterns in Migration Discourse: Automated Detection and Analysis of Online Conversations on Reddit [4.019533549688538]
This paper explores the utilization of code-mixing, a communication strategy prevalent among multilingual speakers, in migration-related discourse on social media platforms such as Reddit.
We present Ensemble Learning for Identification of Code-mixed Texts (ELMICT), a novel approach designed to automatically detect code-mixed messages in migration-related discussions.
arXiv Detail & Related papers (2024-06-12T20:30:34Z)
- Echo-chambers and Idea Labs: Communication Styles on Twitter [51.13560635563004]
This paper investigates the communication styles and structures of Twitter (X) communities within the vaccination context.
By shedding light on the nuanced nature of communication within social networks, this study emphasizes the significance of understanding the diversity of perspectives within online communities.
arXiv Detail & Related papers (2024-03-28T13:55:51Z)
- Comparing Measures of Linguistic Diversity Across Social Media Language Data and Census Data at Subnational Geographic Areas [1.0128808054306186]
This paper describes the comparative linguistic ecology of online spaces (i.e., social media language data) and real-world spaces in Aotearoa New Zealand.
We compare measures of linguistic diversity between these different spaces and discuss how social media users align with real-world populations.
arXiv Detail & Related papers (2023-08-21T03:54:23Z)
- Multi-lingual and Multi-cultural Figurative Language Understanding [69.47641938200817]
Figurative language permeates human communication, but is relatively understudied in NLP.
We create a dataset for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba.
Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region.
All languages exhibit a significant deficiency compared to English, with variations in performance reflecting the availability of pre-training and fine-tuning data.
arXiv Detail & Related papers (2023-05-25T15:30:31Z)
- Comparing Biases and the Impact of Multilingual Training across Multiple Languages [70.84047257764405]
We present a bias analysis across Italian, Chinese, English, Hebrew, and Spanish on the downstream sentiment analysis task.
We adapt existing sentiment bias templates in English to Italian, Chinese, Hebrew, and Spanish for four attributes: race, religion, nationality, and gender.
Our results reveal similarities in bias expression such as favoritism of groups that are dominant in each language's culture.
arXiv Detail & Related papers (2023-05-18T18:15:07Z)
- From words to connections: Word use similarity as an honest signal conducive to employees' digital communication [0.0]
We analyse the communication of close to 1600 employees, interacting on the intranet communication forum of a large company.
We find that word use similarity is the main driver of interaction, much more than other language characteristics or similarity in network position.
Our results suggest carefully choosing the language according to the target audience and have practical implications for both company managers and online community administrators.
arXiv Detail & Related papers (2021-11-11T10:32:33Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
- How individuals change language [1.2437226707039446]
We introduce a very general mathematical model that encompasses a wide variety of individual-level linguistic behaviours.
We compare the likelihood of empirically-attested changes in definite and indefinite articles in multiple languages under different assumptions.
We find that accounts of language change that appeal primarily to errors in childhood language acquisition are very weakly supported by the historical data.
arXiv Detail & Related papers (2021-04-20T19:02:49Z)
- The structure of online social networks modulates the rate of lexical change [7.4037154707453965]
We conduct a large-scale analysis of over 80k neologisms in 4420 online communities across a decade.
Using Poisson regression and survival analysis, our study demonstrates that the community's network structure plays a significant role in lexical change.
arXiv Detail & Related papers (2021-04-11T13:06:28Z)
- Gender Bias in Multilingual Embeddings and Cross-Lingual Transfer [101.58431011820755]
We study gender bias in multilingual embeddings and how it affects transfer learning for NLP applications.
We create a multilingual dataset for bias analysis and propose several ways for quantifying bias in multilingual representations.
arXiv Detail & Related papers (2020-05-02T04:34:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.