HebID: Detecting Social Identities in Hebrew-language Political Text
- URL: http://arxiv.org/abs/2508.15483v2
- Date: Sun, 12 Oct 2025 12:44:47 GMT
- Title: HebID: Detecting Social Identities in Hebrew-language Political Text
- Authors: Guy Mor-Lan, Naama Rivlin-Angert, Yael R. Kaplan, Tamir Sheafer, Shaul R. Shenhav
- Abstract summary: We introduce HebID, the first multilabel Hebrew corpus for social identity detection. We benchmark multilabel and single-label encoders alongside 2B-9B-parameter generative LLMs, finding that Hebrew-tuned LLMs provide the best results. We utilize identity choices from a national public survey, enabling a comparison between identities portrayed in elite discourse and the public's identity priorities.
- Score: 1.435381256004719
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Political language is deeply intertwined with social identities. While social identities are often shaped by specific cultural contexts and expressed through particular uses of language, existing datasets for group and identity detection are predominantly English-centric, single-label and focus on coarse identity categories. We introduce HebID, the first multilabel Hebrew corpus for social identity detection: 5,536 sentences from Israeli politicians' Facebook posts (Dec 2018-Apr 2021), manually annotated for twelve nuanced social identities (e.g. Rightist, Ultra-Orthodox, Socially-oriented) grounded by survey data. We benchmark multilabel and single-label encoders alongside 2B-9B-parameter generative LLMs, finding that Hebrew-tuned LLMs provide the best results (macro-$F_1$ = 0.74). We apply our classifier to politicians' Facebook posts and parliamentary speeches, evaluating differences in popularity, temporal trends, clustering patterns, and gender-related variations in identity expression. We utilize identity choices from a national public survey, enabling a comparison between identities portrayed in elite discourse and the public's identity priorities. HebID provides a comprehensive foundation for studying social identities in Hebrew and can serve as a model for similar research in other non-English political contexts.
Related papers
- CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data [56.043078390377076]
We introduce CommonLID, a community-driven, human-annotated LID benchmark for the web domain.
We show CommonLID's value by using it, alongside five other common evaluation sets, to test eight popular LID models.
We highlight that existing evaluations overestimate LID accuracy for many languages in the web domain.
arXiv Detail & Related papers (2026-01-25T22:49:30Z)
- A Tale of Two Identities: An Ethical Audit of Human and AI-Crafted Personas [7.3656495945307086]
Large language models (LLMs) are increasingly used to generate synthetic personas in data-limited domains.
This paper audits synthetic personas generated by 3 LLMs through the lens of representational harm, focusing specifically on racial identity.
Our findings reveal that LLMs disproportionately foreground racial markers, overproduce culturally coded language, and construct personas that are syntactically elaborate yet narratively reductive.
These patterns result in a range of sociotechnical harms, including stereotyping, exoticism, erasure, and benevolent bias, that are often obfuscated by superficially positive narrations.
arXiv Detail & Related papers (2025-05-07T20:12:48Z)
- BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages [93.92804151830744]
We present BRIGHTER, a collection of multi-labeled, emotion-annotated datasets in 28 different languages.
We highlight the challenges related to the data collection and annotation processes.
We show that the BRIGHTER datasets represent a meaningful step towards addressing the gap in text-based emotion recognition.
arXiv Detail & Related papers (2025-02-17T15:39:50Z)
- GIEBench: Towards Holistic Evaluation of Group Identity-based Empathy for Large Language Models [18.92131015111012]
We introduce GIEBench, a benchmark for empathy evaluation of large language models (LLMs).
GIEBench includes 11 identity dimensions, covering 97 group identities with a total of 999 single-choice questions related to specific group identities.
Our evaluation of 23 LLMs revealed that while these LLMs understand different identity standpoints, they fail to consistently exhibit equal empathy across these identities without explicit instructions to adopt those perspectives.
arXiv Detail & Related papers (2024-06-21T06:50:42Z)
- CIVICS: Building a Dataset for Examining Culturally-Informed Values in Large Language Models [59.22460740026037]
The "CIVICS: Culturally-Informed & Values-Inclusive Corpus for Societal impacts" dataset is designed to evaluate the social and cultural variation of Large Language Models (LLMs).
We create a hand-crafted, multilingual dataset of value-laden prompts which address specific socially sensitive topics, including LGBTQI rights, social welfare, immigration, disability rights, and surrogacy.
arXiv Detail & Related papers (2024-05-22T20:19:10Z)
- Silver-Tongued and Sundry: Exploring Intersectional Pronouns with ChatGPT [25.5053022752019]
We studied the case of identity simulation through Japanese first-person pronouns.
Pronouns evoke perceptions of social identities in ChatGPT at the intersections of gender, age, region, and formality.
This work highlights the importance of pronoun use for social identity simulation, provides a language-based methodology for culturally-sensitive persona development, and advances the potential of intersectional identities in intelligent agents.
arXiv Detail & Related papers (2024-05-13T23:38:50Z)
- mCL-NER: Cross-Lingual Named Entity Recognition via Multi-view Contrastive Learning [54.523172171533645]
Cross-lingual named entity recognition (CrossNER) faces challenges stemming from uneven performance due to the scarcity of multilingual corpora.
We propose Multi-view Contrastive Learning for Cross-lingual Named Entity Recognition (mCL-NER).
Our experiments on the XTREME benchmark, spanning 40 languages, demonstrate the superiority of mCL-NER over prior data-driven and model-based approaches.
arXiv Detail & Related papers (2023-08-17T16:02:29Z)
- "I'm fully who I am": Towards Centering Transgender and Non-Binary Voices to Measure Biases in Open Language Generation [69.25368160338043]
Transgender and non-binary (TGNB) individuals disproportionately experience discrimination and exclusion from daily life.
We assess how the social reality surrounding experienced marginalization of TGNB persons contributes to and persists within Open Language Generation.
We introduce TANGO, a dataset of template-based real-world text curated from a TGNB-oriented community.
arXiv Detail & Related papers (2023-05-17T04:21:45Z)
- How Hate Speech Varies by Target Identity: A Computational Analysis [5.746505534720595]
We investigate how hate speech varies in systematic ways according to the identities it targets.
We find that the targeted demographic category appears to have a greater effect on the language of hate speech than does the relative social power of the targeted identity group.
arXiv Detail & Related papers (2022-10-19T19:06:23Z)
- Protecting gender and identity with disentangled speech representations [49.00162808063399]
We show that protecting gender information in speech is more effective than modelling speaker-identity information.
We present a novel way to encode gender information and disentangle two sensitive biometric identifiers.
arXiv Detail & Related papers (2021-04-22T13:31:41Z)
- Listener's Social Identity Matters in Personalised Response Generation [19.35779310590447]
We investigate how the listener's identity influences the language used in Chinese dialogues on social media.
The experiment results demonstrate that the listener's identity indeed matters in the language use of responses.
By additionally modelling the listener's identity, the personalised response generator performs better in its own identity.
arXiv Detail & Related papers (2020-10-27T14:57:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.