BTPD: A Multilingual Hand-curated Dataset of Bengali Transnational Political Discourse Across Online Communities
- URL: http://arxiv.org/abs/2506.06813v1
- Date: Sat, 07 Jun 2025 14:43:35 GMT
- Title: BTPD: A Multilingual Hand-curated Dataset of Bengali Transnational Political Discourse Across Online Communities
- Authors: Dipto Das, Syed Ishtiaque Ahmed, Shion Guha
- Abstract summary: We present a multilingual dataset of Bengali political discourse (BTPD) collected from three online platforms. This paper also provides a general overview of its topics and multilingual content.
- Score: 25.55378198149251
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding political discourse in online spaces is crucial for analyzing public opinion and ideological polarization. While social computing and computational linguistics have explored such discussions in English, such research efforts are significantly limited in major yet under-resourced languages like Bengali due to the unavailability of datasets. In this paper, we present a multilingual dataset of Bengali transnational political discourse (BTPD) collected from three online platforms, each representing distinct community structures and interaction dynamics. Besides describing how we hand-curated the dataset through community-informed keyword-based retrieval, this paper also provides a general overview of its topics and multilingual content.
Related papers
- Low-Resource, High-Impact: Building Corpora for Inclusive Language Technologies [11.52881045684005]
This tutorial is designed for NLP practitioners, researchers, and developers working with multilingual and low-resource languages. Participants will walk away with a practical toolkit for building end-to-end NLP pipelines for underrepresented languages.
arXiv Detail & Related papers (2025-12-16T16:44:17Z)
- Awal -- Community-Powered Language Technology for Tamazight [0.21687011163378758]
Awal is a community-powered initiative for developing language technology resources for Tamazight. We analyze 18 months of community engagement, revealing significant barriers to participation. The modest scale of community contributions highlights the limitations of applying standard crowdsourcing approaches to languages with complex sociolinguistic contexts.
arXiv Detail & Related papers (2025-10-31T11:53:05Z)
- Probing Politico-Economic Bias in Multilingual Large Language Models: A Cultural Analysis of Low-Resource Pakistani Languages [6.5137518437747]
This paper presents a systematic analysis of political bias in 13 large language models (LLMs) across five low-resource languages spoken in Pakistan. Our method combines quantitative assessment of political orientation across economic (left-right) and social (libertarian-authoritarian) axes with qualitative analysis of framing through content, style, and emphasis. Our results reveal that LLMs predominantly align with liberal-left values, echoing Western training data influences, but exhibit notable shifts toward authoritarian framing in regional languages.
arXiv Detail & Related papers (2025-05-29T15:15:42Z)
- Multilingual Topic Classification in X: Dataset and Analysis [19.725017254962918]
We introduce X-Topic, a multilingual dataset featuring content in four distinct languages (English, Spanish, Japanese, and Greek).
Our dataset includes a wide range of topics, tailored for social media content, making it a valuable resource for scientists and professionals working on cross-linguistic analysis.
arXiv Detail & Related papers (2024-10-04T01:37:26Z)
- Socially Responsible Data for Large Multilingual Language Models [12.338723881042926]
Large Language Models (LLMs) have rapidly increased in size and apparent capabilities in the last three years.
Various efforts are striving for models to accommodate languages of communities outside of the Global North.
arXiv Detail & Related papers (2024-09-08T23:51:04Z)
- FREDSum: A Dialogue Summarization Corpus for French Political Debates [26.76383031532945]
We present a dataset of French political debates for the purpose of enhancing resources for multi-lingual dialogue summarization.
Our dataset consists of manually transcribed and annotated political debates, covering a range of topics and perspectives.
arXiv Detail & Related papers (2023-12-08T05:42:04Z)
- Multi-EuP: The Multilingual European Parliament Dataset for Analysis of Bias in Information Retrieval [62.82448161570428]
This dataset is designed to investigate fairness in a multilingual information retrieval context.
It boasts an authentic multilingual corpus, featuring topics translated into all 24 languages.
It offers rich demographic information associated with its documents, facilitating the study of demographic bias.
arXiv Detail & Related papers (2023-11-03T12:29:11Z)
- Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z)
- Multi3WOZ: A Multilingual, Multi-Domain, Multi-Parallel Dataset for Training and Evaluating Culturally Adapted Task-Oriented Dialog Systems [64.40789703661987]
Multi3WOZ is a novel multilingual, multi-domain, multi-parallel ToD dataset.
It is large-scale and offers culturally adapted dialogs in 4 languages.
We describe a complex bottom-up data collection process that yielded the final dataset.
arXiv Detail & Related papers (2023-07-26T08:29:42Z)
- Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages [62.730361829175415]
MIRACL is a multilingual dataset we have built for the WSDM 2023 Cup challenge.
It focuses on ad hoc retrieval across 18 different languages.
Our goal is to spur research that will improve retrieval across a continuum of languages.
arXiv Detail & Related papers (2022-10-18T16:47:18Z)
- Cross-Lingual Dialogue Dataset Creation via Outline-Based Generation [70.81596088969378]
Cross-lingual Outline-based Dialogue dataset (termed COD) enables natural language understanding.
COD enables dialogue state tracking, and end-to-end dialogue modelling and evaluation in 4 diverse languages.
arXiv Detail & Related papers (2022-01-31T18:11:21Z)
- GlobalWoZ: Globalizing MultiWoZ to Develop Multilingual Task-Oriented Dialogue Systems [66.92182084456809]
We introduce a novel data curation method that generates GlobalWoZ -- a large-scale multilingual ToD dataset from an English ToD dataset.
Our method is based on translating dialogue templates and filling them with local entities in the target-language countries.
We release our dataset as well as a set of strong baselines to encourage research on learning multilingual ToD systems for real use cases.
arXiv Detail & Related papers (2021-10-14T19:33:04Z)
- The first large scale collection of diverse Hausa language datasets [0.0]
Hausa is considered a well-studied and documented language among the sub-Saharan African languages.
It is estimated that over 100 million people speak the language.
We provide an expansive collection of curated datasets consisting of both formal and informal forms of the language.
arXiv Detail & Related papers (2021-02-13T19:34:20Z)
- DeepHateExplainer: Explainable Hate Speech Detection in Under-resourced Bengali Language [1.2246649738388389]
We propose an explainable approach for hate speech detection from the under-resourced Bengali language.
In our approach, Bengali texts are first comprehensively preprocessed, before classifying them into political, personal, geopolitical, and religious hates.
Evaluations against machine learning (linear and tree-based models) and deep neural networks (i.e., CNN, Bi-LSTM, and Conv-LSTM with word embeddings) baselines yield F1 scores of 84%, 90%, 88%, and 88%, for political, personal, geopolitical, and religious hates, respectively.
arXiv Detail & Related papers (2020-12-28T16:46:03Z)