MMT: A Multilingual and Multi-Topic Indian Social Media Dataset
- URL: http://arxiv.org/abs/2304.00634v1
- Date: Sun, 2 Apr 2023 21:39:00 GMT
- Title: MMT: A Multilingual and Multi-Topic Indian Social Media Dataset
- Authors: Dwip Dalal, Vivek Srivastava, Mayank Singh
- Abstract summary: Social media plays a significant role in cross-cultural communication.
A vast amount of this occurs in code-mixed and multilingual form.
We introduce a large-scale multilingual and multi-topic dataset (MMT) collected from Twitter.
- Score: 1.0413233169366503
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Social media plays a significant role in cross-cultural communication. A vast
amount of this occurs in code-mixed and multilingual form, posing a significant
challenge to Natural Language Processing (NLP) tools for processing such
information, like language identification, topic modeling, and named-entity
recognition. To address this, we introduce a large-scale multilingual and
multi-topic dataset (MMT) collected from Twitter (1.7 million tweets),
encompassing 13 coarse-grained and 63 fine-grained topics in the Indian
context. We further annotate a subset of 5,346 tweets from the MMT dataset with
various Indian languages and their code-mixed counterparts. Also, we
demonstrate that the currently existing tools fail to capture the linguistic
diversity in MMT on two downstream tasks, i.e., topic modeling and language
identification. To facilitate future research, we will make the anonymized and
annotated dataset available in the public domain.
Related papers
- M2DS: Multilingual Dataset for Multi-document Summarisation [0.5071800070021028]
Multi-document Summarisation (MDS) has resulted in diverse datasets covering customer reviews, academic papers, medical and legal documents, and news articles.
However, the English-centric nature of these datasets has created a conspicuous void for multilingual datasets in today's globalised digital landscape.
This paper introduces M2DS, emphasising its unique multilingual aspect, and includes baseline scores from state-of-the-art MDS models evaluated on our dataset.
arXiv Detail & Related papers (2024-07-17T06:25:51Z)
- Breaking Language Barriers: A Question Answering Dataset for Hindi and Marathi [1.03590082373586]
This paper focuses on developing a Question Answering dataset for two such languages: Hindi and Marathi.
Despite Hindi being the 3rd most spoken language worldwide, and Marathi being the 11th most spoken language globally, both languages face limited resources for building efficient Question Answering systems.
We release the largest Question-Answering dataset available for these languages, with each dataset containing 28,000 samples.
arXiv Detail & Related papers (2023-08-19T00:39:21Z)
- Multi3WOZ: A Multilingual, Multi-Domain, Multi-Parallel Dataset for Training and Evaluating Culturally Adapted Task-Oriented Dialog Systems [64.40789703661987]
Multi3WOZ is a novel multilingual, multi-domain, multi-parallel ToD dataset.
It is large-scale and offers culturally adapted dialogs in 4 languages.
We describe a complex bottom-up data collection process that yielded the final dataset.
arXiv Detail & Related papers (2023-07-26T08:29:42Z)
- Evaluating Inter-Bilingual Semantic Parsing for Indian Languages [9.838755823660147]
We propose an Inter-bilingual Seq2seq Semantic parsing dataset IE-SEMPARSE for 11 distinct Indian languages.
We highlight the proposed task's practicality, and evaluate existing multilingual seq2seq models across several train-test strategies.
arXiv Detail & Related papers (2023-04-25T17:24:32Z)
- MUTANT: A Multi-sentential Code-mixed Hinglish Dataset [16.14337612590717]
We propose a novel task of identifying multi-sentential code-mixed text (MCT) from multilingual articles.
As a use case, we leverage multilingual articles and build a first-of-its-kind multi-sentential code-mixed Hinglish dataset.
The MUTANT dataset comprises 67k articles with 85k identified Hinglish MCTs.
arXiv Detail & Related papers (2023-02-23T04:04:18Z)
- MULTI3NLU++: A Multilingual, Multi-Intent, Multi-Domain Dataset for Natural Language Understanding in Task-Oriented Dialogue [115.32009638844059]
We extend the English-only NLU++ dataset to include manual translations into a range of high-, medium-, and low-resource languages.
Because of its multi-intent property, MULTI3NLU++ represents complex and natural user goals.
We use MULTI3NLU++ to benchmark state-of-the-art multilingual models for the Natural Language Understanding tasks of intent detection and slot labelling.
arXiv Detail & Related papers (2022-12-20T17:34:25Z)
- LVP-M3: Language-aware Visual Prompt for Multilingual Multimodal Machine Translation [94.33019040320507]
Multimodal Machine Translation (MMT) focuses on enhancing text-only translation with visual features.
Recent advances still struggle to train a separate model for each language pair, which is costly and unaffordable when the number of languages increases.
We propose the Multilingual MMT task by establishing two new Multilingual MMT benchmark datasets covering seven languages.
arXiv Detail & Related papers (2022-10-19T12:21:39Z)
- Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages [62.730361829175415]
MIRACL is a multilingual dataset we have built for the WSDM 2023 Cup challenge.
It focuses on ad hoc retrieval across 18 different languages.
Our goal is to spur research that will improve retrieval across a continuum of languages.
arXiv Detail & Related papers (2022-10-18T16:47:18Z)
- M2H2: A Multimodal Multiparty Hindi Dataset For Humor Recognition in Conversations [72.81164101048181]
We propose a dataset for Multimodal Multiparty Hindi Humor (M2H2) recognition in conversations, containing 6,191 utterances from 13 episodes of a very popular TV series, "Shrimaan Shrimati Phir Se".
Each utterance is annotated with humor/non-humor labels and encompasses acoustic, visual, and textual modalities.
The empirical results on M2H2 dataset demonstrate that multimodal information complements unimodal information for humor recognition.
arXiv Detail & Related papers (2021-08-03T02:54:09Z)
- Multilingual and code-switching ASR challenges for low resource Indian languages [59.2906853285309]
We focus on building multilingual and code-switching ASR systems through two different subtasks related to a total of seven Indian languages.
We provide a total of 600 hours of transcribed speech data, comprising train and test sets, in these languages.
We also provide a baseline recipe for both the tasks with a WER of 30.73% and 32.45% on the test sets of multilingual and code-switching subtasks, respectively.
arXiv Detail & Related papers (2021-04-01T03:37:01Z)
- NUIG-Shubhanker@Dravidian-CodeMix-FIRE2020: Sentiment Analysis of Code-Mixed Dravidian text using XLNet [0.0]
Social media has penetrated multilingual societies; however, most of them use English as a preferred language for communication.
It seems natural for them to mix their cultural language with English during conversations, resulting in an abundance of multilingual data, called code-mixed data, available in today's world.
Downstream NLP tasks using such data are challenging because the semantics are spread across multiple languages.
This paper uses an auto-regressive XLNet model to perform sentiment analysis on code-mixed Tamil-English and Malayalam-English datasets.
arXiv Detail & Related papers (2020-10-15T14:09:02Z)