COSMMIC: Comment-Sensitive Multimodal Multilingual Indian Corpus for Summarization and Headline Generation
- URL: http://arxiv.org/abs/2506.15372v1
- Date: Wed, 18 Jun 2025 11:38:23 GMT
- Title: COSMMIC: Comment-Sensitive Multimodal Multilingual Indian Corpus for Summarization and Headline Generation
- Authors: Raghvendra Kumar, S. A. Mohammed Salman, Aryan Sahu, Tridib Nandi, Pragathi Y. P., Sriparna Saha, Jose G. Moreno
- Abstract summary: This study introduces COSMMIC, a comment-sensitive multimodal, multilingual dataset featuring nine major Indian languages. COSMMIC comprises 4,959 article-image pairs and 24,484 reader comments, with ground-truth summaries available in all included languages. To assess the dataset's effectiveness, we employ state-of-the-art language models such as Llama-3 and GPT-4.
- Score: 10.9454163542891
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite progress in comment-aware multimodal and multilingual summarization for English and Chinese, research in Indian languages remains limited. This study addresses this gap by introducing COSMMIC, a pioneering comment-sensitive multimodal, multilingual dataset featuring nine major Indian languages. COSMMIC comprises 4,959 article-image pairs and 24,484 reader comments, with ground-truth summaries available in all included languages. Our approach enhances summaries by integrating reader insights and feedback. We explore summarization and headline generation across four configurations: (1) using article text alone, (2) incorporating user comments, (3) utilizing images, and (4) combining text, comments, and images. To assess the dataset's effectiveness, we employ state-of-the-art language models such as Llama-3 and GPT-4. We conduct a comprehensive study to evaluate different component combinations, including identifying supportive comments, filtering out noise with a dedicated IndicBERT-based comment classifier, and extracting valuable insights from images with a multilingual CLIP-based classifier. This helps determine the most effective configurations for natural language generation (NLG) tasks. Unlike many existing datasets that are either text-only or lack user comments in multimodal settings, COSMMIC uniquely integrates text, images, and user feedback. This holistic approach bridges gaps in Indian language resources, advancing NLP research and fostering inclusivity.
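As a concrete illustration of the pipeline described in the abstract, the sketch below shows how reader comments might be filtered with an IndicBERT-based classifier and how the four input configurations could be assembled. This is a minimal sketch under stated assumptions: the "ai4bharat/indic-bert" checkpoint, the binary label convention, and the helper names are illustrative choices, not the authors' released code.

```python
# Hedged sketch of comment filtering + input assembly; not the authors' code.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-bert")
# In the paper's setting this head would be fine-tuned to separate
# supportive comments from noise; as loaded here it is randomly initialized.
classifier = AutoModelForSequenceClassification.from_pretrained(
    "ai4bharat/indic-bert", num_labels=2
)

def filter_comments(comments: list[str]) -> list[str]:
    """Keep only comments the classifier labels as supportive (label 1 assumed)."""
    kept = []
    for comment in comments:
        inputs = tokenizer(comment, return_tensors="pt", truncation=True)
        with torch.no_grad():
            label = classifier(**inputs).logits.argmax(dim=-1).item()
        if label == 1:
            kept.append(comment)
    return kept

def build_input(article: str, comments: list[str] | None = None,
                image_caption: str | None = None) -> str:
    """Assemble one of the four configurations: (1) article only,
    (2) article + comments, (3) article + image, (4) all three."""
    parts = [article]
    if comments:
        parts.append("Reader comments:\n" + "\n".join(filter_comments(comments)))
    if image_caption:
        parts.append("Image description: " + image_caption)
    return "\n\n".join(parts)
```

The assembled string would then be fed to a generation model such as Llama-3 or GPT-4 to produce the summary or headline for the chosen configuration.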
Related papers
- SwitchLingua: The First Large-Scale Multilingual and Multi-Ethnic Code-Switching Dataset [34.40254709148148]
Code-Switching (CS) is the alternating use of two or more languages within a conversation or utterance. This linguistic phenomenon poses challenges for Automatic Speech Recognition (ASR) systems. SwitchLingua is the first large-scale multilingual and multi-ethnic code-switching dataset.
arXiv Detail & Related papers (2025-05-30T05:54:46Z)
- Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation [20.109615198034394]
We propose Kaleidoscope as the most comprehensive exam benchmark to date for the multilingual evaluation of vision-language models. Kaleidoscope covers 18 languages and 14 different subjects, amounting to a total of 20,911 multiple-choice questions. We evaluate top-performing multilingual vision-language models and find that they perform poorly on low-resource languages and in complex multimodal scenarios.
arXiv Detail & Related papers (2025-04-09T17:43:16Z)
- COMI-LINGUA: Expert Annotated Large-Scale Dataset for Multitask NLP in Hindi-English Code-Mixing [1.3062731746155414]
COMI-LINGUA is the largest manually annotated Hindi-English code-mixed dataset. It comprises 125K+ high-quality instances across five core NLP tasks. Each instance is annotated by three bilingual annotators, yielding over 376K expert annotations.
arXiv Detail & Related papers (2025-03-27T16:36:39Z)
- Evaluation of Multilingual Image Captioning: How far can we get with CLIP models? [3.902360015414256]
This work presents several strategies and extensive experiments for evaluating CLIPScore variants in multilingual settings. Tests with machine-translated data show that multilingual CLIPScore models can maintain a high correlation with human judgements across different languages.
arXiv Detail & Related papers (2025-02-10T16:00:00Z)
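For context, reference-free CLIPScore (Hessel et al., 2021) scores a caption c against an image I as CLIPScore(I, c) = max(100 · cos(E_I, E_c), 0), where E_I and E_c are CLIP embeddings. A minimal multilingual variant can pair a CLIP image tower with a multilingual text tower; the checkpoints below are illustrative choices, not necessarily those evaluated in the paper.

```python
# Hedged sketch of a multilingual CLIPScore computation.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

image_model = SentenceTransformer("clip-ViT-B-32")  # CLIP image tower
text_model = SentenceTransformer(
    "sentence-transformers/clip-ViT-B-32-multilingual-v1"  # multilingual text tower
)

def clipscore(image_path: str, caption: str) -> float:
    """CLIPScore(I, c) = max(100 * cos(E_I, E_c), 0)."""
    e_i = image_model.encode(Image.open(image_path))
    e_c = text_model.encode(caption)
    return max(100.0 * util.cos_sim(e_i, e_c).item(), 0.0)
```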
- TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset.
It contains 39,153 text-rich images, captions, and 102,437 questions.
We introduce a Language-vision Reading Assistant (LaRA), which is good at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z)
- Parrot: Multilingual Visual Instruction Tuning [66.65963606552839]
Existing methods typically align vision encoders with Multimodal Large Language Models (MLLMs) via supervised fine-tuning (SFT). We propose PARROT, a novel approach that leverages textual guidance for visual token alignment at the language level. We introduce the Massive Multilingual Multimodal Benchmark (MMMB), a new benchmark comprising 6 languages, 15 categories, and 12,000 questions.
arXiv Detail & Related papers (2024-06-04T17:56:28Z)
- Align before Attend: Aligning Visual and Textual Features for Multimodal Hateful Content Detection [4.997673761305336]
This paper proposes a context-aware attention framework for multimodal hateful content detection.
We evaluate the proposed approach on two benchmark hateful meme datasets, viz. MUTE (Bengali code-mixed) and MultiOFF (English).
arXiv Detail & Related papers (2024-02-15T06:34:15Z)
- AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
- GupShup: An Annotated Corpus for Abstractive Summarization of Open-Domain Code-Switched Conversations [28.693328393260906]
We introduce abstractive summarization of Hindi-English code-switched conversations and develop the first code-switched conversation summarization dataset.
GupShup contains 6,831 conversations in Hindi-English and their corresponding human-annotated summaries in English and Hindi-English.
We train state-of-the-art abstractive summarization models and report their performances using both automated metrics and human evaluation.
arXiv Detail & Related papers (2021-04-17T15:42:01Z)
- A Multi-Perspective Architecture for Semantic Code Search [58.73778219645548]
We propose a novel multi-perspective cross-lingual neural framework for code-text matching.
Our experiments on the CoNaLa dataset show that our proposed model yields better performance than previous approaches.
arXiv Detail & Related papers (2020-05-06T04:46:11Z)
- Knowledge Distillation for Multilingual Unsupervised Neural Machine Translation [61.88012735215636]
Unsupervised neural machine translation (UNMT) has recently achieved remarkable results for several language pairs.
However, UNMT can only translate between a single language pair and cannot produce translation results for multiple language pairs at the same time.
In this paper, we empirically introduce a simple method to translate between thirteen languages using a single encoder and a single decoder.
arXiv Detail & Related papers (2020-04-21T17:26:16Z)
- Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
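The count of 66 follows directly from pairing the 12 aligned languages: 12 × 11 / 2 = 66 unordered language pairs, one cross-lingual dataset per pair.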
arXiv Detail & Related papers (2020-03-10T17:17:01Z)