Uddessho: An Extensive Benchmark Dataset for Multimodal Author Intent Classification in Low-Resource Bangla Language
- URL: http://arxiv.org/abs/2409.09504v1
- Date: Sat, 14 Sep 2024 18:37:27 GMT
- Title: Uddessho: An Extensive Benchmark Dataset for Multimodal Author Intent Classification in Low-Resource Bangla Language
- Authors: Fatema Tuj Johora Faria, Mukaffi Bin Moin, Md. Mahfuzur Rahman, Md Morshed Alam Shanto, Asif Iftekher Fahim, Md. Moinul Hoque
- Abstract summary: This paper introduces an innovative approach for intent classification in Bangla language, focusing on social media posts.
The proposed method leverages multimodal data with particular emphasis on authorship identification.
To the best of our knowledge, this is the first research work on multimodal author intent classification for low-resource Bangla social media posts.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As daily information sharing and acquisition on the Internet grow increasingly popular, this paper introduces an innovative approach to intent classification in the Bangla language, focusing on social media posts where individuals share their thoughts and opinions. The proposed method leverages multimodal data with particular emphasis on authorship identification, aiming to understand the underlying purpose behind textual content, especially in the context of varied user-generated posts on social media. Current methods often face challenges in low-resource languages like Bangla, particularly when author traits are intricately linked with intent, as observed in social media posts. To address this, we present the Multimodal-based Author Bangla Intent Classification (MABIC) framework, which utilizes text and images to gain deeper insight into the conveyed intentions. We have created a dataset named "Uddessho," comprising 3,048 instances sourced from social media. Our methodology comprises two approaches, one for textual intent classification and one for multimodal author intent classification, incorporating early fusion and late fusion techniques. In our experiments, the unimodal approach achieved an accuracy of 64.53% in interpreting Bangla textual intent. In contrast, our multimodal approach significantly outperformed traditional unimodal methods, achieving an accuracy of 76.19%, an improvement of 11.66 percentage points. To the best of our knowledge, this is the first research work on multimodal author intent classification for low-resource Bangla social media posts.
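The abstract names early fusion and late fusion as MABIC's two multimodal strategies but does not spell out the architecture here. The sketch below is a minimal PyTorch illustration of the generic pattern behind each strategy, assuming pre-extracted text and image features; the feature dimensions, hidden size, and six-way intent label set are placeholders rather than details taken from the paper.

```python
import torch
import torch.nn as nn

class EarlyFusionClassifier(nn.Module):
    """Early fusion: concatenate text and image features, then classify jointly."""
    def __init__(self, text_dim, image_dim, num_intents, hidden_dim=256):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + image_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_intents),
        )

    def forward(self, text_feat, image_feat):
        fused = torch.cat([text_feat, image_feat], dim=-1)  # fuse before any decision
        return self.classifier(fused)

class LateFusionClassifier(nn.Module):
    """Late fusion: classify each modality separately, then average the logits."""
    def __init__(self, text_dim, image_dim, num_intents):
        super().__init__()
        self.text_head = nn.Linear(text_dim, num_intents)
        self.image_head = nn.Linear(image_dim, num_intents)

    def forward(self, text_feat, image_feat):
        return 0.5 * (self.text_head(text_feat) + self.image_head(image_feat))

if __name__ == "__main__":
    # Dummy batch: 4 posts, 768-dim text features, 2048-dim image features, 6 intent classes (all illustrative).
    text_feat, image_feat = torch.randn(4, 768), torch.randn(4, 2048)
    early = EarlyFusionClassifier(768, 2048, 6)
    late = LateFusionClassifier(768, 2048, 6)
    print(early(text_feat, image_feat).shape, late(text_feat, image_feat).shape)  # both (4, 6)
```

The general trade-off the sketch captures: early fusion lets the classifier model cross-modal interactions directly, while late fusion keeps the modalities independent until their predictions are combined.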
Related papers
- Transformer-Driven Triple Fusion Framework for Enhanced Multimodal Author Intent Classification in Low-Resource Bangla [5.518378568494161]
Author intent understanding plays a crucial role in interpreting social media content.
This paper addresses author intent classification in Bangla social media posts by leveraging both textual and visual data.
We introduce a novel intermediate fusion strategy that significantly outperforms early and late fusion on this task (a generic sketch of intermediate fusion appears after this list).
arXiv Detail & Related papers (2025-11-28T15:44:42Z) - MultiBanAbs: A Comprehensive Multi-Domain Bangla Abstractive Text Summarization Dataset [0.0]
In today's digital era, a massive amount of Bangla content is continuously produced across blogs, newspapers, and social media.
This creates a pressing need for summarization systems that can reduce information overload and help readers understand content more quickly.
This study developed a new Bangla abstractive summarization dataset to generate concise summaries of Bangla articles from diverse sources.
arXiv Detail & Related papers (2025-11-24T17:11:49Z) - Towards Explainable Bilingual Multimodal Misinformation Detection and Localization [64.37162720126194]
BiMi is a framework that jointly performs region-level localization, cross-modal and cross-lingual consistency detection, and natural language explanation for misinformation analysis.
BiMiBench is a benchmark constructed by systematically editing real news images and subtitles.
BiMi outperforms strong baselines by up to +8.9 in classification accuracy, +15.9 in localization accuracy, and +2.5 in explanation BERTScore.
arXiv Detail & Related papers (2025-06-28T15:43:06Z) - Contrastive Learning-based Multi Modal Architecture for Emoticon Prediction by Employing Image-Text Pairs [13.922091192207718]
This research aims to analyze the relationship among sentences, visuals, and emoticons.
We have proposed a novel contrastive learning-based multimodal architecture.
The proposed model attained an accuracy of 91% and an MCC score of 90% on emoticon prediction.
arXiv Detail & Related papers (2024-08-05T15:45:59Z) - Multi-modal Crowd Counting via a Broker Modality [64.5356816448361]
Multi-modal crowd counting involves estimating crowd density from both visual and thermal/depth images.
We propose a novel approach by introducing an auxiliary broker modality and frame the task as a triple-modal learning problem.
We devise a fusion-based method to generate this broker modality, leveraging a non-diffusion, lightweight counterpart of modern denoising diffusion-based fusion models.
arXiv Detail & Related papers (2024-07-10T10:13:11Z) - TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset.
It contains 39,153 text-rich images, captions, and 102,437 questions.
We introduce a Language-vision Reading Assistant (LaRA) which is good at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z) - Multi-source Semantic Graph-based Multimodal Sarcasm Explanation Generation [53.97962603641629]
We propose a novel mulTi-source sEmantic grAph-based Multimodal sarcasm explanation scheme, named TEAM.
TEAM extracts object-level semantic metadata from the input image instead of traditional global visual features.
TEAM introduces a multi-source semantic graph that comprehensively characterizes the multi-source semantic relations.
arXiv Detail & Related papers (2023-06-29T03:26:10Z) - Borrowing Human Senses: Comment-Aware Self-Training for Social Media Multimodal Classification [5.960550152906609]
We capture hinting features from user comments, which are retrieved via jointly leveraging visual and lingual similarity.
The classification tasks are explored via self-training in a teacher-student framework, motivated by the usually limited labeled data scales.
The results show that our method further advances the performance of previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-27T08:59:55Z) - Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z) - MIntRec: A New Dataset for Multimodal Intent Recognition [18.45381778273715]
Multimodal intent recognition is a significant task for understanding human language in real-world multimodal scenes.
This paper introduces a novel dataset for multimodal intent recognition (MIntRec) to address this issue.
It formulates coarse-grained and fine-grained intent based on the data collected from the TV series Superstore.
arXiv Detail & Related papers (2022-09-09T15:37:39Z) - Open-Vocabulary Multi-Label Classification via Multi-modal Knowledge Transfer [55.885555581039895]
Multi-label zero-shot learning (ML-ZSL) focuses on transferring knowledge via a pre-trained textual label embedding.
We propose a novel open-vocabulary framework, named multimodal knowledge transfer (MKT) for multi-label classification.
arXiv Detail & Related papers (2022-07-05T08:32:18Z) - MCSE: Multimodal Contrastive Learning of Sentence Embeddings [23.630041603311923]
We propose a sentence embedding learning approach that exploits both visual and textual information via a multimodal contrastive objective.
We show that our approach consistently improves the performance across various datasets and pre-trained encoders.
arXiv Detail & Related papers (2022-04-22T21:19:24Z) - Visual Persuasion in COVID-19 Social Media Content: A Multi-Modal Characterization [30.710295617831015]
This work proposes a computational approach to analyze the outcome of persuasive information in multi-modal content.
It focuses on two aspects, popularity and reliability, in COVID-19-related news articles shared on Twitter.
arXiv Detail & Related papers (2021-12-05T02:15:01Z) - Cross-Media Keyphrase Prediction: A Unified Framework with Multi-Modality Multi-Head Attention and Image Wordings [63.79979145520512]
We explore the joint effects of texts and images in predicting the keyphrases for a multimedia post.
We propose a novel Multi-Modality Multi-Head Attention (M3H-Att) to capture the intricate cross-media interactions.
Our model significantly outperforms the previous state of the art based on traditional attention networks.
arXiv Detail & Related papers (2020-11-03T08:44:18Z)
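Several entries above, notably the Transformer-Driven Triple Fusion Framework, contrast intermediate fusion with the early and late fusion used by MABIC. As a companion to the sketch following the abstract, the snippet below illustrates one common realization of intermediate fusion, cross-attention from text tokens to image patches inside the network. It is a generic, hypothetical example with placeholder dimensions, not the architecture of any paper listed here.

```python
import torch
import torch.nn as nn

class IntermediateFusionClassifier(nn.Module):
    """Intermediate fusion: mix modalities inside the network via cross-attention, then classify."""
    def __init__(self, dim=256, num_heads=4, num_intents=6):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_intents)

    def forward(self, text_tokens, image_patches):
        # text_tokens: (batch, text_len, dim); image_patches: (batch, num_patches, dim)
        attended, _ = self.cross_attn(query=text_tokens, key=image_patches, value=image_patches)
        fused = self.norm(text_tokens + attended)   # residual connection over the text tokens
        return self.head(fused.mean(dim=1))         # pool over tokens, then classify

if __name__ == "__main__":
    text_tokens = torch.randn(4, 32, 256)    # e.g. 32 token embeddings per post (illustrative)
    image_patches = torch.randn(4, 49, 256)  # e.g. a 7x7 patch grid from an image encoder (illustrative)
    logits = IntermediateFusionClassifier()(text_tokens, image_patches)
    print(logits.shape)  # (4, 6)
```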