Transformer-Driven Triple Fusion Framework for Enhanced Multimodal Author Intent Classification in Low-Resource Bangla
- URL: http://arxiv.org/abs/2511.23287v1
- Date: Fri, 28 Nov 2025 15:44:42 GMT
- Title: Transformer-Driven Triple Fusion Framework for Enhanced Multimodal Author Intent Classification in Low-Resource Bangla
- Authors: Ariful Islam, Tanvir Mahmud, Md Rifat Hossen
- Abstract summary: Author intent understanding plays a crucial role in interpreting social media content. This paper addresses author intent classification in Bangla social media posts by leveraging both textual and visual data. We introduce a novel intermediate fusion strategy that significantly outperforms early and late fusion on this task.
- Score: 5.518378568494161
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The expansion of the Internet and social networks has led to an explosion of user-generated content. Author intent understanding plays a crucial role in interpreting social media content. This paper addresses author intent classification in Bangla social media posts by leveraging both textual and visual data. Recognizing the limitations of previous unimodal approaches, we systematically benchmark transformer-based language models (mBERT, DistilBERT, XLM-RoBERTa) and vision architectures (ViT, Swin, SwiftFormer, ResNet, DenseNet, MobileNet) on the Uddessho dataset of 3,048 posts spanning six practical intent categories. We introduce a novel intermediate fusion strategy that significantly outperforms early and late fusion on this task. Experimental results show that intermediate fusion, particularly the combination of mBERT and the Swin Transformer, achieves an 84.11% macro-F1 score, establishing a new state of the art with an 8.4 percentage-point improvement over prior Bangla multimodal approaches. Our analysis demonstrates that integrating visual context substantially enhances intent classification: cross-modal feature integration at an intermediate level provides the best balance between modality-specific representation and cross-modal learning. This research establishes new benchmarks and methodological standards for Bangla and other low-resource languages. We call our proposed framework BangACMM (Bangla Author Content MultiModal).
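The core architectural idea is to encode each modality with its own pretrained model and join the features at an intermediate layer, rather than at the raw inputs (early fusion) or at the logits (late fusion). Below is a minimal PyTorch sketch of this pattern pairing mBERT with a Swin Transformer; it is not the authors' released implementation, and the checkpoint names, projection width, and fusion head are illustrative assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class IntermediateFusionClassifier(nn.Module):
    """Sketch of intermediate fusion: features are joined mid-pipeline,
    not at the inputs (early fusion) or at the logits (late fusion)."""

    def __init__(self, num_classes: int = 6, fused_dim: int = 512):
        super().__init__()
        # Pretrained encoders; these checkpoints are illustrative stand-ins.
        self.text_encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")
        self.image_encoder = AutoModel.from_pretrained(
            "microsoft/swin-base-patch4-window7-224")
        text_dim = self.text_encoder.config.hidden_size    # 768 for mBERT
        image_dim = self.image_encoder.config.hidden_size  # 1024 for Swin-Base
        # Per-modality projections preserve modality-specific representations;
        # the joint head then learns cross-modal interactions.
        self.text_proj = nn.Linear(text_dim, fused_dim)
        self.image_proj = nn.Linear(image_dim, fused_dim)
        self.fusion = nn.Sequential(
            nn.Linear(2 * fused_dim, fused_dim), nn.GELU(), nn.Dropout(0.1))
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, input_ids, attention_mask, pixel_values):
        # [CLS] token embedding as the text representation.
        t = self.text_encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state[:, 0]
        # Pooled Swin features as the image representation.
        v = self.image_encoder(pixel_values=pixel_values).pooler_output
        fused = self.fusion(torch.cat([self.text_proj(t), self.image_proj(v)], dim=-1))
        return self.classifier(fused)  # logits over the six intent categories
```

In use, batched `input_ids`/`attention_mask` would come from the mBERT tokenizer and `pixel_values` from the Swin image processor, trained with standard cross-entropy; the reported metric corresponds to `sklearn.metrics.f1_score(labels, preds, average="macro")`.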
Related papers
- FysicsWorld: A Unified Full-Modality Benchmark for Any-to-Any Understanding, Generation, and Reasoning [52.88164697048371]
We introduce FysicsWorld, the first unified full-modality benchmark that supports bidirectional input-output across image, video, audio, and text. FysicsWorld encompasses 16 primary tasks and 3,268 curated samples, aggregated from over 40 high-quality sources.
arXiv Detail & Related papers (2025-12-14T16:41:29Z) - RAG-IGBench: Innovative Evaluation for RAG-based Interleaved Generation in Open-domain Question Answering [50.42577862494645]
We present RAG-IGBench, a benchmark designed to evaluate the task of Interleaved Generation based on Retrieval-Augmented Generation (RAG-IG) in open-domain question answering. RAG-IG integrates multimodal large language models (MLLMs) with retrieval mechanisms, enabling the models to access external image-text information for generating coherent multimodal content.
arXiv Detail & Related papers (2025-10-11T03:06:39Z) - Two Stage Context Learning with Large Language Models for Multimodal Stance Detection on Climate Change [3.563409707133756]
We propose a multimodal stance detection framework that integrates textual and visual information through a hierarchical fusion approach. Our method first employs a Large Language Model to retrieve stance-relevant summaries from source text, while a domain-aware image caption generator interprets visual content in the context of the target topic. We evaluate our approach on the MultiClimate dataset, a benchmark for climate change-related stance detection containing aligned video frames and transcripts.
arXiv Detail & Related papers (2025-09-09T10:22:10Z) - Cross-Modal Prototype Augmentation and Dual-Grained Prompt Learning for Social Media Popularity Prediction [16.452218354378452]
Social Media Popularity Prediction is a complex task that requires effective integration of images, text, and structured information. We propose a feature-enhanced framework integrating dual-grained prompt learning and cross-modal attention mechanisms. We introduce hierarchical prototypes for structural enhancement and contrastive learning for improved vision-text alignment.
arXiv Detail & Related papers (2025-08-22T07:16:47Z) - HyperFusion: Hierarchical Multimodal Ensemble Learning for Social Media Popularity Prediction [16.78634288864967]
Social media popularity prediction plays a crucial role in content optimization, marketing strategies, and user engagement enhancement across digital platforms. This paper presents HyperFusion, a hierarchical multimodal ensemble learning framework for social media popularity prediction.
arXiv Detail & Related papers (2025-07-01T16:31:50Z) - MMaDA: Multimodal Large Diffusion Language Models [61.13527224215318]
We introduce MMaDA, a novel class of multimodal diffusion foundation models. It is designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation.
arXiv Detail & Related papers (2025-05-21T17:59:05Z) - TriMod Fusion for Multimodal Named Entity Recognition in Social Media [0.0]
We propose a novel approach that integrates textual, visual, and hashtag features (TriMod) for effective modality fusion. We demonstrate the superiority of our approach over existing state-of-the-art methods, achieving significant improvements in precision, recall, and F1 score.
arXiv Detail & Related papers (2025-01-14T17:29:41Z) - OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation [59.53678957969471]
Multimodal Large Language Models (MLLMs) have made significant strides in visual understanding and generation tasks, yet generating interleaved image-text content remains a challenge. OpenING is a benchmark comprising 5,400 high-quality human-annotated instances across 56 real-world tasks. IntJudge is a judge model for evaluating open-ended multimodal generation methods.
arXiv Detail & Related papers (2024-11-27T16:39:04Z) - Multimodality Helps Few-shot 3D Point Cloud Semantic Segmentation [61.91492500828508]
Few-shot 3D point cloud segmentation (FS-PCS) aims at generalizing models to segment novel categories with minimal support samples. We introduce a multimodal FS-PCS setup, utilizing textual labels and the potentially available 2D image modality. We propose a simple yet effective Test-time Adaptive Cross-modal (TACC) technique to mitigate training bias.
arXiv Detail & Related papers (2024-10-29T19:28:41Z) - Uddessho: An Extensive Benchmark Dataset for Multimodal Author Intent Classification in Low-Resource Bangla Language [0.0]
This paper introduces an innovative approach for intent classification in the Bangla language, focusing on social media posts.
The proposed method leverages multimodal data with particular emphasis on authorship identification.
To the best of our knowledge, this is the first research work on multimodal author intent classification for low-resource Bangla social media posts.
arXiv Detail & Related papers (2024-09-14T18:37:27Z) - Leveraging Entity Information for Cross-Modality Correlation Learning: The Entity-Guided Multimodal Summarization [49.08348604716746]
Multimodal Summarization with Multimodal Output (MSMO) aims to produce a multimodal summary that integrates both text and relevant images.
In this paper, we propose an Entity-Guided Multimodal Summarization model (EGMS)
Our model, building on BART, utilizes dual multimodal encoders with shared weights to process text-image and entity-image information concurrently.
arXiv Detail & Related papers (2024-08-06T12:45:56Z) - Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
Multimodal entity linking task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z)